Video generation models have recently showcased impressive results, generating high-quality video with realistic physics and motion. Such video generators are intriguing for robotics because, after fine-tuning to the robotic embodiment, they have the potential to serve as generalizable world models and real-world simulators. Among video generation approaches, masked video transformers provide a computationally efficient alternative to diffusion-based methods. Building on recent successes of Mixture of Experts (MoE) in transformer architectures, we propose a novel approach to improving pre-trained robotic video transformers using sparsely gated MoE. Our method replaces the feedforward layers of the transformer block with sparsely gated MoE layers. We also introduce a weight initialization scheme that improves training convergence when fine-tuning masked video transformers. We evaluate our method on the 1xgpt humanoid robotics dataset, demonstrating improvements in both cross-entropy loss (0.07 reduction) and LPIPS score (0.007 reduction). Our findings suggest that MoE-based fine-tuning with strategic weight initialization can enhance the performance of robotic video transformers while maintaining computational efficiency through sparse expert activation.
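To make the architectural change concrete, the sketch below shows one plausible reading of "replacing the feedforward layers with sparsely gated MoE layers": a top-k routed MoE whose experts are initialized from the pretrained dense FFN. The paper's exact initialization scheme, expert count, routing depth (`top_k`), and the noisy-copy initialization shown here are all assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a sparsely gated MoE layer swapped in for a transformer FFN.
# Assumption: experts are initialized as noisy copies of the pretrained FFN so
# that fine-tuning starts near the dense model's behavior.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    def __init__(self, pretrained_ffn: nn.Sequential, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        d_model = pretrained_ffn[0].in_features  # assumes Sequential(Linear, act, Linear)
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        # Hypothetical init scheme: copy the pretrained dense FFN into every expert,
        # then add small noise to break symmetry between experts.
        self.experts = nn.ModuleList(copy.deepcopy(pretrained_ffn) for _ in range(num_experts))
        for expert in self.experts:
            for p in expert.parameters():
                p.data.add_(0.01 * torch.randn_like(p))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])             # (N, d_model)
        logits = self.gate(tokens)                      # (N, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # route each token to its top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                           # (N, top_k): which slots picked expert e
            token_ids = mask.any(dim=-1).nonzero(as_tuple=True)[0]
            if token_ids.numel() == 0:
                continue  # sparse activation: unused experts cost nothing
            gate_w = (weights * mask.float()).sum(dim=-1)[token_ids]
            out[token_ids] += gate_w.unsqueeze(-1) * expert(tokens[token_ids])
        return out.reshape_as(x)


# Usage: replace a block's dense FFN with the MoE layer (dimensions are illustrative).
ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
moe = SparseMoE(ffn, num_experts=8, top_k=2)
y = moe(torch.randn(2, 16, 512))  # (batch, tokens, d_model)
```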