Sand.ai secures over $100 million in funding and plans to launch an open-source MoE video model in July 2026.

ME AI News: According to monitoring by Beating, Sand.ai, a company building large video generation models (founded in January 2024), has announced the completion of two funding rounds totaling over $100 million. Investors include Look Capital, Lollapalooza Capital (Wang Huiwen’s family office), Jiukun Venture Capital, Matrix Partners China, MSA Capital, Sinovation Ventures, Source Code Capital, IDG, Baidu Ventures, and other leading institutions. Starhan Capital served as the financial advisor for this round.

Sand.ai’s founder, Cao Yue, stated in an interview that the team has consistently pursued the non-consensus autoregressive approach to video generation rather than the mainstream diffusion route. Its previously released Magi-1 model remains ranked first on Google DeepMind’s Physics-IQ physical realism benchmark. To break through the “cost, speed, quality” trilemma in video generation, Sand.ai shifted last year to exploring the MoE (Mixture of Experts) architecture. It plans to release a next-generation MoE-based video generation model in Q3 2026 (July) that combines efficient inference with the largest parameter scale currently available among open-source models, and it will open-source the model.

On the commercialization front, Sand.ai follows a dual-driver strategy of models and products. Its music agent product, VidMuse, launched in January this year and reached $10 million in annual recurring revenue (ARR) within just two months. Additionally, its open-source MagiAttention operator library is now used by nearly all multimodal model teams in China and has received official endorsement from NVIDIA.

Regarding the industry’s heated discussion of the “world model” concept, Cao Yue believes the field is still in a pre-GPT era (before GPT-1), with neither data nor approaches having converged.
He emphasized that video is the most critical data modality on the path to world models and argued that models should learn physical laws autonomously by predicting raw video observations (pixels/frames), rather than relying on human priors to explicitly model state variables. (Source: BlockBeats)