Meta Proposes MobileMoE, Achieves 3.8x Speedup on iPhone 16 Pro

MarsBit

Meta has unveiled MobileMoE, a Mixture-of-Experts (MoE) model family reported to achieve efficient MoE inference on commercial smartphones for the first time. On the iPhone 16 Pro, MobileMoE-S ran up to 3.8x faster during the input phase. The models matched or outperformed dense baselines in accuracy while using less inference computation, establishing a new Pareto frontier between accuracy and cost for edge-side LLMs.

In recent years, Mixture-of-Experts (MoE) architectures have been widely adopted for large models in the cloud. On mobile devices, however, large language models (LLMs) still rely primarily on dense architectures. Tight constraints on memory, compute, and latency have meant that edge-side MoE models in the sub-billion active parameter range were never studied systematically. Now that mobile devices ship with more DRAM, MoE models are becoming viable for deployment on smartphones.

The Meta team's MobileMoE achieves efficient MoE inference on commercial smartphones for the first time. Results show that, at comparable memory usage, MobileMoE-S/M matches or exceeds the average accuracy of dense baselines while using only 1/2 to 1/4 of the inference computation. In on-device tests, MobileMoE-S shows its largest speedup on the iPhone 16 Pro's GPU/MLX backend, where the input-processing stage runs up to 3.8x faster.

Paper link: https://arxiv.org/abs/2605.27358

The research team also proposed a set of edge-side MoE scaling laws to identify model architectures better suited for mobile deployment. MobileMoE establishes a new Pareto frontier for edge large language models, achieving superior trade-offs between accuracy and inference computational cost.

Figure | MobileMoE establishes a new Pareto frontier for on-device large language models.
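To make the term concrete: a model is on the Pareto frontier if no other model is both cheaper to run and at least as accurate. The sketch below computes such a frontier for a handful of points. The model names and numbers are invented for illustration and are not results from the paper.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    cost: float      # inference compute, lower is better (arbitrary units)
    accuracy: float  # average benchmark score, higher is better


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Return the models that no other model dominates, ordered by cost."""
    frontier = []
    best_accuracy = float("-inf")
    # Visit cheapest models first; on equal cost, the more accurate one first.
    for m in sorted(models, key=lambda m: (m.cost, -m.accuracy)):
        # A model survives only if it beats every cheaper (or equal-cost) model.
        if m.accuracy > best_accuracy:
            frontier.append(m)
            best_accuracy = m.accuracy
    return frontier


# Made-up points for illustration only.
candidates = [
    Model("dense-a", 1.0, 50.0),
    Model("dense-b", 2.0, 55.0),
    Model("moe-a", 0.5, 52.0),
    Model("moe-b", 1.0, 57.0),
]
frontier = pareto_frontier(candidates)
print([m.name for m in frontier])
```

In this toy data both dense models are dominated, so only the two hypothetical MoE points remain on the frontier.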

How is MobileMoE designed?

MobileMoE can be understood as an MoE language model designed specifically for edge deployment. It retains the overall decoder-only Transformer architecture but replaces the dense feed-forward layers with MoE layers. For each token, the router selects only the few top-scoring experts to participate in computation, while a shared expert always participates. Training proceeds in four stages: pre-training, mid-training, supervised fine-tuning, and quantization-aware training.
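The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the tiny ReLU feed-forward expert, the softmax router, the renormalization of the selected gate weights, and all dimensions are assumptions. The 60 routed experts and top-4 selection are the counts the article reports for the adopted configuration.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def expert_ffn(params, x):
    """Tiny two-layer ReLU feed-forward block standing in for a real expert."""
    W_in, W_out = params
    hidden = [max(0.0, h) for h in matvec(W_in, x)]
    return matvec(W_out, hidden)


def moe_layer(x, router_W, routed, shared, top_k):
    """Pass one token through top-k routed experts plus an always-on shared expert."""
    scores = softmax(matvec(router_W, x))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    norm = sum(scores[i] for i in top)  # assumption: renormalize the selected gates
    out = expert_ffn(shared, x)         # the shared expert sees every token
    for i in top:
        gate = scores[i] / norm
        y = expert_ffn(routed[i], x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_hidden = 8, 4   # hypothetical toy sizes
n_routed, top_k = 60, 4    # counts reported in the article
router_W = rand_matrix(n_routed, d_model, rng)
routed = [(rand_matrix(d_hidden, d_model, rng), rand_matrix(d_model, d_hidden, rng))
          for _ in range(n_routed)]
shared = (rand_matrix(d_hidden, d_model, rng), rand_matrix(d_model, d_hidden, rng))
token = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]

output, chosen = moe_layer(token, router_W, routed, shared, top_k)
print(len(output), sorted(chosen))
```

Only the 4 selected experts and the shared expert run for a given token, which is where the savings in inference computation come from; the remaining experts still occupy memory.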

Pre-training: The research team pre-trained the model on approximately 6 trillion tokens of open-licensed data with a context length of 2048, primarily consisting of web content while also covering mathematics, code, knowledge, and science.

Mid-training: The research team extended the context length to 8,192 and further increased the proportion of high-quality knowledge, code, mathematics, and science data, at a scale of approximately 500B tokens.

Supervised Fine-Tuning (SFT): The research team fine-tuned MobileMoE-Base on an open-licensed instruction fine-tuning dataset comprising over 80 million samples.

Quantization-Aware Training: The research team quantized linear layers and embeddings to INT4, dynamically quantized activations to INT8, and kept the router in FP32.

Figure | The four-stage training of MobileMoE.
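For readers unfamiliar with the quantization-aware training stage, the following sketch shows the "fake quantization" idea that such training commonly relies on: values are rounded to a low-bit integer grid and immediately mapped back to floats, so the forward pass sees the rounding error. Symmetric per-tensor scaling and the helper names are assumptions made for illustration; the article does not specify the exact scheme.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integer codes using one shared scale (symmetric scheme)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4, 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def fake_quant(values, bits):
    """Quantize then dequantize, so downstream math sees the rounding error."""
    codes, scale = quantize_symmetric(values, bits)
    return dequantize(codes, scale)


# Hypothetical weights and activations for a single dot product.
weights = [0.70, -0.35, 0.10, 0.02]
activations = [1.27, -0.50, 0.25, 0.0]

codes, scale = quantize_symmetric(weights, bits=4)  # static INT4 weights
w_q = dequantize(codes, scale)
a_q = fake_quant(activations, bits=8)               # INT8, scale computed per input

exact = sum(w * a for w, a in zip(weights, activations))
approx = sum(w * a for w, a in zip(w_q, a_q))
print(codes, exact, approx)
```

"Dynamic" here means the activation scale is recomputed from each input at run time, whereas the weight scale is fixed after training. The router is left out of this sketch because the article says it stays in FP32.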

Experimental results

Ablation study results

The research team first compared three architectural variables: the number of experts E, the expert granularity g, and whether shared experts were included.

Figure | Scaling of the number of experts E.

Under a fixed memory budget, the MoE loss falls below that of the corresponding dense model once memory exceeds approximately 0.25 GB. Increasing the number of experts E reduces the loss further, but the marginal gains diminish markedly once E reaches 8. Experiments with expert granularity g indicate that finer-grained experts generally perform better, with g=8 striking the best balance between performance and training cost: going from g=8 to g=16 improves the loss by less than 0.01 while increasing training time by approximately 50%. Under the same computational budget, adding shared experts reduces the model loss further.

Based on the ablation results, the research team adopted a configuration with E=8, g=8, and shared experts (that is, 60 fine-grained routed experts, Top-4 routing, and one shared expert) and applied this architecture to all three versions: MobileMoE-S, MobileMoE-M, and MobileMoE-L.
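A rough way to see what this configuration implies is to count how many expert parameters a single token touches in one MoE layer versus how many must be stored. The sketch below assumes every expert, including the shared one, has the same size, and it ignores the router, attention, and embeddings. The per-expert parameter count is a hypothetical placeholder, not a figure from the paper.

```python
def moe_ffn_params(params_per_expert, n_routed, n_shared, top_k):
    """Return (stored, active-per-token) expert parameters for one MoE layer."""
    total = (n_routed + n_shared) * params_per_expert
    active = (top_k + n_shared) * params_per_expert
    return total, active


# 60 routed experts, Top-4 routing, one shared expert, as reported above.
total, active = moe_ffn_params(
    params_per_expert=1_000_000,  # hypothetical size
    n_routed=60,
    n_shared=1,
    top_k=4,
)
print(total, active, round(active / total, 3))
```

Under these assumptions a token uses 5 of the 61 experts in each MoE layer, which illustrates why active parameters, and therefore compute, are far smaller than the total parameters held in memory.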

Figure | Scaling MoE models under optimal conditions.

Figure | Training efficiency of the MoE architecture.

14 foundational evaluations: Establishing a new edge-side Pareto frontier

The research team re-evaluated MobileMoE alongside models such as Gemma 3, SmolLM2, Qwen3.5, OLMo 2, and OLMoE-1B-7B under a unified setup across 14 foundational benchmarks spanning five categories: common sense reasoning, knowledge, science, reading, and reasoning.

Figure | Pretraining trajectory of MobileMoE.

Compared with the baseline models, MobileMoE-M achieves a higher average score than Qwen3.5 2B, and MobileMoE-L achieves a higher average score than OLMoE-1B-7B while being smaller. The research team also notes that the base version of MobileMoE-L already outperforms the instruct version of OLMoE-1B-7B on average score. In terms of training scale, MobileMoE uses approximately 6T pre-training tokens, fewer than Llama 3.2 1B's 9T and SmolLM2 1.7B's 11T. Among instruction-tuned models, MobileMoE-M's average accuracy is close to that of OLMoE-1B-7B, while its active and total parameter counts are approximately 60% lower.

Figure | MobileMoE-Base Model Comparison.

Advanced evaluation: Advantages are more pronounced in code and mathematical tasks

In advanced evaluations after instruction fine-tuning, MobileMoE performs more strongly on code and math tasks. For example, MobileMoE-L achieves higher average scores than both Qwen3.5 2B and OLMoE-1B-7B in the code and math evaluations. However, the research team notes that Qwen3.5 2B remains ahead in instruction following and knowledge reasoning.

Figure | Instruct model comparison on advanced benchmarks.

Quantization and edge deployment: Competitive even after INT4 quantization, with noticeable speedup on mobile devices

After quantization, the overall average scores of MobileMoE-S/M/L drop only slightly relative to their BF16 versions, by approximately 2 to 3 points. Even so, the INT4 version of MobileMoE-L still outperforms the BF16 version of OLMoE-1B-7B Instruct.

The research team also deployed MobileMoE on the Samsung Galaxy S25 and iPhone 16 Pro for testing. Results showed that, under comparable INT4 weight memory conditions, MobileMoE-S achieved 1.8–3.8x faster input processing and 2.2–3.4x faster token-by-token generation compared to MobileLLM-Pro.

In terms of memory usage, on the Samsung Galaxy S25 with an 8K context and real prompts, MobileMoE-S reaches a peak RSS of 1.49 GB, lower than MobileLLM-Pro's 1.91 GB.

Figure | Edge-side runtime latency.

Shortcomings and Future Directions

Currently, the instruction-tuned MobileMoE still lags behind Qwen3.5 2B in advanced instruction following, knowledge, and reasoning. The research team believes this gap may stem from the more refined post-training of the competing model. Closing it will require stronger distillation and reasoning-oriented post-training, with multimodal expansion named as a further direction.

In addition, the research team noted that the memory usage of MoE on mobile devices varies with input content. Real-world inputs typically consume more memory than fixed template inputs, so testing on templated inputs alone may underestimate the memory pressure of actual deployments. Accurately assessing the memory behavior of on-device MoE will require further evaluation on additional real-world test data.

Meanwhile, the research team has completed systematic on-device testing on CPU and GPU backends, but the NPU pathway remains to be explored. Going forward, dynamic routing, expert pruning, mixed-precision quantization, and NPU deployment on mobile devices will be key directions for further improving on-device efficiency.

For more technical details, please refer to the original paper.

This article is from the WeChat public account "Academic Headline" (ID: SciTouTiao), authored by Xia Qiansi.
