NVIDIA Open-Sources MoE Optimization Tool, Speeds Up Fine-Tuning by 3.7x

A single added import speeds up fine-tuning of large MoE models by up to 3.7x.

NVIDIA has open-sourced NeMo AutoModel, a library designed specifically for building and fine-tuning large-scale generative AI models.

Built on Hugging Face Transformers v5, NeMo AutoModel speeds up MoE fine-tuning with one additional import statement and no changes to the API your code already calls.

Experiments show that, for MoE fine-tuning, NVIDIA NeMo AutoModel delivers 3.4x to 3.7x higher training throughput and 29% to 32% lower GPU memory usage than stock Hugging Face Transformers v5.

On a single node with 8x H100 80GB GPUs, using Qwen3-30B-A3B as the example, NeMo AutoModel raised TPS/GPU (tokens per second per GPU) from 3,075 to 11,340, a 3.69x improvement.
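The headline figure is simply the ratio of the two per-GPU throughput numbers. As a quick check, the sketch below recomputes it from the reported values; the function name is only illustrative, and the node-level total assumes all 8 GPUs run at the reported per-GPU rate.

```python
def speedup(baseline: float, optimized: float) -> float:
    """Return how many times faster `optimized` is than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return optimized / baseline


# Tokens per second per GPU reported for Qwen3-30B-A3B on 8x H100 80GB.
BASELINE_TPS_PER_GPU = 3075    # Transformers v5
AUTOMODEL_TPS_PER_GPU = 11340  # NeMo AutoModel
GPUS_PER_NODE = 8

ratio = speedup(BASELINE_TPS_PER_GPU, AUTOMODEL_TPS_PER_GPU)
# Assumption: every GPU in the node sustains the reported per-GPU rate.
node_tps = AUTOMODEL_TPS_PER_GPU * GPUS_PER_NODE

print(f"speedup: {ratio:.2f}x")                    # speedup: 3.69x
print(f"node throughput: {node_tps:,} tokens/s")   # node throughput: 90,720 tokens/s
```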

Core Technology Analysis

MoE has become the dominant architecture for state-of-the-art models, but it also introduces new challenges for efficient training:

Expert parallelism, communication fusion, and kernel optimization are complex engineering tasks, and all of them require supporting infrastructure.

Hugging Face's Transformers v5 is currently one of the most widely used general-purpose foundations for MoE training. Version 5 enhances native MoE support by introducing core MoE capabilities such as expert backends, dynamic weight loading, and distributed execution.

NVIDIA's approach is to build on that foundation rather than replace it: NeMo AutoModel supports the Hugging Face Transformers API, so users get higher training throughput and lower GPU memory usage in MoE fine-tuning with minimal code changes.

Specifically, NeMo AutoModel builds on Transformers v5 by adding Expert Parallelism (EP), DeepEP, and TransformerEngine.

Expert Parallelism

Expert parallelism (EP) is used primarily to reduce memory pressure.

EP shards the expert weights across multiple GPUs, so each GPU holds only a subset of the experts rather than a full copy of all of them.

For example, with 8 GPUs and ep_size=8, the expert weights are split across the 8 GPUs, cutting the expert-weight memory on each GPU to 1/8 of the original.

In the reported experiments, this reduces peak memory usage for Qwen3-30B-A3B from 68.2 GiB to 48.1 GiB, a 29% reduction.

For Nemotron 3 Nano 30B-A3B, peak memory usage drops from 62.1 GiB to 42.5 GiB, a 32% reduction.
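To make the 1/8 figure concrete, here is a minimal sketch of how experts might be assigned to GPUs under expert parallelism. It assumes a simple contiguous split and an illustrative count of 128 experts per MoE layer; neither detail comes from the article, and NeMo AutoModel's actual placement logic may differ.

```python
def experts_on_rank(num_experts: int, ep_size: int, rank: int) -> range:
    """Return the contiguous block of expert indices owned by one EP rank."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    if not 0 <= rank < ep_size:
        raise ValueError("rank must be in [0, ep_size)")
    per_rank = num_experts // ep_size
    return range(rank * per_rank, (rank + 1) * per_rank)


NUM_EXPERTS = 128  # illustrative assumption, not a figure from the article
EP_SIZE = 8        # matches the ep_size=8 example in the text

for rank in range(EP_SIZE):
    owned = experts_on_rank(NUM_EXPERTS, EP_SIZE, rank)
    print(f"rank {rank}: experts {owned.start}..{owned.stop - 1}")

# Share of the expert weights each GPU stores under this split.
fraction_per_gpu = len(experts_on_rank(NUM_EXPERTS, EP_SIZE, 0)) / NUM_EXPERTS
print(fraction_per_gpu)  # 0.125
```

Only the expert weights shrink this way. Attention layers, embeddings, activations, and optimizer state for non-expert parameters are not divided by ep_size, which is one reason total peak memory falls by much less than 8x.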
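The two percentages follow directly from the peak-memory figures. The short sketch below recomputes them and shows how much memory is freed per GPU; the helper name is illustrative.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the reduction from `before` to `after` as a percentage of `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Peak memory per GPU in GiB as reported: (Transformers v5, NeMo AutoModel).
reported = {
    "Qwen3-30B-A3B": (68.2, 48.1),
    "Nemotron 3 Nano 30B-A3B": (62.1, 42.5),
}

for model, (before, after) in reported.items():
    saved = before - after
    pct = percent_reduction(before, after)
    print(f"{model}: {saved:.1f} GiB freed ({pct:.0f}% lower)")
# Qwen3-30B-A3B: 20.1 GiB freed (29% lower)
# Nemotron 3 Nano 30B-A3B: 19.6 GiB freed (32% lower)
```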

The freed-up space can be used to support larger batches and longer sequences.

DeepEP

DeepEP fuses communication with computation.

In conventional implementations, routing tokens to their experts and gathering the results back adds significant communication overhead around the expert computation. DeepEP implements these token dispatch and combine operations as optimized GPU kernels, which lets communication overlap with expert computation.
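For readers unfamiliar with the terms, the sketch below shows what "dispatch" and "combine" mean logically. It is a single-process toy in plain Python, not DeepEP's API: the function names, the scalar "tokens", and the toy experts are all assumptions made for illustration. In a real expert-parallel run, dispatch and combine are cross-GPU exchanges, and those are the steps DeepEP implements as optimized kernels.

```python
from collections import defaultdict


def make_expert(scale):
    """Toy expert: multiplies its input by a fixed scale (stands in for an MLP)."""
    return lambda x: scale * x


def dispatch(tokens, routes):
    """Group tokens by destination expert.

    routes[i] is a list of (expert_id, gate_weight) pairs for token i.
    """
    per_expert = defaultdict(list)
    for idx, (value, token_routes) in enumerate(zip(tokens, routes)):
        for expert_id, gate in token_routes:
            per_expert[expert_id].append((idx, value, gate))
    return per_expert


def run_experts(per_expert, experts):
    """Apply each expert to the batch of tokens it received."""
    return {
        expert_id: [(idx, experts[expert_id](value), gate) for idx, value, gate in batch]
        for expert_id, batch in per_expert.items()
    }


def combine(num_tokens, expert_outputs):
    """Send results back to token order, summing gate-weighted expert outputs."""
    out = [0.0] * num_tokens
    for batch in expert_outputs.values():
        for idx, y, gate in batch:
            out[idx] += gate * y
    return out


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
tokens = [1.0, 2.0, 3.0]
routes = [[(0, 0.5), (1, 0.5)], [(1, 1.0)], [(2, 0.25), (3, 0.75)]]

per_expert = dispatch(tokens, routes)
output = combine(len(tokens), run_experts(per_expert, experts))
print(output)  # [1.5, 4.0, 11.25]
```

Here the three stages run strictly one after another. The gain described in the article comes from not serializing them: while one batch of tokens is in flight between GPUs, the experts can already be computing on another.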

TransformerEngine

TransformerEngine kernels accelerate a range of core operations.

The library provides optimized implementations of attention, linear layers, and RMSNorm, so it speeds up standard Transformer layers as well as MoE layers.

One import, 3x speed improvement

In summary, for those already using Transformers v5, NVIDIA NeMo AutoModel offers a seamless upgrade path:

Add one import line and MoE fine-tuning runs more than 3x faster.

On Qwen3-30B-A3B and Nemotron 3 Nano 30B-A3B, this solution achieves a 3.4x to 3.7x increase in training throughput and reduces memory consumption by 29% to 32% compared to Transformers v5.

NVIDIA also demonstrated full-parameter fine-tuning of Nemotron 3 Ultra 550B A55B across 16 H100 nodes, 128 GPUs in total.

In that run, throughput was 815 tokens per second per GPU at approximately 293 TFLOP/s per GPU, with peak memory of 58.2 GiB.

There is no Transformers v5 baseline for this run because v5 simply runs out of memory at this scale ¯_(ツ)_/¯
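To put those per-GPU numbers in context, the sketch below derives the cluster-wide token rate and runs a rough sanity check on the FLOP figure. The check relies on two assumptions that are not from the article: that "A55B" means about 55 billion active parameters per token, and the common rule of thumb of roughly 6 FLOPs per active parameter per training token, which ignores attention's sequence-length-dependent cost. It is an estimate, not NVIDIA's accounting.

```python
NODES = 16
GPUS = 128
TPS_PER_GPU = 815              # reported tokens per second per GPU
REPORTED_TFLOPS_PER_GPU = 293  # reported, approximate

# Assumption: "A55B" denotes ~55 billion active parameters per token.
ACTIVE_PARAMS = 55e9
# Assumption: ~6 FLOPs per active parameter per token (forward + backward).
FLOPS_PER_PARAM_PER_TOKEN = 6

gpus_per_node = GPUS // NODES
aggregate_tps = TPS_PER_GPU * GPUS
estimated_tflops_per_gpu = FLOPS_PER_PARAM_PER_TOKEN * ACTIVE_PARAMS * TPS_PER_GPU / 1e12

print(f"GPUs per node: {gpus_per_node}")                      # GPUs per node: 8
print(f"cluster throughput: {aggregate_tps:,} tokens/s")      # cluster throughput: 104,320 tokens/s
print(f"rule-of-thumb estimate: {estimated_tflops_per_gpu:.0f} TFLOP/s per GPU")
print(f"reported: ~{REPORTED_TFLOPS_PER_GPU} TFLOP/s per GPU")
```

The estimate comes out somewhat below the reported figure, which is the expected direction given that the rule of thumb leaves out attention FLOPs.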

If you're interested, NVIDIA has already posted the code, configurations, and benchmarking scripts on GitHub: https://github.com/NVIDIA-NeMo/Automodel/tree/blog/transformers-v5-automodel/blog_experiments

Specific usage instructions are available here: https://docs.nvidia.com/nemo/automodel/latest/get-started/hf-compatibility

This article is from the WeChat public account "Quantum Bit," authored by Yu Yang.
