NVIDIA Open-Sources MoE Optimization Tool, Speeds Up Fine-Tuning by 3.7x

A single import line speeds up fine-tuning of large MoE models by up to 3.7x.

NVIDIA has open-sourced its latest work: NeMo AutoModel, a library designed for building and fine-tuning large-scale generative AI models.

Built on Hugging Face Transformers v5, NeMo AutoModel speeds up MoE fine-tuning with just one additional import statement and no changes to the API that existing code calls.

Experiments show that NVIDIA NeMo AutoModel achieves a 3.4x to 3.7x increase in training throughput and reduces GPU memory usage by 29% to 32% compared to the original Hugging Face Transformers v5 during MoE fine-tuning.

On a single node with 8x H100 80GB GPUs, using Qwen3-30B-A3B as an example, NeMo AutoModel raised TPS/GPU (tokens per second per GPU) from 3,075 to 11,340, a 3.69x improvement.
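The quoted speedup is simply the ratio of the two throughput figures. A minimal sketch of that arithmetic, using only the numbers reported above (the function name is illustrative, not part of any NVIDIA tooling):

```python
def speedup(baseline_tps: float, optimized_tps: float) -> float:
    """Return the ratio of optimized throughput to baseline throughput."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return optimized_tps / baseline_tps


# Qwen3-30B-A3B figures quoted in the article, in tokens per second per GPU.
qwen3_speedup = speedup(3075, 11340)
print(f"{qwen3_speedup:.2f}x")  # prints 3.69x
```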

Core Technology Analysis

MoE has become the dominant architecture for state-of-the-art models, but it also introduces new challenges for efficient training:

Expert parallelism, communication fusion, and kernel optimization are all complex engineering tasks that need supporting infrastructure.

Hugging Face's Transformers v5 is currently one of the most widely used general-purpose foundations for MoE training. Version 5 enhances native MoE support by introducing core MoE capabilities such as expert backends, dynamic weight loading, and distributed execution.

NVIDIA's approach is to build on that existing work: NeMo AutoModel supports the Hugging Face Transformers API, so users get higher training throughput and lower GPU memory usage in MoE fine-tuning with minimal code changes.

Specifically, NeMo AutoModel builds on Transformers v5 by adding Expert Parallelism (EP), DeepEP, and TransformerEngine.

Expert Parallelism

Expert parallelism is used primarily to reduce memory pressure.

EP distributes expert weights across multiple GPUs, so that each GPU holds only a subset of the experts rather than a full copy of all of them.

For example, with 8 GPUs and ep_size=8, the expert weights are split across the 8 GPUs, reducing the expert-weight memory per GPU to 1/8 of the unsharded footprint.
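The sharding idea can be sketched in a few lines. This is an illustration only: the contiguous block assignment, the function name, and the expert count of 128 are assumptions made for the example, not a description of how NeMo AutoModel places experts.

```python
def shard_experts(num_experts: int, ep_size: int) -> dict[int, list[int]]:
    """Assign a contiguous block of expert indices to each expert-parallel rank."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    per_rank = num_experts // ep_size
    return {
        rank: list(range(rank * per_rank, (rank + 1) * per_rank))
        for rank in range(ep_size)
    }


NUM_EXPERTS = 128  # hypothetical expert count, chosen for illustration
placement = shard_experts(NUM_EXPERTS, ep_size=8)

# Each rank stores 1/8 of the experts, so 1/8 of the expert weights.
fraction_per_gpu = len(placement[0]) / NUM_EXPERTS
print(len(placement[0]), fraction_per_gpu)  # prints 16 0.125
```

Only the expert weights shrink this way. Attention weights, activations, and optimizer state for non-expert parameters are unaffected by ep_size, which is why the overall peak memory falls by less than 8x.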

According to the experimental results, this reduces peak memory usage for Qwen3-30B-A3B from 68.2 GiB to 48.1 GiB, a 29% reduction.

For Nemotron 3 Nano 30B-A3B, peak memory usage drops from 62.1 GiB to 42.5 GiB, a 32% reduction.
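Both percentages follow directly from the before and after figures. A short check, assuming nothing beyond the numbers quoted above (the helper name is illustrative):

```python
def reduction_pct(before_gib: float, after_gib: float) -> float:
    """Return the percentage drop from before_gib to after_gib."""
    return (before_gib - after_gib) / before_gib * 100


# Peak memory figures quoted in the article, in GiB.
qwen3_pct = reduction_pct(68.2, 48.1)
nemotron_pct = reduction_pct(62.1, 42.5)
print(round(qwen3_pct), round(nemotron_pct))  # prints 29 32
```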

The freed memory can be used for larger batches and longer sequences.

DeepEP

DeepEP fuses communication with computation.

In traditional approaches, routing tokens to experts and gathering the results back adds significant communication overhead around the expert computation. DeepEP implements the token dispatch and combine operations as optimized GPU kernels, allowing communication to overlap with expert computation.
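To make the two operations concrete, here is a single-process sketch of the logical data flow: dispatch groups tokens by expert, the experts run, and combine restores the original token order. It is not DeepEP's API and performs no communication or overlap; the function names, the top-1 routing, and the toy experts are all assumptions for illustration.

```python
from collections import defaultdict


def dispatch(tokens, expert_ids):
    """Group tokens by routed expert, keeping each token's original position."""
    buckets = defaultdict(list)
    for position, (token, expert) in enumerate(zip(tokens, expert_ids)):
        buckets[expert].append((position, token))
    return buckets


def run_experts(buckets, expert_fns):
    """Apply each expert to the tokens routed to it."""
    return {
        expert: [(position, expert_fns[expert](token)) for position, token in items]
        for expert, items in buckets.items()
    }


def combine(processed, num_tokens):
    """Scatter expert outputs back into the original token order."""
    output = [None] * num_tokens
    for items in processed.values():
        for position, value in items:
            output[position] = value
    return output


tokens = [1.0, 2.0, 3.0, 4.0]
expert_ids = [0, 1, 0, 1]  # hypothetical top-1 routing decisions
expert_fns = {0: lambda x: x * 10, 1: lambda x: x + 100}  # toy stand-ins for experts

result = combine(run_experts(dispatch(tokens, expert_ids), expert_fns), len(tokens))
print(result)  # prints [10.0, 102.0, 30.0, 104.0]
```

In a multi-GPU setting, dispatch and combine each involve an all-to-all exchange between ranks, which is the communication that the article says DeepEP overlaps with expert computation.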

TransformerEngine

TransformerEngine kernels accelerate a range of core operations.

It provides optimized implementations of attention, linear layers, and RMSNorm, so it accelerates standard Transformer layers as well as MoE layers.
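For readers unfamiliar with RMSNorm, the operation itself is small. The following is a plain-Python reference of the standard RMSNorm formula, written only to show what such a kernel computes; it is not TransformerEngine code, and the eps default is an assumption.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


y = rms_norm([3.0, 4.0], [1.0, 1.0])
# With unit weights, the output has a mean square of approximately 1.
print(sum(v * v for v in y) / len(y))
```

An optimized kernel computes the same result but avoids the separate passes over memory that a naive implementation makes.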

One Import, 3x Speedup

In summary, for those already using Transformers v5, NVIDIA NeMo AutoModel offers a seamless upgrade path:

Adding a single import line yields a speedup of more than 3x in MoE fine-tuning.

On Qwen3-30B-A3B and Nemotron 3 Nano 30B-A3B, this solution achieves a 3.4x to 3.7x increase in training throughput and reduces memory consumption by 29% to 32% compared to Transformers v5.

NVIDIA also demonstrated full-parameter fine-tuning of Nemotron 3 Ultra 550B A55B across 16 H100 nodes, 128 GPUs in total.

In that run, throughput was 815 TPS/GPU at approximately 293 TFLOP/s/GPU, with peak memory of 58.2 GiB.
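The per-GPU figures can be scaled up to cluster-wide totals. The sketch below does only that multiplication, assuming all 128 GPUs sustain the quoted per-GPU rates; the totals are derived here and are not numbers reported by NVIDIA.

```python
NODES = 16
TOTAL_GPUS = 128
TPS_PER_GPU = 815      # tokens per second per GPU, as quoted
TFLOPS_PER_GPU = 293   # approximate TFLOP/s per GPU, as quoted

gpus_per_node = TOTAL_GPUS // NODES
aggregate_tps = TPS_PER_GPU * TOTAL_GPUS
aggregate_pflops = TFLOPS_PER_GPU * TOTAL_GPUS / 1000

print(gpus_per_node)               # prints 8
print(aggregate_tps)               # prints 104320
print(round(aggregate_pflops, 1))  # prints 37.5
```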

There is no comparison with Transformers v5 here because v5 simply runs out of memory at this scale ¯\_(ツ)_/¯

For those interested, NVIDIA has published the code, configurations, and benchmarking scripts on GitHub: https://github.com/NVIDIA-NeMo/Automodel/tree/blog/transformers-v5-automodel/blog_experiments

Specific usage instructions are available here: https://docs.nvidia.com/nemo/automodel/latest/get-started/hf-compatibility

This article is from the WeChat public account "Quantum Bit," authored by Yu Yang.