Aurora Optimizer Fixes the Muon Flaw That Kills a Quarter of Neurons, Claiming a 100x Gain in Training Efficiency

By MarsBit

According to monitoring by Beating, Tilde Research has discovered a hidden flaw in Muon, the optimizer adopted by leading models such as DeepSeek V4, Kimi K2.5, and GLM-5: it causes more than a quarter of the neurons in MLP layers to die permanently during early training. The team designed and open-sourced a replacement optimizer called Aurora. A 1.1B-parameter model trained with it on just ~100B tokens matches Qwen3-1.7B, which was trained on 36T tokens, on language-understanding benchmarks such as HellaSwag and Winogrande.

The issue stems from a mathematical property of how Muon processes MLP weight matrices. During early training, some neurons happen to receive weaker gradient signals. Traditional optimizers such as AdamW normalize updates parameter by parameter, which naturally smooths out these disparities; Muon's orthogonalization step, by contrast, passes the weak signals through unchanged. Weak neurons keep receiving weak updates and grow ever less active, a "rich-get-richer" death spiral (the first sketch below illustrates the disparity). By step 500 of training, over a quarter of all neurons are effectively dead, wasting valuable parameter capacity.

Earlier fixes such as NorMuon mitigated the problem by forcing equal update magnitudes across rows, but at the cost of destroying the orthogonality of the update matrix. Orthogonalization is Muon's core advantage, the property that makes its updates so efficient, so this sacrificed optimization precision.

Aurora instead treats "uniform updates" and "orthogonality" as joint constraints and satisfies both at once through alternating iterations: every neuron gets a fair chance to learn, without compromising update precision (see the second sketch below). Aurora adds only about 6% computational overhead relative to Muon, requires no extra tuning, and can replace it directly. In modded-nanoGPT benchmark tests it set a new state-of-the-art record of 3,175 steps, and its advantage grows as MLP width increases: the higher the scaling factor, the greater the improvement. Both the code and the 1.1B pre-trained model have been open-sourced.
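To make the mechanism concrete, here is a minimal NumPy sketch of Muon's core step, the quintic Newton-Schulz iteration that public Muon implementations use to approximate msign(G). The matrix shapes, the 1e-3 scale of the "weak" rows, and the five-step iteration count are illustrative assumptions, not Tilde's setup. With a finite number of iterations, rows carrying tiny gradients come out of the orthogonalization still tiny, while an AdamW-style per-parameter normalization hands every neuron a comparably sized update:

```python
import numpy as np

def newton_schulz_msign(G, steps=5):
    # Quintic Newton-Schulz iteration approximating msign(G) = U V^T,
    # with the coefficients used by public Muon implementations.
    # Assumes a wide matrix (rows <= cols).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # scale down via Frobenius norm
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

rng = np.random.default_rng(0)
G = rng.normal(size=(64, 256))  # hypothetical MLP gradient, one row per neuron
G[:16] *= 1e-3                  # "weak" neurons: tiny gradient rows

U = newton_schulz_msign(G)
rows = np.linalg.norm(U, axis=1)
print("mean Muon update norm, weak rows  :", rows[:16].mean())  # stays tiny
print("mean Muon update norm, strong rows:", rows[16:].mean())  # close to 1

# A crude stand-in for AdamW's per-parameter second-moment normalization:
# every entry ends up with comparable magnitude, so weak neurons still
# receive full-sized updates.
adamw_like = G / (np.abs(G) + 1e-12)
an = np.linalg.norm(adamw_like, axis=1)
print("mean AdamW-like norm, weak rows   :", an[:16].mean())  # ~16 (sqrt(256))
print("mean AdamW-like norm, strong rows :", an[16:].mean())  # ~16 as well
```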
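The article describes Aurora's fix only at the level of "alternating iterations over joint constraints," so the following is a toy sketch of that idea rather than Tilde's actual algorithm: alternately orthogonalize the update (Muon's precision) and equalize its row norms (every neuron learns). The function name and all loop counts are hypothetical, and the sketch reuses `newton_schulz_msign` and `G` from above:

```python
def aurora_like_update(G, outer_steps=3, ns_steps=5):
    # Toy alternating projections between two constraint sets:
    #   (1) near-orthogonality -> Muon's well-conditioned update
    #   (2) uniform row norms  -> every neuron gets a same-size update
    # A NorMuon-style fix applies only (2) once; alternating and
    # re-orthogonalizing is the "joint constraints" idea described above.
    X = G
    for _ in range(outer_steps):
        X = newton_schulz_msign(X, ns_steps)            # pull toward (1)
        rn = np.linalg.norm(X, axis=1, keepdims=True)
        X = X / (rn + 1e-8)                             # pull toward (2)
    return X

X = aurora_like_update(G)
rows = np.linalg.norm(X, axis=1)
print("row-norm spread:", rows.max() - rows.min())  # ~0: no starved neurons

# Orthogonality is not sacrificed: in this toy, the alternating scheme ends
# up no less orthogonal than plain Muon's own approximate step.
I = np.eye(64)
for name, M in [("Muon", newton_schulz_msign(G)), ("Aurora-like", X)]:
    err = np.linalg.norm(M @ M.T - I) / np.linalg.norm(I)
    print(f"{name:12s} relative orthogonality error: {err:.3f}")
```

Ending each round on the row-normalization step makes the per-neuron update size exactly uniform, while the preceding orthogonalization keeps the update well-conditioned. A handful of such extra rounds would be consistent with the roughly 6% overhead the article cites, though the real schedule is not disclosed here.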
In brief: Aurora, a new optimizer from Tilde Research, fixes a flaw in Muon, a tool used by leading models such as DeepSeek V4 and Kimi K2.5, that leaves more than a quarter of MLP neurons permanently dead in early training. The team puts the resulting gain in training efficiency at roughly 100x: a 1.1B model trained on ~100B tokens with Aurora performs on par with Qwen3-1.7B trained on 36T tokens. By balancing update uniformity with orthogonality at only about 6% overhead, the open-source optimizer has already set a new benchmark record in modded-nanoGPT.