Aurora Optimizer Fixes the Muon Flaw That Kills a Quarter of Neurons, Claiming a 100x Gain in Training Efficiency

By MarsBit

According to monitoring by Beating, Tilde Research has discovered a hidden flaw in Muon, the optimizer adopted by leading models such as DeepSeek V4, Kimi K2.5, and GLM-5: it causes more than a quarter of the neurons in MLP layers to die permanently during early training. The team designed and open-sourced a replacement optimizer called Aurora. A 1.1B-parameter model trained with it on just ~100B tokens matches Qwen3-1.7B, which was trained on 36T tokens, on language-understanding benchmarks such as HellaSwag and Winogrande.

The issue stems from a mathematical property of how Muon processes MLP weight matrices. During early training, some neurons happen to receive weaker gradient signals. Traditional optimizers such as AdamW normalize updates parameter by parameter, which naturally smooths out these disparities; Muon's orthogonalization step, by contrast, passes the weak signals through unchanged. Weak neurons keep receiving weak updates and grow ever less active, a "rich-get-richer" death spiral (the first sketch below illustrates the disparity). By step 500 of training, over a quarter of all neurons are effectively dead, wasting valuable parameter capacity.

Earlier fixes such as NorMuon mitigated the problem by forcing equal update magnitudes across rows, but at the cost of destroying the orthogonality of the update matrix. Orthogonalization is Muon's core advantage, the property that makes its updates so efficient, so this sacrificed optimization precision.

Aurora instead treats "uniform updates" and "orthogonality" as joint constraints and satisfies both at once through alternating iterations: every neuron gets a fair chance to learn, without compromising update precision (see the second sketch below). Aurora adds only about 6% computational overhead relative to Muon, requires no extra tuning, and can replace it directly. In modded-nanoGPT benchmark tests it set a new state-of-the-art record of 3,175 steps, and its advantage grows as MLP width increases: the higher the scaling factor, the greater the improvement. Both the code and the 1.1B pre-trained model have been open-sourced.
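To make the mechanism concrete, here is a minimal NumPy sketch of Muon's core step, the quintic Newton-Schulz iteration that public Muon implementations use to approximate msign(G). The matrix shapes, the 1e-3 scale of the "weak" rows, and the five-step iteration count are illustrative assumptions, not Tilde's setup. With a finite number of iterations, rows carrying tiny gradients come out of the orthogonalization still tiny, while an AdamW-style per-parameter normalization hands every neuron a comparably sized update:

```python
import numpy as np

def newton_schulz_msign(G, steps=5):
    # Quintic Newton-Schulz iteration approximating msign(G) = U V^T,
    # with the coefficients used by public Muon implementations.
    # Assumes a wide matrix (rows <= cols).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # scale down via Frobenius norm
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

rng = np.random.default_rng(0)
G = rng.normal(size=(64, 256))  # hypothetical MLP gradient, one row per neuron
G[:16] *= 1e-3                  # "weak" neurons: tiny gradient rows

U = newton_schulz_msign(G)
rows = np.linalg.norm(U, axis=1)
print("mean Muon update norm, weak rows  :", rows[:16].mean())  # stays tiny
print("mean Muon update norm, strong rows:", rows[16:].mean())  # close to 1

# A crude stand-in for AdamW's per-parameter second-moment normalization:
# every entry ends up with comparable magnitude, so weak neurons still
# receive full-sized updates.
adamw_like = G / (np.abs(G) + 1e-12)
an = np.linalg.norm(adamw_like, axis=1)
print("mean AdamW-like norm, weak rows   :", an[:16].mean())  # ~16 (sqrt(256))
print("mean AdamW-like norm, strong rows :", an[16:].mean())  # ~16 as well
```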
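The article describes Aurora's fix only at the level of "alternating iterations over joint constraints," so the following is a toy sketch of that idea rather than Tilde's actual algorithm: alternately orthogonalize the update (Muon's precision) and equalize its row norms (every neuron learns). The function name and all loop counts are hypothetical, and the sketch reuses `newton_schulz_msign` and `G` from above:

```python
def aurora_like_update(G, outer_steps=3, ns_steps=5):
    # Toy alternating projections between two constraint sets:
    #   (1) near-orthogonality -> Muon's well-conditioned update
    #   (2) uniform row norms  -> every neuron gets a same-size update
    # A NorMuon-style fix applies only (2) once; alternating and
    # re-orthogonalizing is the "joint constraints" idea described above.
    X = G
    for _ in range(outer_steps):
        X = newton_schulz_msign(X, ns_steps)            # pull toward (1)
        rn = np.linalg.norm(X, axis=1, keepdims=True)
        X = X / (rn + 1e-8)                             # pull toward (2)
    return X

X = aurora_like_update(G)
rows = np.linalg.norm(X, axis=1)
print("row-norm spread:", rows.max() - rows.min())  # ~0: no starved neurons

# Orthogonality is not sacrificed: in this toy, the alternating scheme ends
# up no less orthogonal than plain Muon's own approximate step.
I = np.eye(64)
for name, M in [("Muon", newton_schulz_msign(G)), ("Aurora-like", X)]:
    err = np.linalg.norm(M @ M.T - I) / np.linalg.norm(I)
    print(f"{name:12s} relative orthogonality error: {err:.3f}")
```

Ending each round on the row-normalization step makes the per-neuron update size exactly uniform, while the preceding orthogonalization keeps the update well-conditioned. A handful of such extra rounds would be consistent with the roughly 6% overhead the article cites, though the real schedule is not disclosed here.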
In brief: Aurora, a new optimizer from Tilde Research, fixes a flaw in Muon, a tool used by leading models such as DeepSeek V4 and Kimi K2.5, that leaves more than a quarter of MLP neurons permanently dead in early training. The team puts the resulting gain in training efficiency at roughly 100x: a 1.1B model trained on ~100B tokens with Aurora performs on par with Qwen3-1.7B trained on 36T tokens. By balancing update uniformity with orthogonality at only about 6% overhead, the open-source optimizer has already set a new benchmark record in modded-nanoGPT.