Developer Achieves First Neural Network Training on Apple Neural Engine via Reverse Engineering

Summary
A developer has executed what appears to be the first neural network training run with backpropagation on Apple's Neural Engine in the M4 chip, achieved through reverse engineering. The project bypasses CoreML, mapping over 40 private classes down to the IOKit kernel driver and using an undocumented interface for in-memory model compilation. Training of a single transformer layer reaches 1.78 TFLOPS, demonstrating far higher utilization of the hardware than official tooling permits. The code is open source under the MIT license.

BlockBeats report, March 3: Developer Manjeet Singh (GitHub: maderix), working with Claude Opus, has for the first time performed neural network training with backpropagation on Apple's Neural Engine (ANE) in the M4 chip, by reverse engineering Apple's undocumented private APIs. The ANE is an accelerator Apple designed specifically for inference; training has never been officially supported, and developers can only access its inference capabilities indirectly through the CoreML framework.


The project bypasses CoreML entirely, mapping over 40 private classes—including _ANEClient and _ANECompiler—down to the IOKit kernel driver, and discovered the _ANEInMemoryModelDescriptor interface, which enables model compilation directly in memory. This capability is critical for training, since weight updates require recompilation at every step. Training is currently implemented for a single transformer layer (dim=768, seq=512), achieving 9.3 ms per step on the M4 at 11.2% ANE utilization (1.78 TFLOPS against a theoretical peak of 15.8 TFLOPS). The forward pass and activation gradients are computed on the ANE, while weight gradients and the Adam optimizer run on the CPU.
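The reported figures are internally consistent, as a quick back-of-envelope check shows (a sketch; it assumes the 1.78 TFLOPS figure is an average over the full 9.3 ms training step):

```python
# Sanity check on the reported performance numbers.
step_ms = 9.3            # reported time per training step on the M4
achieved_tflops = 1.78   # reported sustained throughput
peak_tflops = 15.8       # reported theoretical peak of the M4 ANE

utilization = achieved_tflops / peak_tflops            # fraction of peak
flops_per_step = achieved_tflops * 1e12 * step_ms * 1e-3  # total FLOPs per step

print(f"utilization   = {utilization:.1%}")            # prints "11.3%"
print(f"work per step = {flops_per_step / 1e9:.1f} GFLOP")  # prints "16.6 GFLOP"
```

The 11.3% computed here matches the article's ~11.2% to within rounding of the reported inputs.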


The project also found that the ANE's core computational primitive is convolution, not matrix multiplication: expressing matrix multiplications as 1x1 convolutions yields roughly a 3x throughput improvement, and calling the hardware directly instead of through CoreML adds a further 2x to 4x gain—suggesting, in the author's view, that Apple's advertised "38 TOPS" figure is misleading in practice. The work remains at an early stage: it supports only single-layer training, uses synthetic data, and leaks roughly 119 resources per compilation, requiring a process restart to work around; multi-layer training and real-data support are under development. The project is open-sourced under the MIT license and gathered around 2,800 GitHub stars in five days.
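The matmul-to-convolution rewrite rests on a simple identity: a 1x1 convolution mixes channels at each spatial position independently, which is exactly a matrix multiply applied row by row. A minimal, dependency-free sketch of that equivalence (the specific shapes and values here are illustrative, not from the project):

```python
def matmul(a, b):
    """Plain (M, K) @ (K, N) matrix multiply."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matmul_as_conv1x1(x, w):
    """The same product expressed as a 1x1 convolution: each of the M rows
    of x is treated as one spatial position carrying K input channels, and
    each column of w as a 1x1 kernel producing one output channel."""
    n_out = len(w[0])
    out = []
    for pixel in x:  # one spatial position per row of x
        # 1x1 conv at this position: dot the channel vector with each kernel
        out.append([sum(c * w[k][o] for k, c in enumerate(pixel))
                    for o in range(n_out)])
    return out

x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # M=3 positions, K=2 channels
w = [[0.5, -1.0], [2.0, 0.25]]            # K=2 in, N=2 out channels
assert matmul_as_conv1x1(x, w) == matmul(x, w)
```

Because the two formulations are mathematically identical, the 3x speedup comes purely from the rewrite landing on the hardware's native convolution path rather than an emulated matmul path.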
