BlockBeats report, March 3: Developer Manjeet Singh (GitHub: maderix), working with Claude Opus, has for the first time run neural network training with backpropagation on the Apple Neural Engine (ANE) of the M4 chip, by reverse-engineering Apple's undocumented private APIs. The ANE is an accelerator Apple designed specifically for inference; training has never been officially supported, and developers can normally reach its inference functions only indirectly through the CoreML framework.
The project bypasses CoreML entirely, mapping more than 40 private classes (including _ANEClient and _ANECompiler) directly onto the IOKit kernel driver. It also uncovered the _ANEInMemoryModelDescriptor interface, which allows a model to be compiled directly in memory, a capability critical for training because updated weights must be recompiled into the model at every step. So far, training of a single transformer layer (dim=768, seq=512) is implemented, running at 9.3 ms per step on the M4 with ANE utilization of 11.2% (1.78 TFLOPS against a theoretical peak of 15.8 TFLOPS). The forward pass and backpropagation gradients are computed on the ANE, while weight gradients and the Adam optimizer step run on the CPU.
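The CPU side of that split is a standard Adam update applied to weights whose gradients come back from the ANE. A minimal numpy sketch of one such step follows; the function name, hyperparameters, and the random data are illustrative assumptions, not the project's actual code.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One CPU-side Adam update (hypothetical sketch; the article only
    states that weight gradients and Adam run on the CPU)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Weight matrix for one dim=768 projection, as in the article's layer size.
rng = np.random.default_rng(0)
w = rng.standard_normal((768, 768)).astype(np.float32)
m = np.zeros_like(w)
v = np.zeros_like(w)
grad = rng.standard_normal(w.shape).astype(np.float32)  # stand-in for an ANE gradient
w, m, v = adam_step(w, grad, m, v, t=1)
```

In such a setup the recompilation cost noted above would dominate: after each CPU-side update, the new weights have to be baked into a fresh in-memory model before the next ANE step.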
The project also found that the ANE's core computational primitive is convolution, not matrix multiplication: expressing matrix multiplication as 1x1 convolutions yields roughly a 3x throughput improvement, and bypassing CoreML to call the driver directly adds a further 2x to 4x, which the author argues makes Apple's advertised "38 TOPS" figure misleading. The project remains at an early stage: it supports only single-layer training on synthetic data, and resource leaks in the compilation path cap each process at roughly 119 compilations, currently worked around by restarting the process; multi-layer training and real-data support are under development. The code is open-sourced under the MIT license and has gathered around 2,800 stars in five days.
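The matmul-as-1x1-convolution trick rests on a simple identity: a 1x1 convolution over the channel dimension is exactly a per-position matrix multiply. The numpy sketch below (shapes chosen to match the article's dim=768, seq=512 layer; the helper name is an assumption) verifies the equivalence, which is what lets a convolution-native accelerator run matmul workloads at all.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution over an NCHW tensor:
    out[n, o, h, w] = sum_i w[o, i] * x[n, i, h, w]."""
    return np.einsum('oi,nihw->nohw', w, x)

rng = np.random.default_rng(0)
A = rng.standard_normal((768, 768)).astype(np.float32)  # kernel: C_out=768, C_in=768
B = rng.standard_normal((768, 512)).astype(np.float32)  # activations: dim x seq

# Treat B's 512 columns as "pixels" along W, with H=1 and channels = K.
x = B.reshape(1, 768, 1, 512)
y = conv1x1(x, A).reshape(768, 512)

assert np.allclose(y, A @ B, atol=1e-2)  # identical up to float32 rounding
```

Mapping matmuls onto this layout is the kind of restructuring the project credits for its ~3x throughput gain, since it matches the ANE's native data path instead of forcing a primitive the hardware does not have.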
