Cursor Multi-Agent System Optimizes 235 NVIDIA GPU Operators in Three Weeks, Approaching Hardware Limits

Summary

On April 15 (UTC+8), the AI programming tool Cursor announced a collaboration with NVIDIA using its multi-agent system. Over three weeks, the system optimized 235 real-world GPU operators extracted from 124 open-source models, running on 27 Blackwell B200 GPUs and achieving a 38% geometric mean speedup. 149 operators (63%) outperformed their baselines, with 45 (19%) demonstrating over 2x acceleration. Key improvements included 84% faster BF16 grouped query attention and 39% faster NVFP4 MoE layer operations. Cursor noted GPU resource constraints and plans to integrate the multi-agent technology into its core product.

BlockBeats reports that on April 15 (UTC+8), the AI programming tool Cursor disclosed a multi-agent collaboration experiment with NVIDIA. The system ran autonomously for three weeks on 27 Blackwell B200 GPUs, tackling 235 real-world operator optimization problems extracted from 124 production-grade open-source models, including DeepSeek, Qwen, and Gemma. It generated and optimized GPU operator code from scratch, achieving an overall geometric mean speedup of 38%.

GPU operator optimization is among the most challenging domains in software engineering: it requires mastery of chip architecture, assembly-level instructions, and memory scheduling, and a high-performance operator typically takes senior experts months or even years to refine. Cursor's multi-agent system handled all 235 problems simultaneously: one planning agent assigned tasks and dynamically rescheduled them based on performance metrics, while multiple worker agents optimized in parallel. The system autonomously invoked NVIDIA's SOL-ExecBench benchmarking pipeline, forming an automated "test-debug-optimize" loop with zero human intervention.

The experiment ran two rounds in two different languages: CUDA C (with inline PTX assembly), to test raw low-level hardware reasoning, and CuTe DSL, to test the system's ability to learn new APIs that rarely appear in public training data. Of the 235 problems, the system outperformed baselines on 149 (63%), with 45 (19%) achieving over 2x speedup. Three representative results:

1. BF16 grouped query attention (extracted from a Llama 3.1 8B inference scenario): 84% faster than the manually optimized FlashInfer library, with an SOL score of 0.9722, nearly reaching the hardware's theoretical limit (a perfect score is 1.0).
2. BF16 matrix multiplication: the automatically generated operator reached 86% of the performance of NVIDIA's hand-tuned cuBLAS and outperformed the baseline by up to 9% in the small-M shapes common in LLM decoding.
3. NVFP4 linear operations in Mixture-of-Experts layers (extracted from MoE models such as Qwen3): the system autonomously identified bottlenecks in 4-bit floating-point quantization and applied targeted fusion optimizations, achieving a 39% speedup.

Cursor acknowledged that the overall median SOL score was only 0.56, leaving significant room for improvement, which it attributed primarily to limited GPU resources (27 GPUs shared across all 235 tasks). Cursor stated that these multi-agent technologies "will be integrated into core products very soon." An IDE company's AI agent has now approached the performance of top human experts at assembly-level GPU optimization, far surpassing the narrative of "helping you write application code." (Source: BlockBeats)
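The 38% headline figure is a geometric mean of per-operator speedup ratios, which keeps a few extreme outliers from dominating the aggregate. A minimal sketch of that computation (the example ratios are made up for illustration, not taken from the experiment):

```python
import math

def geomean_speedup(speedups):
    """Geometric mean of per-operator speedup ratios.

    Each ratio is baseline_time / optimized_time for one operator;
    a 38% geometric-mean speedup over N operators means the product
    of the ratios, taken to the 1/N power, equals 1.38.
    """
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Illustrative (made-up) ratios: some operators regress (< 1.0),
# some exceed 2x, and the aggregate lands in between.
ratios = [0.9, 1.1, 1.4, 2.3, 1.5]
print(round(geomean_speedup(ratios), 3))
```

Unlike an arithmetic mean, one 10x outlier cannot single-handedly pull the aggregate far upward, which makes the 38% figure a conservative way to summarize 235 heterogeneous tasks.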
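The "test-debug-optimize" loop can be pictured as a propose-measure-keep cycle per operator. The sketch below is a hypothetical illustration of that control flow, not Cursor's implementation; `propose` and `benchmark` are stand-ins for the agent's code generation and for NVIDIA's SOL-ExecBench pipeline:

```python
def optimize_operator(baseline_time, propose, benchmark, rounds=10):
    """Hypothetical sketch of a per-operator test-debug-optimize loop.

    propose(best_src)  -> a new candidate kernel (agent writes/edits code)
    benchmark(src)     -> (correct: bool, time: float), i.e. compile,
                          verify numerics, and measure the candidate
    Only correct candidates that strictly beat the incumbent are kept.
    """
    best_src, best_time = None, baseline_time
    for _ in range(rounds):
        candidate = propose(best_src)       # generate or refine a kernel
        ok, t = benchmark(candidate)        # automated test + timing
        if ok and t < best_time:            # keep strict improvements only
            best_src, best_time = candidate, t
    return best_src, baseline_time / best_time  # final speedup ratio
```

In the article's setup, a planning agent would run many such loops in parallel across worker agents and reschedule GPU time toward the operators showing the most improvement.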
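Grouped query attention, the operator behind the headline 84% result, lets several query heads share one key/value head, shrinking the KV-cache traffic that dominates decode-time attention and making the kernel memory-bandwidth-bound. A minimal NumPy sketch of the math (fp32, no masking; the real kernel runs fused in BF16 on the GPU):

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """Minimal grouped query attention sketch.

    q: (n_q_heads, seq, d);  k, v: (n_kv_heads, seq, d).
    Each group of n_q_heads // n_kv_heads query heads shares one
    KV head, so far less K/V data must stream from memory than in
    standard multi-head attention.
    """
    n_q_heads, seq, d = q.shape
    group = n_q_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kh = h // group                          # shared KV head index
        scores = q[h] @ k[kh].T / np.sqrt(d)     # scaled dot product
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)       # softmax over keys
        out[h] = w @ v[kh]
    return out
```

With `n_kv_heads == n_q_heads` this reduces to ordinary multi-head attention; Llama 3.1 8B uses 32 query heads over 8 KV heads, a 4:1 grouping.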
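NVFP4 stores values in the 4-bit FP4 (E2M1) format, which can represent only 8 magnitudes (plus sign), with a small shared scale per block of elements. The toy sketch below quantizes one block against that magnitude table; it illustrates the numeric format only and simplifies the real scheme (which uses FP8 E4M3 block scales over 16-element blocks) by keeping the scale in fp32:

```python
import numpy as np

# The 8 non-negative magnitudes representable in FP4 E2M1;
# the sign bit doubles this to 16 possible codes.
FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_block(x):
    """Toy per-block FP4 quantization sketch (not NVIDIA's kernel)."""
    scale = np.abs(x).max() / FP4_E2M1[-1]   # map the block max to 6.0
    if scale == 0.0:                         # all-zero block
        scale = 1.0
    scaled = x / scale
    # round-to-nearest against the FP4 magnitude table
    idx = np.abs(np.abs(scaled)[:, None] - FP4_E2M1).argmin(axis=1)
    q = np.sign(scaled) * FP4_E2M1[idx]
    return q, scale                          # dequantize as q * scale
```

The coarse 8-magnitude grid is why quantize/dequantize steps cluster around MoE linear layers, and why fusing them into the surrounding matmul, as the article describes, removes a real bottleneck rather than a rounding detail.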
