Huawei and USTC collaborate to break NVIDIA's monopoly, speeding up MoE expert computation on the Ascend A3 by up to 58%

ME AI News: According to monitoring by Beating, training large models on China's domestic Ascend chips has become a key direction in building autonomous and controllable AI computing power as large-scale MoE architectures evolve. However, most mainstream large-model frameworks are built on NVIDIA's CUDA ecosystem, and porting them directly to the Ascend platform often runs into uneven hardware queue scheduling and low compute utilization. A joint team from the University of Science and Technology of China, Huawei, and Peking University has introduced HyperParallel-MoE, a compilation and scheduling framework that provides tile-level control tailored to the Ascend A3's hardware queues, aiming to break through the energy-efficiency bottlenecks of parallel scheduling on heterogeneous compute resources.
The Ascend A3 has two core types: AIC cores handle matrix multiplication, while AIV cores handle vector computation and communication. Under traditional serial scheduling, where operators run one after another, the two core types can only take turns, so one sits idle while the other runs. Measurements on a 671B DeepSeek-style model running across a 256-node cluster showed AIC utilization of only 67%, with 39% of expert-routing communication latency falling on the critical computation path.
HyperParallel-MoE introduces three key innovations. First, it designs an AIV-driven one-sided write primitive that triggers computation as soon as a data tile arrives, rather than waiting for an entire batch. Second, it introduces dependency-aware tile task generation, which unifies communication and computation operators under a single abstraction. Third, it uses a static scheduler to pre-generate task sequences that drive both core types in parallel within a single kernel, and it shares intermediate results through the high-speed L2 cache, reducing the latency of frequent reads and writes to slower HBM memory.
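To illustrate why tile-level triggering helps, here is a minimal toy model in Python. It is not the HyperParallel-MoE scheduler: the function names, the two-engine abstraction, and the tile timings are all assumptions made for illustration. It only compares running all communication before any computation (serial operators) with starting each tile's computation as soon as that tile's data has arrived.

```python
def serial_makespan(comm, compute):
    """Operator-serial schedule: every communication tile finishes
    before the first compute tile starts."""
    return sum(comm) + sum(compute)


def overlapped_makespan(comm, compute):
    """Tile-level schedule: tile i is computed as soon as its data has
    arrived and the compute engine is free (a two-stage pipeline)."""
    comm_done = 0.0     # time at which the current tile's data has arrived
    compute_done = 0.0  # time at which the compute engine becomes free
    for c, m in zip(comm, compute):
        comm_done += c
        compute_done = max(compute_done, comm_done) + m
    return compute_done


def compute_utilization(compute, makespan):
    """Fraction of the total time the compute engine is busy."""
    return sum(compute) / makespan


# Hypothetical timings: 4 tiles, 1 time unit each for transfer and for compute.
comm = [1.0] * 4
compute = [1.0] * 4

t_serial = serial_makespan(comm, compute)        # 8.0
t_overlap = overlapped_makespan(comm, compute)   # 5.0
print(compute_utilization(compute, t_serial))    # 0.5
print(compute_utilization(compute, t_overlap))   # 0.8
```

In this toy setting only the first tile's transfer remains exposed, so the compute engine's utilization rises from 50% to 80%. The real framework additionally has to respect dependencies between tiles and queue limits on the hardware, which this sketch ignores.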
Testing shows that under a 64-node balanced routing setup, the MoE-FFN module responsible for expert computation cuts latency by up to about 36%, which corresponds to a 1.49x to 1.58x speedup, or as much as 58% higher throughput. End-to-end single-step training performance also improves by 8% to 9%. This suggests that the actual energy efficiency of Ascend hardware depends not only on its specifications but, critically, on whether the compiler and runtime can schedule the AIC and AIV cores effectively. (Source: BlockBeats)
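The relationship between a latency reduction and a speedup can be checked with a short calculation. The sketch below assumes the standard definition speedup = old latency / new latency; the helper names are illustrative and not taken from the source.

```python
def speedup_from_latency_reduction(reduction):
    """Speedup factor when latency falls by `reduction` (a fraction in [0, 1))."""
    return 1.0 / (1.0 - reduction)


def latency_reduction_from_speedup(speedup):
    """Fractional latency reduction implied by a given speedup factor."""
    return 1.0 - 1.0 / speedup


# A flat 36% latency reduction corresponds to a 1.5625x speedup.
print(round(speedup_from_latency_reduction(0.36), 4))

# The reported 1.49x to 1.58x range implies roughly 32.9% to 36.7% lower latency.
print(round(latency_reduction_from_speedup(1.49), 3))
print(round(latency_reduction_from_speedup(1.58), 3))
```

Under this definition, the 36% figure and the 1.58x (58%) figure agree to within rounding: a 1.58x speedup is a latency reduction of about 36.7%.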