Huawei and USTC collaborate to break NVIDIA's monopoly, speeding up expert computation on the Ascend A3 by up to 58%

Huawei and USTC have developed the HyperParallel-MoE framework to improve the Ascend A3 chip's performance in large-model training. The framework schedules work at the tile level across the chip's hardware queues, allowing the AIC and AIV cores to run in parallel. Tests on a 64-node cluster showed expert computation running up to 58% faster and end-to-end training speed improving by 8% to 9%. The results suggest that better compiler and runtime scheduling could make domestic chips more practical for large-model training.
ME AI News: According to monitoring by Beating, training large models on China's domestic Ascend chips has become a key route to autonomous and controllable AI computing power as large-scale MoE architectures evolve. However, most mainstream large-model frameworks are built on NVIDIA's CUDA ecosystem, and porting them directly to the Ascend platform often runs into unbalanced hardware queue scheduling and low compute utilization. The University of Science and Technology of China, Huawei, and Peking University have jointly introduced HyperParallel-MoE, a compilation and scheduling framework that controls the Ascend A3's hardware queues at tile granularity, aiming to break through the energy-efficiency bottlenecks of parallel scheduling on heterogeneous compute resources.
The Ascend A3 features two core types: AIC handles matrix multiplication, while AIV handles vector computation and communication. Under traditional operator-by-operator serial scheduling, the two core types can only take turns, leaving one idle while the other runs. Measurements of a 671B-parameter DeepSeek-style model running on a 256-node cluster showed AIC utilization of only 67%, with 39% of expert routing communication latency falling on the critical computation path.
HyperParallel-MoE introduces three key innovations. First, it designs an AIV-driven one-sided write primitive that triggers computation as soon as a data tile arrives, instead of waiting for an entire batch. Second, it introduces dependency-aware tile task generation, unifying communication and computation operators under a single abstraction. Third, it uses a static scheduler to pre-generate task sequences that drive both core types in parallel within a single kernel, sharing intermediate results through the high-speed L2 cache to cut the latency of frequent reads and writes to slower HBM memory.
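The gap between operator-level serial scheduling and tile-level overlap can be illustrated with a toy timing model. The sketch below is not the HyperParallel-MoE scheduler: the `Tile` class, both function names, and the per-tile timings are hypothetical, chosen only to show why starting matrix work as soon as each tile arrives shortens the critical path. It assumes every tile needs an AIV stage (vector work or communication) followed by an AIC stage (matrix multiplication).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    """Hypothetical unit of work: AIV stage first, then AIC stage."""
    aiv_time: float  # vector/communication time for this tile (arbitrary units)
    aic_time: float  # matrix-multiplication time for this tile (arbitrary units)


def serial_operator_makespan(tiles):
    """Operator-at-a-time: the AIC starts only after the AIV finished every tile."""
    return sum(t.aiv_time for t in tiles) + sum(t.aic_time for t in tiles)


def tile_pipelined_makespan(tiles):
    """Tile-at-a-time: the AIC starts a tile once that tile's AIV stage is done."""
    aiv_free = 0.0  # time at which the AIV finishes its current tile
    aic_free = 0.0  # time at which the AIC finishes its current tile
    for t in tiles:
        aiv_free += t.aiv_time            # AIV processes tiles back to back
        start = max(aiv_free, aic_free)   # AIC needs the tile and must be idle
        aic_free = start + t.aic_time
    return aic_free


def aic_utilization(tiles, makespan):
    """Fraction of the makespan during which the AIC is busy."""
    return sum(t.aic_time for t in tiles) / makespan


# Made-up workload: 8 identical tiles, 1 unit of AIV work and 3 of AIC work each.
tiles = [Tile(aiv_time=1.0, aic_time=3.0) for _ in range(8)]
serial = serial_operator_makespan(tiles)
pipelined = tile_pipelined_makespan(tiles)
print(f"serial:    {serial:.1f} units, AIC busy {aic_utilization(tiles, serial):.0%}")
print(f"pipelined: {pipelined:.1f} units, AIC busy {aic_utilization(tiles, pipelined):.0%}")
```

In this toy workload the pipelined schedule hides all but the first tile's AIV time behind matrix work. A real scheduler also has to respect cross-tile dependencies and cache capacity, which this model ignores.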
Testing shows that in a 64-node setup with balanced routing, the MoE-FFN module responsible for expert computation sees its latency fall by about 36%, equivalent to a throughput gain of up to 58% (a 1.49x to 1.58x speedup). End-to-end single-step training performance also improves by 8% to 9%. The results indicate that the actual energy efficiency of Ascend hardware depends not only on its specifications but, critically, on whether the compiler and runtime can schedule the AIC and AIV cores effectively. (Source: BlockBeats)
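The latency and speedup figures are two views of the same measurement: a fractional latency reduction r corresponds to a speedup of 1 / (1 - r). The short sketch below makes the conversion explicit. The function names are illustrative, and the last helper applies Amdahl's law as a back-of-envelope check only. It assumes the rest of the training step is unchanged and that the module-level and end-to-end figures come from the same configuration, which the article does not state.

```python
def speedup_from_latency_reduction(reduction):
    """Speedup factor implied by a fractional latency reduction (0 <= reduction < 1)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def latency_reduction_from_speedup(speedup):
    """Fractional latency reduction implied by a speedup factor (speedup >= 1)."""
    if speedup < 1.0:
        raise ValueError("speedup must be >= 1")
    return 1.0 - 1.0 / speedup


def implied_module_share(end_to_end_speedup, module_speedup):
    """Amdahl's law solved for the share of step time spent in the accelerated module.

    Assumption: everything outside the module runs exactly as fast as before.
    """
    return (1.0 - 1.0 / end_to_end_speedup) / (1.0 - 1.0 / module_speedup)


# A 36% latency reduction expressed as a speedup factor.
print(f"{speedup_from_latency_reduction(0.36):.4f}x")

# The reported 1.49x to 1.58x range expressed as latency reductions.
for s in (1.49, 1.58):
    print(f"{s}x -> {latency_reduction_from_speedup(s):.1%} lower latency")

# Rough share of a training step the module would occupy under the assumptions above.
for e2e, s in ((1.08, 1.49), (1.09, 1.58)):
    print(f"end-to-end {e2e}x, module {s}x -> share {implied_module_share(e2e, s):.1%}")
```

The conversion shows why a module-level gain of roughly 1.5x translates into a single-digit end-to-end improvement: only the time spent inside the accelerated module shrinks.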