SGLang and AMD collaborate to optimize DeepSeek-R1 inference on the MI355X GPU.

SGLang, working with the AMD team, has applied a series of full-stack optimizations that give AMD Instinct™ MI355X GPUs a highly competitive total cost of ownership (TCO) for DeepSeek-R1 large-model inference. At an interactivity level of 129 tok/s/user, the cost is $0.169 per million tokens, which is 5% lower than the NVIDIA B200 (Dynamo TRT-LLM) solution and 40% lower than the B200 (SGLang) solution. On throughput, a 24-GPU AMD deployment reaches 2,436 tok/s/GPU, 1.25 times higher per GPU than the B200 SGLang solution running on 48 GPUs. Key optimizations include MoRI mixed FP4/FP8 quantization for all-to-all communication, the MoRI-IO KV cache backend, batch overlapping with SDMA, Specv2 MTP on ROCm, and CPU streaming optimizations. (Source: AiHot)
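The percentages above can be turned back into rough comparison figures. The sketch below is only back-of-the-envelope arithmetic on the numbers quoted in this summary: the B200 values it derives are implied estimates rather than published measurements, and it assumes that "1.25 times higher" means a per-GPU throughput ratio of 1.25. The helper name `implied_baseline_cost` is made up for illustration.

```python
# Figures quoted in the summary (MI355X, DeepSeek-R1, 129 tok/s/user).
MI355X_COST_PER_MTOK = 0.169   # USD per million tokens
MI355X_TOK_S_PER_GPU = 2436    # tokens per second per GPU
MI355X_GPU_COUNT = 24


def implied_baseline_cost(cost: float, pct_lower: float) -> float:
    """Return the baseline cost, given that `cost` is `pct_lower` percent below it."""
    return cost / (1 - pct_lower / 100)


# Implied (not reported) B200 costs, derived from the 5% and 40% gaps.
b200_trtllm_cost = implied_baseline_cost(MI355X_COST_PER_MTOK, 5)
b200_sglang_cost = implied_baseline_cost(MI355X_COST_PER_MTOK, 40)

# Aggregate throughput of the 24-GPU MI355X deployment.
mi355x_aggregate_tok_s = MI355X_TOK_S_PER_GPU * MI355X_GPU_COUNT

# Implied B200 (SGLang) per-GPU throughput, ASSUMING a plain 1.25x ratio.
b200_sglang_tok_s_per_gpu = MI355X_TOK_S_PER_GPU / 1.25

if __name__ == "__main__":
    print(f"Implied B200 (Dynamo TRT-LLM) cost: ${b200_trtllm_cost:.3f} per million tokens")
    print(f"Implied B200 (SGLang) cost:         ${b200_sglang_cost:.3f} per million tokens")
    print(f"MI355X aggregate throughput:        {mi355x_aggregate_tok_s:,} tok/s")
    print(f"Implied B200 (SGLang) throughput:   {b200_sglang_tok_s_per_gpu:.1f} tok/s/GPU")
```

Under these assumptions, the implied B200 costs come out to roughly $0.178 and $0.282 per million tokens, and the 24-GPU MI355X deployment delivers 58,464 tok/s in aggregate.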