Moonshot AI open-sources FlashKDA, boosting Kimi Linear inference speed by 1.7–2.2x

MetaEra reports that on April 22 (UTC+8), according to monitoring by BlockBeats, Moonshot AI open-sourced FlashKDA on GitHub: a toolset, released under the MIT license, that accelerates model inference specifically on NVIDIA Hopper-series GPUs (such as the H100 and H20). It is tailored for KDA, a novel attention mechanism Moonshot AI introduced in last year's Kimi Linear paper.

Traditional attention mechanisms suffer from computational cost that grows quadratically with sequence length, whereas linear attention reduces this to linear scaling; KDA is an improved variant along this path. The Kimi Linear architecture alternates three layers of KDA with one layer of traditional attention.

A previous implementation of KDA was written in Triton and is available in the open-source library flash-linear-attention (FLA). FlashKDA has been rewritten using NVIDIA's low-level GPU library CUTLASS to fully exploit Hopper hardware. Official benchmarks on the H20 show FlashKDA delivering 1.7x to 2.2x faster forward inference than the Triton version, with especially large speedups for variable-length inputs and batched processing. Note, however, that the comparison covers only Moonshot's own Triton implementation, not other linear attention alternatives.

This release includes only the forward pass: it supports model inference but not training, which still requires the original Triton version. Requirements are Hopper or newer GPUs (SM90 architecture or higher), CUDA 12.9+, and PyTorch 2.4+. FlashKDA has also been merged as a new backend into the upstream FLA repository (PR #852), so existing users can switch with a single configuration change. (Source: BlockBeats)
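To make the scaling contrast concrete, here is a minimal PyTorch sketch of the general linear-attention trick (not KDA itself, whose recurrence is more involved): reassociating the matrix products replaces the n-by-n score matrix with a small d-by-d state.

```python
import torch

n, d = 4096, 64                  # sequence length, head dimension
q = torch.randn(n, d)
k = torch.randn(n, d)
v = torch.randn(n, d)

# Quadratic attention: materializes an (n, n) score matrix -> O(n^2 * d).
scores = q @ k.T                 # (n, n)
out_quadratic = scores.softmax(-1) @ v

# Linear attention reorders the products: q @ (k.T @ v) instead of
# (q @ k.T) @ v. The k.T @ v state is only (d, d), so the cost is
# O(n * d^2). Real linear attention replaces softmax with a feature
# map on q and k; this sketch keeps only the cost structure.
kv = k.T @ v                     # (d, d), independent of n
out_linear = q @ kv
```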
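The 3:1 interleaving can be pictured with a short sketch. `KDALayer` and `FullAttentionLayer` below are hypothetical stand-ins, not Moonshot's actual modules; only the layer-ordering pattern reflects the article.

```python
import torch
import torch.nn as nn

class KDALayer(nn.Module):
    """Hypothetical placeholder for a KDA (linear attention) layer."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
    def forward(self, x):
        return self.proj(x)

class FullAttentionLayer(nn.Module):
    """Hypothetical placeholder for a traditional attention layer."""
    def __init__(self, d_model):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out

def build_stack(num_layers, d_model):
    # Every 4th layer is full attention; the other three are KDA,
    # matching the 3:1 ratio reported for Kimi Linear.
    return nn.ModuleList(
        FullAttentionLayer(d_model) if (i + 1) % 4 == 0 else KDALayer(d_model)
        for i in range(num_layers)
    )

x = torch.randn(2, 16, 64)       # (batch, seq, d_model)
for layer in build_stack(8, 64):
    x = layer(x)
```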
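Since the requirements are specific (SM90+, CUDA 12.9+, PyTorch 2.4+), a quick environment check is easy to write with standard PyTorch introspection; the checks below are an illustrative assumption, not part of FlashKDA.

```python
import torch

# Assumes a CUDA build of PyTorch with a visible GPU.
assert torch.cuda.is_available(), "no CUDA device visible"

# Hopper is compute capability 9.0 (SM90); H100 and H20 both report (9, 0).
major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (9, 0), "FlashKDA requires SM90 (Hopper) or newer"

# CUDA toolkit version PyTorch was built against, e.g. "12.9".
cuda_ver = tuple(int(x) for x in torch.version.cuda.split("."))
assert cuda_ver >= (12, 9), "FlashKDA requires CUDA 12.9+"

# PyTorch version string, e.g. "2.4.1+cu124" -> (2, 4).
torch_ver = tuple(int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
assert torch_ver >= (2, 4), "FlashKDA requires PyTorch 2.4+"
```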
On April 22 (UTC+8), Moonshot AI open-sourced FlashKDA under the MIT license, a tool optimized for NVIDIA Hopper GPUs. It accelerates Kimi Linear forward inference by 1.7–2.2x over the prior Triton implementation, leveraging CUTLASS to improve performance on the H20. It handles variable input lengths and batched processing well, though it currently supports inference only; training still requires the Triton version. Users need Hopper or newer GPUs (SM90+), CUDA 12.9+, and PyTorch 2.4+. The tool has also been merged as a new backend into the flash-linear-attention repository.