Moonshot AI open-sources FlashKDA, boosting Kimi Linear inference speed by 1.7–2.2x

MetaEra reports that on April 22 (UTC+8), according to monitoring by BlockBeats, Moonshot AI open-sourced FlashKDA on GitHub: a toolset, released under the MIT license, that accelerates model inference specifically on NVIDIA Hopper-series GPUs (such as the H100 and H20). It is tailored for KDA, a novel attention mechanism Moonshot AI introduced in last year's Kimi Linear paper.

Traditional attention mechanisms suffer from computational cost that grows quadratically with sequence length, whereas linear attention reduces this to linear scaling; KDA is an improved variant along this path. The Kimi Linear architecture alternates three layers of KDA with one layer of traditional attention.

A previous implementation of KDA was written in Triton and is available in the open-source library flash-linear-attention (FLA). FlashKDA has been rewritten using NVIDIA's low-level GPU library CUTLASS to fully exploit Hopper hardware. Official benchmarks on the H20 show FlashKDA delivering 1.7x to 2.2x faster forward inference than the Triton version, with especially large speedups for variable-length inputs and batched processing. Note, however, that the comparison covers only Moonshot's own Triton implementation, not other linear attention alternatives.

This release includes only the forward pass: it supports model inference but not training, which still requires the original Triton version. Requirements are Hopper or newer GPUs (SM90 architecture or higher), CUDA 12.9+, and PyTorch 2.4+. FlashKDA has also been merged as a new backend into the upstream FLA repository (PR #852), so existing users can switch with a single configuration change. (Source: BlockBeats)
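To make the scaling contrast concrete, here is a minimal PyTorch sketch of the general linear-attention trick (not KDA itself, whose recurrence is more involved): reassociating the matrix products replaces the n-by-n score matrix with a small d-by-d state.

```python
import torch

n, d = 4096, 64                  # sequence length, head dimension
q = torch.randn(n, d)
k = torch.randn(n, d)
v = torch.randn(n, d)

# Quadratic attention: materializes an (n, n) score matrix -> O(n^2 * d).
scores = q @ k.T                 # (n, n)
out_quadratic = scores.softmax(-1) @ v

# Linear attention reorders the products: q @ (k.T @ v) instead of
# (q @ k.T) @ v. The k.T @ v state is only (d, d), so the cost is
# O(n * d^2). Real linear attention replaces softmax with a feature
# map on q and k; this sketch keeps only the cost structure.
kv = k.T @ v                     # (d, d), independent of n
out_linear = q @ kv
```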
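The 3:1 interleaving can be pictured with a short sketch. `KDALayer` and `FullAttentionLayer` below are hypothetical stand-ins, not Moonshot's actual modules; only the layer-ordering pattern reflects the article.

```python
import torch
import torch.nn as nn

class KDALayer(nn.Module):
    """Hypothetical placeholder for a KDA (linear attention) layer."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
    def forward(self, x):
        return self.proj(x)

class FullAttentionLayer(nn.Module):
    """Hypothetical placeholder for a traditional attention layer."""
    def __init__(self, d_model):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out

def build_stack(num_layers, d_model):
    # Every 4th layer is full attention; the other three are KDA,
    # matching the 3:1 ratio reported for Kimi Linear.
    return nn.ModuleList(
        FullAttentionLayer(d_model) if (i + 1) % 4 == 0 else KDALayer(d_model)
        for i in range(num_layers)
    )

x = torch.randn(2, 16, 64)       # (batch, seq, d_model)
for layer in build_stack(8, 64):
    x = layer(x)
```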
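Since the requirements are specific (SM90+, CUDA 12.9+, PyTorch 2.4+), a quick environment check is easy to write with standard PyTorch introspection; the checks below are an illustrative assumption, not part of FlashKDA.

```python
import torch

# Assumes a CUDA build of PyTorch with a visible GPU.
assert torch.cuda.is_available(), "no CUDA device visible"

# Hopper is compute capability 9.0 (SM90); H100 and H20 both report (9, 0).
major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (9, 0), "FlashKDA requires SM90 (Hopper) or newer"

# CUDA toolkit version PyTorch was built against, e.g. "12.9".
cuda_ver = tuple(int(x) for x in torch.version.cuda.split("."))
assert cuda_ver >= (12, 9), "FlashKDA requires CUDA 12.9+"

# PyTorch version string, e.g. "2.4.1+cu124" -> (2, 4).
torch_ver = tuple(int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
assert torch_ver >= (2, 4), "FlashKDA requires PyTorch 2.4+"
```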
On April 22 (UTC+8), Moonshot AI open-sourced FlashKDA under the MIT license, a tool optimized for NVIDIA Hopper GPUs. It accelerates Kimi Linear forward inference by 1.7–2.2x over the prior Triton implementation, leveraging CUTLASS to improve performance on the H20. It handles variable input lengths and batched processing well, though it currently supports inference only; training still requires the Triton version. Users need Hopper or newer GPUs (SM90+), CUDA 12.9+, and PyTorch 2.4+. The tool has also been merged as a new backend into the flash-linear-attention repository.