Nous Research open-sources Lighthouse Attention, achieving a 17x speed boost on B200

Summary

On-chain news outlet MetaEra reported on May 16 (UTC+8) that Nous Research has open-sourced its Lighthouse Attention mechanism for long-context pre-training. The method computes roughly 17x faster on a single B200 GPU for 512K-length text and delivers 1.4–1.7x faster end-to-end training at a 98K context length. It uses a two-stage process that removes the need for low-level kernel code or additional training objectives. In tests, a 530M-parameter model trained on 50B tokens matched or outperformed baselines trained entirely with traditional attention while significantly reducing training time. Crypto news platforms are highlighting the efficiency gains for developers and researchers.

AIMPACT News, May 16 (UTC+8): According to BlockBeats monitoring, Nous Research has open-sourced Lighthouse Attention, a long-context pretraining mechanism. When processing 512K-length text on a single B200 GPU, the approach computes roughly 17x faster than standard attention, and it delivers 1.4x to 1.7x end-to-end training acceleration at a 98K context length.

Traditional attention must compute pairwise relationships between all tokens, so computational cost grows quadratically as text length increases (doubling the text length roughly quadruples the cost). Lighthouse Attention instead takes a two-stage approach: it first rapidly scans compressed, multi-level summaries of the text, scoring and selecting the key segments to form a much shorter sequence, and then feeds that sequence directly to the existing efficient FlashAttention operator. Because the selection logic is fully decoupled from the core kernel, developers do not need to write low-level code or introduce additional training objectives.

Earlier acceleration methods built on similar ideas often had a side effect: once models grew accustomed to skipping, they lost the ability to process text carefully word by word. To avoid this pitfall, the research team trained the model primarily in the accelerated mode and only briefly switched back to full attention computation at the very end of training for fine-tuning. In tests with a 530-million-parameter model trained on 50 billion tokens, the approach significantly reduced training time while the resulting model matched or even surpassed baselines trained entirely with traditional attention. (Source: BlockBeats)
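
As a rough illustration of the two-stage idea described above, here is a minimal PyTorch sketch: it mean-pools the keys into block summaries, scores those summaries against a pooled query, keeps the top-scoring blocks to form a shorter key/value sequence, and hands that sequence to PyTorch's scaled_dot_product_attention, which can dispatch to a FlashAttention kernel. The function name, block size, scoring rule, and the omission of causal masking are all simplifying assumptions for illustration; the open-sourced Lighthouse Attention implementation may differ substantially.

    import torch
    import torch.nn.functional as F

    def block_select_attention(q, k, v, block_size=128, top_k_blocks=32):
        # Hypothetical sketch only; not the Nous Research implementation.
        # q, k, v: (batch, heads, seq_len, head_dim); seq_len is assumed to be
        # a multiple of block_size to keep the example short.
        bsz, n_heads, seq_len, head_dim = q.shape
        n_blocks = seq_len // block_size

        # Stage 1: compress keys into per-block summaries, score them against a
        # pooled query, and keep the highest-scoring blocks.
        k_blocks = k.reshape(bsz, n_heads, n_blocks, block_size, head_dim)
        block_summary = k_blocks.mean(dim=3)                               # (b, h, n_blocks, d)
        q_summary = q.mean(dim=2, keepdim=True)                            # (b, h, 1, d)
        scores = (q_summary @ block_summary.transpose(-1, -2)).squeeze(2)  # (b, h, n_blocks)
        top = scores.topk(min(top_k_blocks, n_blocks), dim=-1).indices     # (b, h, k)

        # Gather the selected blocks into a shorter key/value sequence.
        idx = top.unsqueeze(-1) * block_size + torch.arange(block_size, device=q.device)
        idx = idx.flatten(2).unsqueeze(-1).expand(-1, -1, -1, head_dim)    # (b, h, k*block, d)
        k_sel = k.gather(2, idx)
        v_sel = v.gather(2, idx)

        # Stage 2: run the shortened sequence through an existing fused kernel
        # (scaled_dot_product_attention can use FlashAttention under the hood).
        return F.scaled_dot_product_attention(q, k_sel, v_sel)

    # Toy usage with made-up dimensions:
    q = k = v = torch.randn(1, 8, 8192, 64)
    out = block_select_attention(q, k, v)   # (1, 8, 8192, 64)

Keeping the whole selection step in ordinary tensor operations like this is what allows the scan-and-select logic to stay decoupled from the attention kernel itself.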
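
The brief switch back to full attention at the end of training amounts to a simple schedule. The sketch below uses made-up step counts and a hypothetical selective_attention flag purely to show the shape of such a schedule; the actual recipe and hyperparameters are not specified beyond what the report states.

    # Hypothetical schedule: train mostly in the accelerated (selective) mode,
    # then finish with a short full-attention phase. All numbers are assumptions.
    total_steps = 100_000
    full_attention_tail = 2_000   # assumed length of the final full-attention phase

    for step in range(total_steps):
        selective = step < total_steps - full_attention_tail
        # A real loop would run the forward/backward pass here, e.g.:
        # loss = model(batch, selective_attention=selective)
        # loss.backward(); optimizer.step(); optimizer.zero_grad()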
