Ramp Labs Proposes New Multi-Agent Memory Sharing Solution, Token Usage Reduced by Up to 65%

Summary

AI infrastructure company Ramp Labs has proposed "Latent Briefing," a multi-agent memory-sharing method that compresses large-model KV caches to cut token usage by up to 65% without sacrificing accuracy. On the LongBench v2 benchmark, worker-model token consumption fell by 65%, with a 49% median reduction for medium-length documents; accuracy improved by about 3 percentage points, and each compression added only about 1.7 seconds of overhead, roughly 20 times faster than the original algorithm. The experiments used Claude Sonnet 4 as the orchestrator and Qwen3-14B as the worker model.

ME News reports that on April 11 (UTC+8), AI infrastructure company Ramp Labs released research titled "Latent Briefing," which enables efficient memory sharing among multi-agent systems by directly compressing large-model KV caches, significantly reducing token consumption without sacrificing accuracy. In mainstream multi-agent architectures, an orchestrator decomposes tasks and repeatedly invokes worker models; as the reasoning chain extends, token usage grows rapidly. The core idea of Latent Briefing is to leverage the model's own attention to identify the most critical parts of the context and discard redundant information directly at the representation layer, rather than relying on slow LLM summarization or unstable RAG retrieval.

On the LongBench v2 benchmark, the method performed strongly: worker-model token consumption decreased by 65%, with a median token saving of 49% for medium-length documents (32k–100k tokens); overall accuracy improved by approximately 3 percentage points over the baseline; and each compression added only about 1.7 seconds of overhead, roughly 20 times faster than the original algorithm.

The experiments used Claude Sonnet 4 as the orchestrator and Qwen3-14B as the worker model, covering diverse document types including academic papers, legal documents, novels, and government reports. The study also found that the optimal compression threshold varies with task difficulty and document length: aggressive compression suits complex tasks, where it filters out speculative reasoning noise, while lighter compression is preferable for long documents, where key information is more dispersed. (Source: BlockBeats)
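The attention-based pruning described above can be sketched roughly as follows. This is an illustrative approximation, not Ramp Labs' published implementation: it assumes each cached token carries a scalar importance score (e.g., attention mass accumulated across heads and layers), and the function name `compress_kv_cache` and parameter `keep_ratio` are hypothetical.

```python
import numpy as np

def compress_kv_cache(keys, values, attn_scores, keep_ratio=0.35):
    """Retain only the most-attended entries of a KV cache.

    keys, values:  (seq_len, d) arrays for one attention head's cache.
    attn_scores:   (seq_len,) importance score per cached token
                   (assumed here to be accumulated attention mass).
    keep_ratio:    fraction of entries to keep; 0.35 corresponds
                   roughly to a 65% token reduction.
    """
    seq_len = keys.shape[0]
    n_keep = max(1, int(round(seq_len * keep_ratio)))
    # Pick the n_keep highest-scoring tokens, then restore their
    # original order so the retained context stays positionally coherent.
    kept = np.sort(np.argsort(attn_scores)[-n_keep:])
    return keys[kept], values[kept], kept

# Example: compress a 100-token cache down to 35 entries.
rng = np.random.default_rng(0)
k = rng.normal(size=(100, 64))
v = rng.normal(size=(100, 64))
scores = rng.random(100)
ck, cv, idx = compress_kv_cache(k, v, scores, keep_ratio=0.35)
```

Per the article's findings on thresholds, `keep_ratio` would not be a fixed constant in practice: a lower (more aggressive) ratio suits complex tasks, while a higher ratio preserves the dispersed key information in very long documents.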

Disclaimer: The information on this page may have been obtained from third parties and does not necessarily reflect the views or opinions of KuCoin. This content is provided for general informational purposes only, without any representation or warranty of any kind, nor shall it be construed as financial or investment advice. KuCoin shall not be liable for any errors or omissions, or for any outcomes resulting from the use of this information. Investments in digital assets can be risky. Please carefully evaluate the risks of a product and your risk tolerance based on your own financial circumstances. For more information, please refer to our Terms of Use and Risk Disclosure.