Ramp Labs Proposes New Multi-Agent Memory Sharing Solution, Token Usage Reduced by Up to 65%

Summary

AI infrastructure company Ramp Labs has proposed "Latent Briefing," a multi-agent memory-sharing method that compresses large-model KV caches to cut token usage by up to 65% without sacrificing accuracy. On the LongBench v2 benchmark, worker-model token consumption fell by 65%, with a 49% median reduction for medium-length documents; accuracy improved by about 3 percentage points, and each compression added only about 1.7 seconds of overhead, roughly 20 times faster than the original algorithm. The experiments used Claude Sonnet 4 as the orchestrator and Qwen3-14B as the worker model.

ME News reports that on April 11 (UTC+8), AI infrastructure company Ramp Labs released research titled "Latent Briefing," which enables efficient memory sharing among multi-agent systems by directly compressing large-model KV caches, significantly reducing token consumption without sacrificing accuracy. In mainstream multi-agent architectures, an orchestrator decomposes tasks and repeatedly invokes worker models; as the reasoning chain extends, token usage grows rapidly. The core idea of Latent Briefing is to leverage the model's own attention to identify the most critical parts of the context and discard redundant information directly at the representation layer, rather than relying on slow LLM summarization or unstable RAG retrieval.

On the LongBench v2 benchmark, the method performed strongly: worker-model token consumption decreased by 65%, with a median token saving of 49% for medium-length documents (32k–100k tokens); overall accuracy improved by approximately 3 percentage points over the baseline; and each compression added only about 1.7 seconds of overhead, roughly 20 times faster than the original algorithm.

The experiments used Claude Sonnet 4 as the orchestrator and Qwen3-14B as the worker model, covering diverse document types including academic papers, legal documents, novels, and government reports. The study also found that the optimal compression threshold varies with task difficulty and document length: aggressive compression suits complex tasks, where it filters out speculative reasoning noise, while lighter compression is preferable for long documents, where key information is more dispersed. (Source: BlockBeats)
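The attention-based pruning described above can be sketched roughly as follows. This is an illustrative approximation, not Ramp Labs' published implementation: it assumes each cached token carries a scalar importance score (e.g., attention mass accumulated across heads and layers), and the function name `compress_kv_cache` and parameter `keep_ratio` are hypothetical.

```python
import numpy as np

def compress_kv_cache(keys, values, attn_scores, keep_ratio=0.35):
    """Retain only the most-attended entries of a KV cache.

    keys, values:  (seq_len, d) arrays for one attention head's cache.
    attn_scores:   (seq_len,) importance score per cached token
                   (assumed here to be accumulated attention mass).
    keep_ratio:    fraction of entries to keep; 0.35 corresponds
                   roughly to a 65% token reduction.
    """
    seq_len = keys.shape[0]
    n_keep = max(1, int(round(seq_len * keep_ratio)))
    # Pick the n_keep highest-scoring tokens, then restore their
    # original order so the retained context stays positionally coherent.
    kept = np.sort(np.argsort(attn_scores)[-n_keep:])
    return keys[kept], values[kept], kept

# Example: compress a 100-token cache down to 35 entries.
rng = np.random.default_rng(0)
k = rng.normal(size=(100, 64))
v = rng.normal(size=(100, 64))
scores = rng.random(100)
ck, cv, idx = compress_kv_cache(k, v, scores, keep_ratio=0.35)
```

Per the article's findings on thresholds, `keep_ratio` would not be a fixed constant in practice: a lower (more aggressive) ratio suits complex tasks, while a higher ratio preserves the dispersed key information in very long documents.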

Disclaimer: The information on this page may have been obtained from third parties and does not necessarily reflect the views or opinions of KuCoin. This content is provided for general informational purposes only, without any representation or warranty of any kind, nor shall it be construed as financial or investment advice. KuCoin shall not be liable for any errors or omissions, or for any outcomes resulting from the use of this information. Investments in digital assets can be risky. Please carefully evaluate the risks of a product and your risk tolerance based on your own financial circumstances. For more information, please refer to our Terms of Use and Risk Disclosure.