According to monitoring by Beating, Dylan Zhang, a Ph.D. student in Computer Science at the University of Illinois, ran a series of agent memory experiments and reached an unexpected conclusion: repeatedly summarizing experience can make a model remember worse over time.

The most striking result came from ARC-AGI. The researchers selected 19 questions that GPT-5.4 could already answer correctly without memory, then gave the model the correct solutions and asked it to write “experience summaries” while viewing them. In principle this should have worked like an open-book exam, yet after multiple rounds of memory compression the same model’s accuracy dropped from 100% to 54%. The original trajectories were correct; the problem arose when the model rewrote those correct paths into generalized experience.

This degradation is not an isolated case. In the WebShop online shopping task, the AWM memory method scored 0.64 with 8 expert trajectories, but fell to 0.20 when the number of trajectories increased to 128, exactly the no-memory baseline. In other words, the more memory accumulates, the more its benefit erases itself.

The issue is not “too little experience” but “too much summarization.” The experiences a large model writes are not objective logs; each summary is a re-generation. Over successive rewrites, specific premises are stripped away, rules from different tasks are blended together, and previously actionable details turn into vague platitudes such as “take the most direct action” or “use the correct tool,” which sound correct but are useless in practice. In one extreme example from the paper, 50 structured memories were merged into a single one, compressing the differences across multiple tasks into one generic process; in the next evaluation, this cost the model 6 to 13 previously successful samples.

The authors offer a restrained recommendation: avoid forcing agents to write “error logs” after every round.
A more reliable approach is to preserve carefully selected original action trajectories and to abstract them into summaries only when truly necessary. In the experiments, the method that retained only original episodes and disabled abstraction matched or surpassed all tested compressed-memory approaches across multiple agent benchmarks. For developers, the takeaway is straightforward: showing a model what was actually done tends to be more effective than having it memorize a pile of abstract rules.
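To make the recommendation concrete, here is a minimal sketch of a memory that stores successful trajectories verbatim and retrieves the most similar ones for a new task, with no summarization step at all. It is an illustration under stated assumptions, not the method from the paper: the names `Episode` and `EpisodicMemory`, the WebShop-style action strings, and the word-overlap (Jaccard) retrieval are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """One recorded run of an agent (hypothetical structure)."""
    task: str            # task description exactly as it was given
    actions: list[str]   # raw action trajectory, stored verbatim
    success: bool        # whether the run solved the task


def _tokens(text: str) -> set[str]:
    # Crude tokenizer: lowercase, split on whitespace.
    return set(text.lower().split())


def _jaccard(a: set[str], b: set[str]) -> float:
    # Word-overlap similarity; a stand-in for a real retriever.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@dataclass
class EpisodicMemory:
    """Keeps original episodes only; abstraction is deliberately absent."""
    episodes: list[Episode] = field(default_factory=list)

    def add(self, episode: Episode) -> None:
        # Store successful runs as-is; nothing is rewritten or merged.
        if episode.success:
            self.episodes.append(episode)

    def retrieve(self, task: str, k: int = 2) -> list[Episode]:
        # Rank stored episodes by similarity of their task text to the query.
        query = _tokens(task)
        ranked = sorted(
            self.episodes,
            key=lambda e: _jaccard(query, _tokens(e.task)),
            reverse=True,
        )
        return ranked[:k]

    def build_context(self, task: str, k: int = 2) -> str:
        # Format retrieved trajectories as worked examples for a prompt.
        blocks = []
        for ep in self.retrieve(task, k):
            steps = "\n".join(f"  {i + 1}. {a}" for i, a in enumerate(ep.actions))
            blocks.append(f"Task: {ep.task}\nActions:\n{steps}")
        return "\n\n".join(blocks)
```

The point of the design is that every detail of a stored trajectory survives to the moment it is shown to the model; a production system would likely swap the word-overlap ranking for an embedding-based retriever and add a selection step for which episodes to keep.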
GPT-5.4 memory compression experiment shows accuracy drops to 54%
MarsBit






A recent study reported by MarsBit highlights how repeated memory compression can degrade AI model performance. Dylan Zhang, a Ph.D. student at the University of Illinois, found that GPT-5.4’s accuracy on the ARC-AGI benchmark fell from 100% to 54% after multiple rounds of compression. The degradation arises when the model rewrites correct solutions into generalized rules and loses critical details along the way. Similar results were observed in the WebShop task, where the AWM memory method performed worse as more expert trajectories were added. The findings suggest preserving raw action trajectories and limiting abstract summaries.