GPT-5.4 memory compression experiment shows accuracy drops to 54%

According to monitoring by Beating, Dylan Zhang, a Ph.D. student in Computer Science at the University of Illinois, conducted a series of agent memory experiments, yielding an unexpected conclusion: repeatedly summarizing experiences may cause the model to remember worse over time. The most striking results came from ARC-AGI: researchers selected 19 questions that GPT-5.4 could answer correctly without memory, then fed the model the correct solutions and asked it to write “experience summaries” while viewing them. Logically, this should have been like open-book studying; yet after multiple rounds of memory compression, the same model’s accuracy dropped from 100% to 54%. The original trajectories were correct—the problem arose when the model rewrote these correct paths into generalized experiences. Worse still, this memory degradation is not an isolated case. In the WebShop online shopping task, the AWM memory method scored 0.64 with 8 expert trajectories, but the score plummeted to 0.20 when the number of trajectories increased to 128—exactly returning to the no-memory baseline. In other words, the more memory is accumulated, the more its benefits are erased by itself. The issue is not “too little experience,” but “too much summarization.” The experiences written by large models are not objective logs; each summary is a re-generation. Eventually, specific premises are stripped away, rules from different tasks are blended together, and previously actionable details become vague platitudes like “take the most direct action” or “use the correct tool”—seemingly correct but practically useless. One extreme example in the paper shows that 50 structured memories were merged into a single one, compressing differences across multiple tasks into one generic process; in the next evaluation, this caused the model to lose 6 to 13 successful samples. The authors offer a restrained recommendation: avoid forcing agents to write “error logs” after every round. A more reliable approach is to preserve carefully selected original action trajectories and only abstract summaries when truly necessary. In experiments, the method that retained only original episodes while disabling abstraction matched or surpassed all tested compressed memory approaches across multiple agent benchmarks. For developers, this conclusion is straightforward: showing models what was actually done tends to be more effective than having them memorize a pile of abstract rules.