CUSP Benchmark Reveals AI Models Lack Scientific Forecasting Ability

ME AI news: According to monitoring by Beating, Stanford University, the University of Oxford, and the Allen Institute for Artificial Intelligence have jointly launched CUSP, a temporal benchmark that evaluates AI's ability to predict scientific progress. The evaluation systematically tested leading large models, including GPT-5.4, Claude Sonnet 4.5, and DeepSeek R1. The models perform well on mechanistic reasoning tasks, such as understanding existing technological pathways, but when asked whether a new discovery can actually be realized, their accuracy approaches random guessing. They also systematically place scientific breakthroughs later than they actually occurred.

Traditional AI evaluations are highly susceptible to information leakage, because models may simply recite scientific results that were already published in their training data. To measure genuine predictive capability, CUSP imposes temporal knowledge cutoff constraints. The research team compiled frontier advances across disciplines from journals such as Nature and Science. The benchmark covers 4,760 scientific milestones, from which 17,429 specific evaluation tasks were generated. Testing restricts the information a model can access according to the cutoff, and includes control experiments, such as web search limited to pre-cutoff material, to distinguish knowledge gaps from predictive gaps.

The results indicate that large models cannot yet provide reliable guidance in scientific exploration where no standard answer exists; at least for predicting scientific progress, current models are not capable of dependable foresight. On mechanistic reasoning tasks, models perform strongly: GPT-5.4, for example, achieved 81.9% accuracy in identifying plausible research directions from a set of options. On feasibility evaluation, which asks whether a claim can be realized, model accuracy ranges only from 45% to 52%.
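The temporal knowledge cutoff constraint described above can be illustrated with a minimal sketch. This is not CUSP's actual code: the `Milestone` record, its fields, and the `split_by_cutoff` helper are hypothetical names introduced here, and the benchmark's real schema and filtering rules are not given in this report. The idea is only that milestones published after a model's cutoff date are the ones that can test prediction rather than recall.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Milestone:
    # Hypothetical record; the benchmark's real schema is not described here.
    title: str
    published: date


def split_by_cutoff(milestones, cutoff):
    """Split milestones into (known, held_out) relative to a knowledge cutoff.

    Items published on or before the cutoff may already be in training data,
    so only the held-out items can measure forecasting instead of recitation.
    """
    known = [m for m in milestones if m.published <= cutoff]
    held_out = [m for m in milestones if m.published > cutoff]
    return known, held_out
```

Under this assumption, a model with a cutoff of 31 December 2023 would be scored only on milestones dated 2024 or later, while earlier milestones serve at most as background context.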
For breakthrough timelines, large models consistently overestimate how long a breakthrough will take: GPT-5.4's predictions run late by 14 months, Claude Sonnet 4.5's by 17 months, and GPT-4o's by as much as 26 months. LLaMA 3.3 shows the smallest timing error on this task, at +4 months. In generative solution design, GPT-5.4 received the highest score, 5.04/10, yet the technical pathways it generated failed to align with actual scientific trajectories. This suggests that models can produce plausible proposals but struggle to pinpoint the specific technological paths that later materialize. The gap in scientific prediction is even more pronounced for high-impact, groundbreaking discoveries. (Source: BlockBeats)