CUSP Benchmark Reveals AI Models Lack Scientific Forecasting Ability

Summary
AI and crypto news platforms are tracking a new benchmark called CUSP, developed by Stanford, Oxford, and the Allen Institute for AI. The test shows that leading models such as GPT-5.4 and Claude Sonnet 4.5 struggle to forecast scientific progress. Accuracy on feasibility judgments is near random, and timing estimates run months late. The CUSP dataset includes 4,760 milestones and 17,429 tasks. New token listings often rely on predictive models, but this study raises concerns about their reliability.
ME AI news: According to monitoring by Beating, Stanford University, the University of Oxford, and the Allen Institute for Artificial Intelligence have jointly launched CUSP, a temporal benchmark for evaluating how well AI can predict scientific progress. The evaluation systematically tested leading large models, including GPT-5.4, Claude Sonnet 4.5, and DeepSeek R1. The results show that large models perform well on mechanistic reasoning tasks, such as understanding existing technological pathways. However, when predicting whether a new discovery can actually be realized, their accuracy approaches random guessing. Large models also predict the timing of scientific breakthroughs systematically late.
Traditional AI evaluations are highly susceptible to information leakage, because models may simply recite scientific results that were already published in their training data. To measure true predictive capability, CUSP imposes temporal knowledge cutoff constraints. The research team compiled frontier advances across multiple disciplines from journals such as Nature and Science. The benchmark covers 4,760 scientific milestones, from which 17,429 specific evaluation tasks were generated. Testing restricts the information a model can access through cutoff conditions, and includes control experiments, such as web search limited to pre-cutoff content, to distinguish knowledge gaps from predictive gaps.
The results indicate that large models cannot provide reliable guidance in scientific exploration where there is no standard answer; at least for predicting scientific progress, current models are not capable of dependable foresight. On mechanistic reasoning tasks, models perform strongly: for example, GPT-5.4 achieved 81.9% accuracy in identifying plausible research directions from a set of options. However, on feasibility evaluation, which asks whether a claim can be realized, model accuracy ranges only from 45% to 52%.
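
The temporal knowledge cutoff described above can be illustrated with a minimal sketch. This is not CUSP's actual code: the `Milestone` class, the `eligible_for_model` function, and the sample dates are hypothetical, and the sketch only assumes that a milestone counts as a fair prediction target when it was published after the model's knowledge cutoff.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Milestone:
    # Hypothetical record type; the real CUSP schema is not described in the article.
    claim: str
    published: date


def eligible_for_model(milestones, knowledge_cutoff):
    """Keep only milestones published strictly after the model's cutoff.

    Anything published on or before the cutoff could have appeared in the
    training data, so answering it might be recall rather than prediction.
    """
    return [m for m in milestones if m.published > knowledge_cutoff]


# Illustrative data only; these are not real benchmark entries.
milestones = [
    Milestone("hypothetical result A", date(2023, 3, 1)),
    Milestone("hypothetical result B", date(2024, 9, 15)),
]
held_out = eligible_for_model(milestones, date(2024, 1, 1))
```

With a cutoff of January 1, 2024, only the second milestone remains as a prediction target; the first is excluded as a possible leak.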
For breakthrough timelines, large models consistently predict dates later than the actual ones: GPT-5.4 is late by 14 months, Claude Sonnet 4.5 by 17 months, and GPT-4o by as much as 26 months. On this task, LLaMA 3.3 shows the smallest timing error, at +4 months. In generative solution design, GPT-5.4 received the highest score, 5.04/10, yet the technical pathways it generated did not align with actual scientific trajectories. This suggests that models can produce plausible proposals but struggle to pinpoint the specific technological paths that later materialize. The prediction gap is even more pronounced for high-impact breakthroughs. (Source: BlockBeats)
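
A signed timing error like the month figures above could be computed roughly as follows. This is a sketch under assumptions, not the benchmark's metric: the function names and sample dates are hypothetical, and the sign convention (positive means the model predicted a later date than the actual one) is inferred from the article's "+4 months" figure.

```python
from datetime import date


def months_between(actual: date, predicted: date) -> int:
    """Signed whole-month difference; positive means predicted later than actual."""
    return (predicted.year - actual.year) * 12 + (predicted.month - actual.month)


def mean_signed_error(pairs):
    """Average signed error in months over (actual, predicted) date pairs."""
    errors = [months_between(actual, predicted) for actual, predicted in pairs]
    return sum(errors) / len(errors)


# Illustrative (actual, predicted) pairs; not real benchmark data.
pairs = [
    (date(2024, 1, 1), date(2025, 3, 1)),   # predicted 14 months late
    (date(2024, 6, 1), date(2024, 10, 1)),  # predicted 4 months late
]
bias = mean_signed_error(pairs)
```

Keeping the sign, rather than taking absolute values, is what makes a systematic delay visible: errors that all point the same way do not cancel out in the mean.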
Disclaimer: The information on this page may have been obtained from third parties and does not necessarily reflect the views or opinions of KuCoin. This content is provided for general informational purposes only, without any representation or warranty of any kind, nor shall it be construed as financial or investment advice. KuCoin shall not be liable for any errors or omissions, or for any outcomes resulting from the use of this information. Investments in digital assets can be risky. Please carefully evaluate the risks of a product and your risk tolerance based on your own financial circumstances. For more information, please refer to our Terms of Use and Risk Disclosure.