Harvard Mathematicians Test AI on Unpublished Research-Level Problems


Here’s a question that keeps researchers up at night: can AI actually do math, or is it just really good at pattern-matching against problems it’s already seen? A group of 30 mathematicians at Harvard decided to find out the hard way, by giving leading AI systems a test they couldn’t possibly have studied for.

The project, called “First Proof, Second Batch,” assembled its expert panel at Harvard’s Center of Mathematical Sciences and Applications in early June 2026. Their task was straightforward but unprecedented in scale: blind-grade AI-generated solutions to 10 original, unpublished research-level mathematics problems. The results, released on June 10, paint a picture that’s neither the doom scenario nor the triumph that partisans on either side might prefer.

The setup: why unpublished problems matter

The entire exercise hinges on one critical design choice. Every problem in the set was drawn from active, unpublished research. None of these questions had appeared in textbooks, on arXiv, or anywhere else they could have been scraped into an AI's training data.


The mathematicians behind the project aren’t exactly lightweights, either. The roster includes Mohammed Abouzaid from Stanford, Nikhil Srivastava from UC Berkeley, Rachel Ward from UT Austin, and Lauren Williams of Harvard.

What the AI actually got right, and wrong

Four leading AI systems participated in the evaluation, including models from OpenAI and Google. The headline number: the expert panel awarded passing grades on seven of the 10 problems across the four systems tested.

In preliminary trial runs, AI systems reportedly solved only two of the 10 problems. The gap between early performance and final results suggests that the models may have benefited from multiple attempts or different prompting strategies, though the blind grading protocol was designed to evaluate submitted solutions on their merits alone.

Building on earlier results

This second batch builds on an initial round of assessments conducted in February 2026. The First Proof project was designed from the start as an ongoing evaluation framework, not a one-time stunt. By running multiple rounds with fresh problems each time, the organizers can track whether AI capabilities are genuinely improving at research-level mathematics or simply plateauing after the initial rush of benchmark gains.

Standard math benchmarks, even difficult ones like competition-level problems, have increasingly fallen to frontier models. But competition problems, by definition, have known solutions and known solution methods. Research-level mathematics operates in a fundamentally different regime, where you often don’t know if a solution even exists, let alone what techniques might get you there.
