General-Purpose LLMs Outperform Dedicated Medical AI Tools in Nature Medicine Study

A study published June 12, 2026, in Nature Medicine found that general-purpose large language models consistently outperformed dedicated clinical AI products across standardized medical tasks. The general-purpose models were also preferred by the clinicians using them.

What the study actually tested

The researchers pitted three major general-purpose LLMs against purpose-built medical tools. On one side: OpenAI’s GPT-5.2, Google’s Gemini 3.1 Pro Preview, and Anthropic’s Claude Opus 4.6. On the other: dedicated clinical products like OpenEvidence and UpToDate Expert AI, tools specifically designed and marketed for healthcare professionals.

The battleground included MedQA questions, a well-established medical-knowledge benchmark drawn from medical licensing exams. The general-purpose models excelled across these tasks, beating the specialists on their home turf.

Google Search AI Overview was included as a control, representing the kind of quick-reference tool physicians actually reach for during a busy shift.

A pattern that keeps repeating

A February 2025 study found that chatbots outperformed physicians on clinical decision-making when the physicians were limited to internet references.

Then came a randomized controlled study published February 9, 2026, involving 1,298 participants in the UK. Standalone LLMs achieved 94.9% accuracy in identifying medical conditions. Yet collaborative performance, where participants worked alongside LLMs, did not surpass that of the control group.

Why this matters beyond healthcare

The researchers themselves identified a gap between high benchmark performance and real-world clinical applicability. Regulatory compliance, electronic health record integration, and liability frameworks do not show up in a MedQA score.

But clinician preference is hard to dismiss. If doctors actively prefer using GPT-5.2 over a tool built specifically for them, that’s a market signal, not just a research finding.