
International Study Shows Large Language Models Now Excel on Ophthalmology Specialty Exams but Risks Remain

02/05/2026

A new study published this week in Eye reveals that the latest generation of large language models (LLMs) has made dramatic gains on specialty ophthalmology examinations—with some models now surpassing top human scores—but also highlights persistent limitations that could pose safety risks in clinical applications.

The research, led by an international team of ophthalmologists and AI experts, benchmarked six major LLMs on a set of postgraduate ophthalmology questions drawn from the Fellowship of the Royal College of Ophthalmologists (FRCOphth) exam.

According to the analysis, all evaluated LLMs showed significant improvement over earlier versions tested in 2023, with the most advanced models achieving exceptionally high accuracy on both the basic science and clinical reasoning sections of the exam. Notably, Google’s Gemini 2.5 Pro achieved 98% accuracy on the Part One basic science section and 90.7% on the Part Two clinical reasoning section, outperforming its predecessors and exceeding typical top human candidate scores.

Other models, including OpenAI’s ChatGPT 4o and ChatGPT 5, Anthropic’s Claude Sonnet 4.0, xAI’s Grok 3, and Deepseek-V3, also demonstrated strong results, with most surpassing 80% accuracy on both exam sections. The improvements from 2023 to 2025 were statistically significant for both exam parts.

Despite these advances, the study’s qualitative error analysis uncovered important limitations. In several instances where all models agreed on an incorrect answer, expert reviewers identified reasoning flaws—such as premature diagnostic closure and an overreliance on pattern recognition instead of comprehensive clinical reasoning. In one example, models incorrectly prioritized a common pharmacologic intervention over necessary neuroimaging, potentially missing critical pathology if applied in real practice.

The researchers also tested how well models handled image-based questions and found notably lower performance compared with text-based items, underscoring the ongoing challenge of reliable multimodal reasoning.

“LLMs are now outperforming human benchmarks on standardized ophthalmology exams, which is an exciting milestone,” said the study authors. However, they emphasized that high exam performance does not necessarily translate to clinical safety or utility. The models’ tendency to make confidently wrong decisions without adequate justification underscores the need for careful human oversight.

The team advocates for ‘clinician-in-the-loop’ workflows, in which AI serves as a tool to augment, rather than replace, clinical judgment. They also call for further research into model behavior on diverse diagnostic tasks and real-world decision support scenarios.

Originally published online on Eyewire+.
