New Study Shows Foundation AI Models Closing the Gap With Doctors—But Multimodal Challenges Remain

01/16/2026

A new cross-sectional study published in JAMA Ophthalmology examines how current foundation artificial intelligence (AI) models perform on ophthalmology exam questions compared with human physicians. The data reveal notable progress on text-based tasks but ongoing limits in interpreting multimodal clinical information.

Researchers evaluated seven foundation models—including GPT-4o (OpenAI), Gemini 1.5 Pro (Google), and Claude 3.5 Sonnet (Anthropic)—using a set of offline written and multimodal questions drawn from Fellowship of the Royal College of Ophthalmologists (FRCOphth) Part 2 examination preparation materials. The goal was to assess how well these models could answer multiple-choice questions typical of clinical knowledge testing and to compare their performance with that of physicians with varying levels of ophthalmology experience.

On textual questions, the latest foundation models showed considerable improvements over older large language models and even outperformed ophthalmology trainees and junior physicians. Claude 3.5 Sonnet achieved the highest accuracy among AI models at 77.7%, surpassing trainees and closely approaching the performance of expert ophthalmologists in the study. Other models such as GPT-4o and Qwen2.5-Max also demonstrated strong performance on this task.

These results suggest that modern foundation AI could be useful in answering text-based clinical queries and interacting with structured clinical data—areas where AI is rapidly catching up to human clinicians.

However, when the evaluation included multimodal questions that combined text with clinical images or other visual data, all AI models tested—including GPT-4o, which led the group with about 57.5% accuracy—lagged behind both expert ophthalmologists and trainees. Human experts achieved significantly higher accuracy on these more complex, visual-integrated tasks.

The researchers said that while AI has grown adept at interpreting written clinical text, it still struggles to match human clinicians’ ability to integrate diverse information types, such as imaging and tables.

The authors conclude that although foundation models have matured markedly and show promise as educational tools or clinical support systems for textual reasoning, substantial work remains before their multimodal capabilities are ready for high-stakes clinical application. Larger and more diverse training datasets, task-specific fine-tuning, and rigorous evaluation across real-world clinical scenarios will be crucial steps in closing the gap between AI and human clinicians in comprehensive clinical reasoning.

Originally published online on Eyewire+.
