GPT-5 surpasses human experts on medical reasoning benchmarks
Researchers at Emory University have demonstrated that GPT-5, the latest large language model from OpenAI, significantly outperforms human medical experts on standardised diagnostic reasoning tasks. The AI achieved remarkable accuracy rates exceeding 95% on medical licensing examinations whilst showing particular strength in integrating visual and textual clinical information.
A study from Emory University School of Medicine has found that GPT-5, OpenAI’s newest artificial intelligence model, outperforms pre-licensed human medical experts across multiple medical reasoning benchmarks. The research, posted to the arXiv preprint server on 13 August 2025, represents the first systematic evaluation of GPT-5’s capabilities in multimodal medical decision support.
GPT-5 achieves exceptional performance on medical licensing examinations
The research team, led by corresponding author Dr Xiaofeng Yang of the Department of Radiation Oncology at Winship Cancer Institute, evaluated GPT-5 using zero-shot chain-of-thought prompting, in which the model is asked to reason step by step without being shown any worked examples, across diverse medical question-answering tasks; a sketch of this setup appears below. The model achieved 95.84% accuracy on the United States Medical Licensing Examination (USMLE) MedQA dataset, an improvement of 4.80 percentage points over its predecessor, GPT-4o.
Most notably, GPT-5 scored an average of 95.22% across all three USMLE steps, with the largest gain, 4.17 percentage points, observed on Step 2, which focuses on clinical decision-making and management. “These results show that, on these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance,” the authors stated in their conclusion.
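To make the evaluation protocol concrete, here is a minimal sketch of zero-shot chain-of-thought prompting against the OpenAI API. The model identifier `gpt-5`, the system message, and the USMLE-style question are illustrative assumptions, not items or code from the study, whose exact prompts and answer-extraction logic live in the authors’ released evaluation code.

```python
# Minimal sketch of zero-shot chain-of-thought medical QA.
# Assumes the OpenAI Python SDK; the model name "gpt-5" and the
# question below are illustrative, not taken from the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "A 54-year-old man presents with sudden chest pain after forceful "
    "vomiting. Which initial imaging study is most appropriate?\n"
    "A) Abdominal ultrasound\nB) Gastrografin swallow study\n"
    "C) Barium enema\nD) Head CT"
)

# Zero-shot CoT: no worked examples are provided; the model is asked
# to reason step by step before committing to a single answer letter.
response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "You are a careful medical examinee."},
        {
            "role": "user",
            "content": question
            + "\n\nLet's think step by step, then state the final "
            "answer as a single letter on the last line.",
        },
    ],
)
print(response.choices[0].message.content)
```

In a full evaluation harness, the final answer letter would be parsed from the last line of the reply and compared against the dataset key to compute accuracy.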
Multimodal reasoning capabilities show dramatic improvements
The study’s most striking findings emerged from multimodal tasks, which require integrating visual and textual clinical information. On the challenging MedXpertQA multimodal benchmark, GPT-5 demonstrated substantial improvements over GPT-4o, with reasoning accuracy rising by 29.26 percentage points and understanding scores by 26.18 points.
The researchers evaluated the AI’s ability to analyse complex clinical scenarios combining patient narratives, laboratory results, and medical imaging. In one representative case study, GPT-5 correctly identified oesophageal perforation (Boerhaave syndrome) by synthesising CT imaging findings, laboratory values, and physical examination signs, then recommended the appropriate confirmatory test, a Gastrografin swallow study.
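For the multimodal tasks, each item pairs an image with a clinical vignette. The sketch below shows one plausible way to pose such a question through the same API, assuming a chat-completions endpoint that accepts base64-encoded images; the file `ct_slice.png` and the vignette are invented for illustration and do not come from the study.

```python
# Minimal sketch of a multimodal clinical question: one image plus text.
# "ct_slice.png", the vignette, and the "gpt-5" model name are all
# hypothetical illustrations, not materials from the study.
import base64

from openai import OpenAI

client = OpenAI()

# Embed the image as a base64 data URL so the example is self-contained.
with open("ct_slice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

vignette = (
    "History: forceful vomiting followed by severe chest pain. "
    "Labs: leukocytosis. Exam: subcutaneous crepitus over the chest wall. "
    "Given the CT slice, what is the most likely diagnosis, and which "
    "confirmatory study would you order? Reason step by step."
)

response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {
            "role": "user",
            # A single user turn can mix text and image parts.
            "content": [
                {"type": "text", "text": vignette},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

Encoding the image as a data URL keeps the example self-contained; a hosted image URL would work equally well.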
AI performance exceeds human medical experts
Perhaps most significantly, the study directly compared GPT-5’s performance against that of pre-licensed human medical experts. Whilst GPT-4o consistently underperformed the human experts by 5.03 to 15.90 percentage points across various dimensions, GPT-5 substantially exceeded human performance, by margins of 15.22 points in text reasoning, 9.40 points in text understanding, 24.23 points in multimodal reasoning, and 29.40 points in multimodal understanding.
“The magnitude of this lead is particularly striking in multimodal settings, where GPT-5’s unified vision-language reasoning pipeline appears to deliver an integration of textual and visual evidence that even experienced clinicians struggle to match under time-limited test conditions,” the researchers noted.
Comprehensive evaluation across multiple medical domains
The evaluation encompassed five major datasets spanning diverse medical specialties and reasoning types: MedQA, drawn from medical licensing examinations; the medical subset of the Massive Multitask Language Understanding (MMLU) benchmark; official USMLE self-assessment materials; the newly introduced MedXpertQA dataset, covering 17 medical specialties; and VQA-RAD for radiology-specific visual question answering.
Across nearly all benchmarks, GPT-5 demonstrated consistent improvements over the smaller GPT-5-mini and GPT-5-nano variants, as well as the previous-generation GPT-4o. The model achieved near-ceiling performance, exceeding 91% across all MMLU medical subdomains, with particularly notable gains in medical genetics (up 4.00 percentage points) and clinical knowledge (up 2.64 points).
Clinical implications and future considerations
The researchers emphasise that these findings represent performance in controlled testing environments rather than real-world clinical practice. “It is important to recognize that these evaluations occur within idealized, standardized testing environments that do not fully encompass the complexity, uncertainty, and ethical considerations inherent in real-world medical practice,” they cautioned in their discussion.
However, the study’s authors suggest the results have significant implications for clinical decision support systems. “The advancements represented by GPT-5 mark a pivotal moment in the evolution of medical AI, bridging the gap between research prototypes and practical, high-impact clinical tools,” they concluded.
Unexpected findings in radiology-specific tasks
Interestingly, GPT-5 scored slightly lower (70.92%) than GPT-5-mini (74.90%) on the VQA-RAD radiology dataset. The researchers attributed this unexpected finding to potential scaling-related differences in reasoning calibration, suggesting that larger models might adopt more cautious approaches when selecting answers for smaller, domain-specific datasets.
The study represents the first comprehensive evaluation of GPT-5’s medical reasoning capabilities and establishes new benchmarks for AI performance in healthcare applications. The researchers have made their evaluation code publicly available to facilitate further research in this rapidly evolving field.
Reference
Wang, S., Hu, M., Li, Q., Safari, M., & Yang, X. (2025). Capabilities of GPT-5 on Multimodal Medical Reasoning. arXiv preprint arXiv:2508.08224v2. https://doi.org/10.48550/arXiv.2508.08224