ChatGPT struggles with complex medical diagnoses, study finds

A new study reveals that while ChatGPT shows promise in medical education, it falls short in diagnostic accuracy for complex clinical cases, raising concerns about its reliability as a standalone diagnostic tool.

ChatGPT, the large language model that has taken the world by storm, has been touted as a potential game-changer in various fields, including healthcare. However, a study published July 31, 2024 in PLOS ONE [1] has shed light on the limitations of this artificial intelligence (AI) tool when it comes to complex medical diagnoses.

Researchers put ChatGPT to the test

Researchers from Western University in Canada set out to evaluate ChatGPT’s performance as a diagnostic tool for complex clinical cases. The team, led by Dr Amrit Kirpalani from the Department of Paediatrics at the Schulich School of Medicine and Dentistry, used 150 Medscape case challenges to assess the AI’s diagnostic accuracy, the cognitive load of its answers, and the overall relevance of its responses.

Diagnostic accuracy falls short

The results of the study were sobering. ChatGPT selected the correct diagnosis in only 49% of the cases; when every multiple-choice answer option was scored as a separate accept-or-reject decision, its overall accuracy was 74%, with a precision of 48.67%, a sensitivity of 48.67%, and a specificity of 82.89%. The area under the curve (AUC) was 0.66, indicating only moderate ability to discriminate between correct and incorrect diagnoses.
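
How a roughly 49% case-level hit rate can coexist with a 74% overall accuracy becomes clearer if each answer option is treated as its own accept-or-reject decision. The short Python sketch below is an illustrative reconstruction, not code or data from the study: it assumes each of the 150 cases offered four answer options and that ChatGPT selected exactly one per case, which yields counts consistent with the reported percentages.

    # Illustrative reconstruction (assumed counts, not taken from the paper):
    # 150 cases x 4 answer options = 600 accept-or-reject decisions.
    TP = 73    # correct diagnoses ChatGPT selected (73/150 ~ 48.67%)
    FN = 77    # correct diagnoses it missed
    FP = 77    # incorrect options it wrongly selected (one per missed case)
    TN = 373   # incorrect options it correctly ruled out (450 - 77)

    accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 446/600 ~ 74%
    precision   = TP / (TP + FP)                    # ~ 48.67%
    sensitivity = TP / (TP + FN)                    # ~ 48.67%
    specificity = TN / (TN + FP)                    # 373/450 ~ 82.89%

    print(f"accuracy={accuracy:.2%}, precision={precision:.2%}, "
          f"sensitivity={sensitivity:.2%}, specificity={specificity:.2%}")

Under this breakdown, the high specificity and low sensitivity mirror the pattern the researchers describe below: the model was far better at ruling out wrong answers than at picking the right one.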

Dr Kirpalani and his colleagues noted that while ChatGPT excelled at ruling out incorrect diagnoses, it struggled to identify the correct diagnosis consistently. This discrepancy is a significant concern for its use in clinical practice, where missed diagnoses could have serious consequences for patients.

Cognitive load and relevance

On a more positive note, the study found that ChatGPT’s responses were generally easy to understand: 51% of answers were rated as imposing a low cognitive load, and 52% were deemed complete and relevant. This quality could benefit medical students by improving learner engagement and information retention.

However, the researchers cautioned that the combination of ease of understanding with potentially incorrect or irrelevant information could result in misconceptions and a false sense of comprehension among users.

Strengths and weaknesses identified

The qualitative analysis of ChatGPT’s responses revealed several strengths, including its ability to provide clinical rationale, identify pertinent positive and negative findings, rule out specific differential diagnoses, and suggest future investigations.

Nevertheless, the study also uncovered significant weaknesses in the AI’s performance. These included difficulties in interpreting numerical values, inability to evaluate imaging results, struggles with nuanced diagnoses, occasional hallucinations (generation of incorrect or implausible information), and neglect of key information relevant to diagnoses.

Implications for medical education and practice

While ChatGPT shows potential as an educational tool, the researchers emphasised that its current form is not accurate enough to be relied upon as a diagnostic instrument. The study’s findings underscore the importance of human expertise in the diagnostic process and highlight the need for caution when using AI tools in healthcare settings.

Dr Kirpalani and his team suggest that future research should focus on enhancing the accuracy and reliability of ChatGPT as a diagnostic tool. They also call for the development of transparent guidelines for its clinical usage and advocate for training medical students and clinicians on how to effectively and responsibly employ such AI tools.

Ethical considerations

The study also touched upon important ethical considerations surrounding the use of AI in healthcare. These include concerns about patient privacy, data security, and the potential for algorithms to perpetuate existing biases present in their training data.

The researchers stressed the need for a clear legal framework addressing liability in cases of misdiagnosis involving AI tools. They emphasised that the overarching goal should be to ensure that AI serves as a tool to enhance, rather than replace, human expertise in medicine.

As AI continues to evolve and integrate into healthcare systems, studies like this one play a crucial role in shaping the future of patient care and medical training. The researchers call for further investigation into the long-term implications of using large language models like ChatGPT in healthcare and medical education.

While the potential benefits of AI in healthcare are significant, this study serves as a reminder of the importance of proceeding with caution and ensuring that these powerful tools are implemented in a responsible and ethical manner.

Reference:

  1. Hadi, A., Tran, E., Nagarajan, B., & Kirpalani, A. (2024). Evaluation of ChatGPT as a diagnostic tool for medical learners and clinicians. PLOS ONE, 19(7), e0307383. https://doi.org/10.1371/journal.pone.0307383