Healthcare LLM AI systems show dangerous bias against women patients
Large language model (LLM) artificial intelligence systems increasingly deployed across healthcare settings pose serious risks to patient safety, with new research revealing that these tools systematically recommend reduced medical care for women and other vulnerable populations based on clinically irrelevant factors, such as typing errors and communication style, rather than clinical need.
Researchers from the Massachusetts Institute of Technology have uncovered alarming evidence that large language models (LLMs) used for clinical decision-making exhibit dangerous biases that could compromise patient outcomes. The study, presented on 23 June 2025 at the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25) in Athens, Greece, demonstrates that LLM-based AI systems currently being piloted in hospitals worldwide make treatment recommendations based on non-medical factors including gender, writing style, and even typographical errors.
Women face systematic discrimination in LLM AI recommendations
The research reveals a troubling pattern of gender-based discrimination, with female patients experiencing approximately 7% more errors in treatment recommendations compared to male patients. Most critically, AI systems were significantly more likely to advise women to manage serious medical conditions at home rather than seek professional care, even when clinical evidence suggested medical intervention was necessary.
“These models are often trained and tested on medical exam questions but then used in tasks that are pretty far from that, like evaluating the severity of a clinical case. There is still so much about LLMs that we don’t know,” warned lead author Abinitha Gourabathina, an MIT graduate student in electrical engineering and computer science (EECS).
The bias persisted even when researchers removed explicit gender markers from patient communications, indicating that AI systems use subtle linguistic cues to infer patient demographics and subsequently alter their medical recommendations – a potentially life-threatening flaw in systems designed to provide objective clinical guidance.
Typing errors trigger dangerous care reductions
The MIT team’s analysis of four major LLM systems revealed that seemingly innocuous factors dramatically altered medical recommendations. Patients whose messages contained typos, extra whitespace, informal language, or uncertain phrasing were 7-9% more likely to be advised against seeking medical care compared to those with perfectly formatted communications.
This finding is particularly concerning given that vulnerable populations – including elderly patients, people with limited English proficiency, individuals with disabilities, and those experiencing health anxiety – are more likely to communicate in ways that trigger these AI biases.
Associate Professor Marzyeh Ghassemi from MIT’s Department of Electrical Engineering and Computer Science emphasised the gravity of these findings: “This work is strong evidence that models must be audited before use in health care – which is a setting where they are already in use.”
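To illustrate what such an audit can look like in its most basic form, here is a minimal sketch of a perturbation check, assuming a hypothetical get_triage_recommendation wrapper around whichever clinical LLM is being evaluated; the perturbation helpers and the toy keyword rule standing in for the model are illustrative only, not the researchers’ code.

```python
# Illustrative sketch only (not the study's code): apply surface-level edits to a
# patient message and check whether a model's triage advice changes.
import random
import re


def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly drop letters to mimic typing errors."""
    rng = random.Random(seed)
    return "".join(ch for ch in text if not (ch.isalpha() and rng.random() < rate))


def add_whitespace(text: str) -> str:
    """Insert extra spaces between words, mimicking messy formatting."""
    return re.sub(r" ", "   ", text)


def add_uncertainty(text: str) -> str:
    """Append hedging language typical of anxious or unsure patients."""
    return text + " I'm not sure it's worth bothering anyone about, sorry."


def get_triage_recommendation(message: str) -> str:
    # Hypothetical stand-in for the clinical LLM under audit. A real audit would
    # send the message to the deployed model and map its reply to a triage label.
    return "seek care" if "chest pain" in message.lower() else "manage at home"


if __name__ == "__main__":
    baseline = "I have had chest pain and shortness of breath since this morning."
    base_label = get_triage_recommendation(baseline)
    variants = {
        "typos": add_typos(baseline),
        "extra whitespace": add_whitespace(baseline),
        "uncertain phrasing": add_uncertainty(baseline),
    }
    for name, text in variants.items():
        label = get_triage_recommendation(text)
        flag = "CHANGED" if label != base_label else "unchanged"
        print(f"{name:>20}: {label} ({flag})")
```

The principle is the same one the study applies at scale: the clinical content of the message never changes, so any shift in the recommendation is evidence of brittleness rather than medical reasoning.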
Real-world deployment amplifies risks
The research extends beyond theoretical concerns, as healthcare systems worldwide are already implementing these AI tools for patient triage, clinical note generation, and treatment recommendations. Major electronic health record provider Epic has deployed GPT-4 for patient communication assistance, whilst numerous pilot programmes are underway in hospitals globally.
The study found that conversational AI interfaces – increasingly common in patient-facing applications – amplify these problematic patterns. Clinical accuracy degraded by approximately 7% across all tested scenarios when patients interacted with AI systems in realistic communication settings.
Critical care recommendations compromised
Perhaps most alarmingly, the research identified instances where AI systems recommended self-management for patients with serious medical conditions requiring immediate professional attention. The “colourful” language perturbations – incorporating dramatic expressions typical of patients in distress – had the most significant negative impact on care recommendations.
“In research, we tend to look at aggregated statistics, but there are a lot of things that are lost in translation. We need to look at the direction in which these errors are occurring – not recommending visitation when you should is much more harmful than doing the opposite,” Gourabathina explained.
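As a toy illustration of that point, the snippet below uses invented numbers (not the study’s data) to show how an aggregate accuracy figure can look identical across patient groups while the harmful error direction – advising someone to stay home when they need care – falls on only one of them.

```python
# Toy example with invented numbers -- not data from the paper.
from collections import defaultdict

# (patient group, correct triage, model triage)
results = [
    ("men",   "seek care",      "seek care"),
    ("men",   "seek care",      "seek care"),
    ("men",   "manage at home", "seek care"),       # over-referral: inconvenient
    ("women", "seek care",      "manage at home"),  # under-referral: dangerous
    ("women", "seek care",      "seek care"),
    ("women", "manage at home", "manage at home"),
]

accuracy = sum(gold == pred for _, gold, pred in results) / len(results)
print(f"Aggregate accuracy: {accuracy:.0%}")  # one error per group, same accuracy

under_referrals = defaultdict(int)
totals = defaultdict(int)
for group, gold, pred in results:
    totals[group] += 1
    if gold == "seek care" and pred == "manage at home":
        under_referrals[group] += 1

for group, n in totals.items():
    print(f"{group}: under-referral rate {under_referrals[group] / n:.0%}")
```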
Healthcare disparities risk amplification
The findings suggest that AI deployment in healthcare may inadvertently worsen existing health disparities. The research specifically tested communication patterns associated with vulnerable populations, including patients with health anxiety, limited technological literacy, and those using gender-neutral pronouns.
All these groups experienced reduced quality of AI-generated medical recommendations, raising serious concerns about equitable healthcare access as AI systems become more prevalent in clinical settings.
Urgent need for safety protocols
The authors stressed that current AI evaluation methods, which typically focus on medical examination performance, fail to capture these real-world biases that could endanger patient safety. The research demonstrates that “aggregate clinical accuracy is not a good indicator of LLM reasoning and that certain stylistic choices or formatting changes can hinder LLM reasoning capabilities.”
In their discussion, the authors warned: “LLMs deployed for patient systems may be especially vulnerable to superficial changes in input text – posing a serious concern for the reliability and equity of LLMs in patient-facing tools.”
Call for immediate action
The research team advocates for immediate implementation of comprehensive bias testing before any further AI deployment in healthcare settings. They emphasise that the observed disparities “reflect overall brittleness in the model’s behaviour” and represent “critical insights for the deployment of patient-AI systems.”
“Though LLMs show great potential in healthcare applications, we hope that our work inspires further study in understanding clinical LLM reasoning, consideration of the meaningful impact of non-clinical information in decision-making, and mobilisation towards more rigorous audits prior to deploying patient-AI systems,” the authors concluded.
This research serves as an urgent wake-up call for healthcare institutions rushing to implement AI systems without adequate safety testing, highlighting the potential for these tools to cause real harm to vulnerable patients.
Reference:
Gourabathina, A., Gerych, W., Pan, E., & Ghassemi, M. (2025). The Medium is the Message: How Non-Clinical Information Shapes Clinical Decisions in LLMs. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25), June 23–26, 2025, Athens, Greece. https://doi.org/10.1145/3715275.3732121