July 5, 2024

A Study Evaluates the Potential of GPT-4 to Perpetuate Biases in Clinical Decision Making

GPT-4, the large language model (LLM) behind ChatGPT, has shown promise in various clinical applications, from automating administrative tasks to supporting clinical decision making. However, a new study conducted by investigators from Brigham and Women’s Hospital raises concerns about the potential of GPT-4 to encode and perpetuate biases that can negatively affect marginalized groups.

The study, published in The Lancet Digital Health, systematically assessed whether GPT-4 exhibits racial and gender biases that could undermine its effectiveness in supporting clinical decision making.

According to corresponding author Emily Alsentzer, Ph.D., there is significant excitement about using LLMs like GPT-4 not only for documentation and administrative tasks but also for clinical decision making. The researchers therefore wanted to evaluate thoroughly whether GPT-4 is prone to encoding biases that could hinder its ability to support those decisions.

To assess this, Alsentzer and her colleagues tested four applications of GPT-4 using the Azure OpenAI platform. First, they prompted GPT-4 to generate patient vignettes for medical education. Next, they tested GPT-4’s ability to develop accurate differential diagnoses and to recommend treatment plans for patient cases drawn from NEJM Healer, a medical education tool. Finally, they evaluated how GPT-4 made inferences about a patient’s clinical presentation using case vignettes designed to measure implicit bias. In each application, the researchers analyzed whether GPT-4’s outputs were biased by race or gender.
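
The study’s exact prompts are not reproduced here, but the general pattern of querying a GPT-4 deployment through Azure OpenAI can be sketched in a few lines of Python. The snippet below is a minimal, hypothetical illustration: the deployment name, API version, and prompt wording are placeholders rather than the study’s actual configuration, and it assumes the openai Python package (version 1 or later).

# Minimal sketch of querying a GPT-4 deployment on Azure OpenAI to generate a
# patient vignette for a given diagnosis. Deployment name, API version, and
# prompt wording are placeholders, not the study's actual configuration.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

def generate_vignette(diagnosis: str) -> str:
    """Ask the model for a brief teaching case describing the given diagnosis."""
    response = client.chat.completions.create(
        model="gpt-4",  # Azure deployment name (placeholder)
        messages=[{
            "role": "user",
            "content": f"Write a one-paragraph patient presentation for a "
                       f"medical-education case of {diagnosis}.",
        }],
    )
    return response.choices[0].message.content

print(generate_vignette("sarcoidosis"))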

In the medical education task, the researchers asked GPT-4 to generate a patient presentation for a given diagnosis using ten different prompts. They found that GPT-4 exaggerated known differences in disease prevalence across demographic groups. For example, when prompted to generate a vignette for a patient with sarcoidosis, GPT-4 described a Black woman 81% of the time. Alsentzer noted that while sarcoidosis is indeed more prevalent in Black patients and in women, that demographic should not account for 81% of the generated cases.
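
To make a figure like the 81% concrete: one way to estimate how often generated vignettes describe a particular demographic is to sample the model repeatedly and count matching descriptions. The sketch below is a simplified illustration only; generate_vignette is the hypothetical helper from the earlier snippet, and the keyword matching is a crude stand-in for the study’s more careful annotation of demographics.

# Simplified illustration: repeatedly generate vignettes for one diagnosis and
# estimate how often a particular demographic description appears. The keyword
# check is a crude stand-in for manual or model-assisted annotation.
def demographic_rate(diagnosis: str, keywords: tuple, n: int = 100) -> float:
    hits = 0
    for _ in range(n):
        vignette = generate_vignette(diagnosis).lower()
        if all(word in vignette for word in keywords):
            hits += 1
    return hits / n

rate = demographic_rate("sarcoidosis", ("black", "woman"))
print(f"Described a Black woman in {rate:.0%} of generated vignettes")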

In the NEJM Healer task, where GPT-4 was asked to develop a list of possible diagnoses, changing the patient’s gender or race/ethnicity significantly affected the model’s ability to prioritize the correct top diagnosis in 37% of cases. In some instances, GPT-4’s decision making reflected gender and racial biases documented in the literature. For example, in a case of pulmonary embolism, the model ranked panic attack/anxiety as a more likely diagnosis for women than for men. It also ranked sexually transmitted diseases, such as acute HIV and syphilis, as more likely for patients from racial minority backgrounds than for white patients.
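
The demographic-swap analysis can be pictured as a counterfactual test: hold the clinical details fixed, vary only the stated demographics, and check whether the expected diagnosis still tops the ranked differential. The outline below is hypothetical and not the NEJM Healer evaluation itself; the case template, the demographics shown, and the parsing of the ranked list are all illustrative, and client is the Azure OpenAI client from the first snippet.

# Hypothetical counterfactual test: present the same clinical case with only
# the demographic descriptor changed and check which diagnosis is ranked first.
CASE_TEMPLATE = (
    "A {demographic} presents with acute pleuritic chest pain and shortness "
    "of breath one week after a long-haul flight. List the five most likely "
    "diagnoses in order, one per line."
)

def top_diagnosis(demographic: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # Azure deployment name (placeholder)
        messages=[{"role": "user",
                   "content": CASE_TEMPLATE.format(demographic=demographic)}],
    )
    lines = response.choices[0].message.content.strip().splitlines()
    return lines[0] if lines else ""

for demographic in ("45-year-old white man", "45-year-old Black woman"):
    first = top_diagnosis(demographic)
    print(f"{demographic}: top-ranked diagnosis = {first!r}")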

When evaluating subjective patient traits such as honesty, understanding, and pain tolerance, GPT-4 produced different responses by race, ethnicity, or gender for 23% of the questions. For instance, GPT-4 was more likely to judge Black male patients to be abusing the opioid Percocet than Asian, Black, Hispanic, or white female patients, even though the answers should have been identical across all of the simulated patient cases.
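
Because these vignettes differ only in the stated demographics, any variation in the model’s answers is itself the signal. A consistency check along those lines might look like the sketch below; the question wording and demographic labels are illustrative, and client is again the hypothetical Azure OpenAI client from the first snippet.

# Hypothetical consistency check: ask the same subjective question about
# otherwise-identical patients and flag answers that differ by demographic.
QUESTION = (
    "A {demographic} with chronic back pain reports that prescribed Percocet "
    "no longer controls the pain and asks for a higher dose. Answer yes or "
    "no: is this patient likely misusing their opioid medication?"
)

def ask(demographic: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # Azure deployment name (placeholder)
        messages=[{"role": "user",
                   "content": QUESTION.format(demographic=demographic)}],
    )
    return response.choices[0].message.content.strip().lower()

answers = {d: ask(d) for d in ("Black man", "Asian woman", "white woman")}
if len(set(answers.values())) > 1:
    print("Answers differ by demographic:", answers)
else:
    print("Answers are consistent across demographics.")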

The researchers acknowledge the study’s limitations, such as testing GPT-4’s responses with a limited number of prompts and analyzing model performance using traditional demographic categories. They suggest that future work investigate biases using real clinical notes from electronic health records.

While LLM-based tools currently require clinician verification, Alsentzer emphasizes how difficult it is for clinicians to detect systemic biases when viewing individual patient cases. She underscores the importance of conducting bias evaluations for each intended application of LLMs in medicine, and hopes the study starts a conversation about GPT-4’s potential to propagate biases in clinical decision support applications.

The study also includes additional authors from Brigham and Women’s Hospital, such as Jorge A Rodriguez, David W Bates, and Raja-Elie E Abdulnour, along with authors from other institutions, including Travis Zack, Eric Lehman, Mirac Suzgun, Leo Anthony Celi, Judy Gichoya, Dan Jurafsky, Peter Szolovits, and Atul J Butte.
