A recent UC San Francisco study finds that ChatGPT recommends unnecessary treatments in emergency departments, underscoring the need for improved AI frameworks in medical decision-making.
While artificial intelligence continues to make strides in various fields, a new study from UC San Francisco highlights critical limitations in its application to emergency medical care. The study, published in Nature Communications, found that ChatGPT, a large language model developed by OpenAI, often suggests unnecessary interventions in emergency departments (EDs).
AI Struggles With Complex Medical Decisions
The research team tested ChatGPT-3.5 and ChatGPT-4 using a dataset of 1,000 emergency department visits. Each visit included patient symptoms and examination findings from doctors’ notes.
The models’ recommendations on whether to admit each patient, order radiological imaging or prescribe antibiotics were then compared with the actual clinical decisions. Notably, ChatGPT-4 was 8% less accurate than resident physicians, while ChatGPT-3.5 lagged by 24%.
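The study’s own prompts and code are not reproduced here, but the comparison it describes is simple to sketch. The Python below is an illustrative mock-up only: the EDVisit fields, the prompt wording and the always-“yes” ask_model stub are assumptions for demonstration, not the researchers’ actual pipeline. It frames each decision as a yes/no question over the clinical note and scores the answers against what the physicians actually did; the stub mimics an overly cautious model that recommends every intervention.

```python
from dataclasses import dataclass

@dataclass
class EDVisit:
    note: str               # symptoms and exam findings from the doctor's note
    admitted: bool          # what clinicians actually decided
    imaging_ordered: bool
    antibiotics_given: bool

PROMPT = ("Emergency department clinical note:\n{note}\n\n"
          "Should this patient be {decision}? Answer only 'yes' or 'no'.")

def ask_model(prompt: str) -> bool:
    """Stand-in for a GPT-3.5 / GPT-4 call. This stub always answers 'yes',
    mimicking an overly cautious model that recommends every intervention."""
    return True

def accuracy(visits: list[EDVisit], field: str, phrase: str) -> float:
    """Share of visits where the model's yes/no matches the real clinical decision."""
    hits = sum(
        ask_model(PROMPT.format(note=v.note, decision=phrase)) == getattr(v, field)
        for v in visits
    )
    return hits / len(visits)

if __name__ == "__main__":
    toy_visits = [
        EDVisit("Chest pain, abnormal ECG.", True, True, False),
        EDVisit("Mild ankle sprain, stable vitals.", False, False, False),
    ]
    for field, phrase in [("admitted", "admitted to the hospital"),
                          ("imaging_ordered", "sent for radiological imaging"),
                          ("antibiotics_given", "prescribed antibiotics")]:
        print(f"{field}: {accuracy(toy_visits, field, phrase):.0%} agreement")
```

On this toy data the always-“yes” stub agrees with clinicians only when an intervention was genuinely warranted, which is exactly the over-triage pattern the study reports.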
While previous research by lead author Chris Williams, a postdoctoral scholar at UC San Francisco, showed that ChatGPT could sometimes outshine humans on straightforward diagnostic tasks, the current study underscores the model’s limitations in complex clinical decision-making.
“This is a valuable message to clinicians not to blindly trust these models,” Williams said in a news release. “ChatGPT can answer medical exam questions and help draft clinical notes, but it’s not currently designed for situations that call for multiple considerations, like the situations in an emergency department.”
Overprescription and Its Implications
The issue appears to be rooted in the training data. ChatGPT’s tendency to overprescribe may stem from its training on internet-sourced text, which leans toward cautionary advice suited to general public safety rather than the precise judgment an ED demands.
“These models are almost fine-tuned to say, ‘seek medical advice,’ which is quite right from a general public safety perspective,” Williams added. “But erring on the side of caution isn’t always appropriate in the ED setting, where unnecessary interventions could cause patients harm, strain resources and lead to higher costs for patients.”
The Path Forward
Future advances in AI for medical applications will require more nuanced frameworks for evaluating clinical information. Designers of these frameworks must strike a balance, ensuring the AI does not miss serious conditions while avoiding unnecessary tests and treatments.
“There’s no perfect solution,” Williams said. “But knowing that models like ChatGPT have these tendencies, we’re charged with thinking through how we want them to perform in clinical practice.”
The Study’s Broader Impact
The study offers critical insights as the health care industry increasingly explores AI-based solutions. It calls for a more collaborative approach among AI researchers, clinicians and policymakers to develop safer and more efficient health care technologies.
This study invites a re-evaluation of AI applications in emergency departments, aiming for a future where generative AI can complement, rather than complicate, crucial medical decision-making processes.