AI’s Real-World Struggles in Medical Conversations: Harvard and Stanford Study

A study from Harvard Medical School and Stanford University shows that while AI models like ChatGPT excel on medical exams, they struggle with real-world doctor-patient interactions, highlighting the need for more realistic evaluation frameworks and improvements in AI capabilities.

Artificial intelligence tools like ChatGPT have been heralded for their potential to reduce clinician workloads by triaging patients, taking medical histories and providing preliminary diagnoses. These tools, known as large language models, are already used by patients to make sense of symptoms and medical test results.

However, a recent study led by researchers at Harvard Medical School and Stanford University, published in Nature Medicine, reveals that despite their stellar performance on standardized medical tests, AI models falter in real-world medical conversations. 

To measure this gap, the researchers developed an evaluation framework named CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine), which assesses AI performance under conditions that mimic actual patient interactions.

The results showed a significant drop in performance when these AI models were subjected to more fluid, back-and-forth conversational scenarios typical of real-world medical settings.

“Our work reveals a striking paradox — while these AI models excel at medical board exams, they struggle with the basic back-and-forth of a doctor’s visit,” senior author Pranav Rajpurkar, an assistant professor of biomedical informatics at Harvard Medical School, said in a news release. “The dynamic nature of medical conversations — the need to ask the right questions at the right time, to piece together scattered information and to reason through symptoms — poses unique challenges that go far beyond answering multiple-choice questions. When we switch from standardized tests to these natural conversations, even the most sophisticated AI models show significant drops in diagnostic accuracy.”

Currently, AI models are usually tested with multiple-choice medical questions derived from national exams or certification tests for medical residents. These questions assume that all relevant information is presented clearly, often simplifying the diagnostic process.

In contrast, real-world interactions are more chaotic, with patients providing scattered and incomplete information.

Shreya Johri, a doctoral student in Rajpurkar’s lab at Harvard Medical School and co-first author of the study, emphasized the need for a more realistic testing process.

“We need a testing framework that reflects reality better and is, therefore, better at predicting how well a model would perform,” she said in the news release.

CRAFT-MD aims to meet this need by simulating real-world interactions where AI models must collect information about symptoms, medications and family history before making a diagnosis. The evaluation also included human experts, who assessed the AI’s ability to gather relevant patient information, its diagnostic accuracy when faced with scattered information and its adherence to prompts.

The study tested four AI models on 2,000 clinical vignettes representing common conditions in primary care and across 12 medical specialties.

The findings revealed that the models struggled particularly with conducting clinical conversations and reasoning based on information provided by patients. Their accuracy declined when they were presented with open-ended information rather than multiple-choice questions.

The research team recommends several actions to improve AI models’ performance in real-world settings. These include designing AI to handle conversational, open-ended questions, assessing models’ ability to extract critical information and integrating non-textual data like images and EKGs. The team also calls for more sophisticated AI agents capable of interpreting non-verbal cues, such as facial expressions and body language.

“As a physician scientist, I am interested in AI models that can augment clinical practice effectively and ethically,” added co-senior author Roxana Daneshjou, an assistant professor of biomedical data science and dermatology at Stanford University. “CRAFT-MD creates a framework that more closely mirrors real-world interactions and thus it helps move the field forward when it comes to testing AI model performance in health care.”

In the future, optimization and periodic updates to the CRAFT-MD framework are expected to further enhance interactions between patients and AI models. Such advancements could better prepare AI to serve in clinical settings, with the ultimate aim of improving patient care and clinical outcomes.