AI Struggles in Medical Imaging: Simple Probing Evaluation Reveals Performance No Better Than Random Chance

Large language models (LLMs) and large multimodal models (LMMs) are starting to enter medical settings, but these technologies have not yet been thoroughly tested in such critical areas. Researchers from the University of California, Santa Cruz and Carnegie Mellon University recently examined how reliable LMMs are in medical diagnosis by asking them both general and highly specific diagnostic questions, and by asking whether these models are being evaluated appropriately for medical use in the first place.

By creating a new dataset and questioning state-of-the-art models about X-rays, MRIs, and CT scans of human abdomens, brains, spines, and chests, they found significant drops in performance. Even advanced models like GPT-4V and Gemini Pro performed no better than random guessing when identifying conditions and positions. Introducing adversarial pairs (probing questions about conditions or positions that are not actually present in the image) further reduced model accuracy by an average of 42%.
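The article does not spell out how these adversarial pairs were constructed. One way to picture the idea, sketched below in Python, is to pair each genuine question with a counterpart about a plausible but absent finding and credit the model only when it answers both correctly. The function, field names, and question templates here are hypothetical illustrations, not the researchers' actual code.

```python
# Illustrative sketch only: names and question templates are hypothetical,
# not taken from the ProbMed authors' implementation.
import random

def make_adversarial_pair(image_id, true_condition, candidate_conditions):
    """Pair a ground-truth question with a question about a plausible but
    absent condition. A model is credited only if it answers both correctly:
    'yes' to the real finding and 'no' to the fabricated one."""
    absent = random.choice(
        [c for c in candidate_conditions if c != true_condition]
    )
    return [
        {"image": image_id,
         "question": f"Is there evidence of {true_condition} in this scan?",
         "answer": "yes"},
        {"image": image_id,
         "question": f"Is there evidence of {absent} in this scan?",
         "answer": "no"},  # adversarial counterpart
    ]

# Example: a chest X-ray annotated with cardiomegaly
pair = make_adversarial_pair(
    image_id="chest_xray_0042",
    true_condition="cardiomegaly",
    candidate_conditions=["cardiomegaly", "pleural effusion", "pneumothorax"],
)
```

Scoring both halves of a pair together is what makes the probing harsh: a model that simply agrees with every suggested finding does well on the genuine question but fails its adversarial twin.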

Medical Visual Question Answering (Med-VQA) is a method used to evaluate how well models interpret medical images. Although LMMs have shown progress on benchmarks like VQA-RAD, their performance collapses under more probing examination. The researchers introduced a new dataset, Probing Evaluation for Medical Diagnosis (ProbMed), comprising 6,303 images drawn from common biomedical datasets and covering scans of the abdomen, brain, chest, and spine.

Using GPT-4, they extracted metadata about existing abnormalities, the names of those conditions, and their locations, producing 57,132 question-answer pairs covering organ identification, abnormalities, clinical findings, and positional reasoning. Seven state-of-the-art models, including GPT-4V, Gemini Pro, and open-source models such as LLaVA-v1 and MiniGPT-v2, were then tested. Even the most robust models saw accuracy fall by at least 10.52% on ProbMed, with an average decrease of 44.7%; LLaVA-v1-7B, for example, suffered a dramatic 78.89% drop.
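To put the reported declines in perspective, the snippet below shows a minimal way to score such a comparison. The model scores are placeholders, and since the article does not state whether the cited drops are absolute or relative, this sketch simply takes the difference in percentage points.

```python
# Placeholder scores for illustration; these are not the paper's raw numbers.
def accuracy(predictions, answers):
    """Fraction of question-answer pairs a model gets right."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def accuracy_drop(standard_acc, probing_acc):
    """Drop in accuracy, in percentage points, when moving from a standard
    benchmark (e.g., VQA-RAD) to a probing evaluation like ProbMed."""
    return (standard_acc - probing_acc) * 100

# A hypothetical model scoring 0.86 on a standard benchmark but 0.41 under
# probing shows a 45-point drop, roughly the scale of the average decrease
# reported in the study.
print(f"{accuracy_drop(0.86, 0.41):.1f}")  # 45.0
```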

The researchers found that both GPT-4V and Gemini Pro struggled with specialized diagnostic questions and often accepted false conditions and positions. In particular, GPT-4V’s accuracy dropped to 36.9% for questions about conditions or findings, and Gemini Pro’s accuracy was about 26% for position-related questions. Additionally, 76.68% of Gemini Pro’s errors resulted from hallucinations.

Specialized models like CheXagent, which focuses solely on chest X-rays, were more accurate in determining abnormalities and conditions but struggled with general tasks like organ identification. Interestingly, CheXagent could transfer its expertise to identify conditions in chest CT scans and MRIs, showing some potential for cross-modality expertise transfer in real-life scenarios.

The study emphasizes the urgent need for more robust evaluations to ensure that LMMs are reliable for critical fields like medical diagnosis. Current LMMs are still far from being applicable in these high-stakes areas.

In the medical and research community, many agree that AI is not ready for medical diagnosis. For example, Dr. Heidy Khlaaf pointed out that deploying LLMs in safety-critical infrastructure is dangerous, noting that these systems need at least 99% reliability but currently perform worse than random guessing. Other community members echoed similar concerns, highlighting the expertise that human professionals bring to the table, which AI cannot yet replicate.