Summary: National Institutes of Health researchers found that while an AI model accurately answered medical quiz questions, it often made errors in image descriptions and reasoning, highlighting both its potential and limitations in clinical settings.

Key Takeaways

  1. NIH researchers found that an AI model accurately answered medical quiz questions but frequently made errors in describing images and explaining its decision-making process.
  2. The study highlights AI’s potential for faster diagnoses in healthcare but underscores that it cannot replace the crucial human experience needed for accurate diagnosis.
  3. Further evaluation and research are needed to fully understand and harness the potential of multimodal AI technology in clinical settings.

——————————————————————————————————————————————————

Researchers at the National Institutes of Health (NIH) found that an artificial intelligence (AI) model accurately answered medical quiz questions designed to test health professionals’ diagnostic skills based on clinical images and brief text summaries. However, physician-graders noted that the AI model often made errors in describing images and explaining its decision-making process. The findings, published in npj Digital Medicine, highlight AI’s potential and limitations in clinical settings.

“AI integration into healthcare holds promise for faster diagnoses, allowing earlier treatment,” says National Library of Medicine (NLM) Acting Director Stephen Sherry, Ph.D. “However, as this study shows, AI cannot yet replace the crucial human experience needed for accurate diagnosis.”

AI and Physicians Face Off on a Medical Image Quiz

The AI model and human physicians answered questions from the New England Journal of Medicine’s (NEJM) Image Challenge, an online quiz that presents clinical images with short text descriptions and asks users to select the correct diagnosis from multiple-choice answers. The AI was given 207 questions and asked to provide a rationale for each answer, including a description of the image, a summary of the relevant medical knowledge, and step-by-step reasoning.
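To make the setup concrete, below is a minimal sketch of how one such question might be posed to a vision-capable model using the OpenAI Python SDK. The function name, model choice, and prompt wording are illustrative assumptions; this is not the study’s actual evaluation pipeline.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_image_challenge(image_url: str, question: str, choices: list[str]) -> str:
    """Pose one Image Challenge-style question to a vision-capable model,
    asking for an image description, a knowledge summary, step-by-step
    reasoning, and a final multiple-choice answer.

    The prompt structure is an assumption for illustration only.
    """
    prompt = (
        f"{question}\n\nChoices:\n"
        + "\n".join(f"- {c}" for c in choices)
        + "\n\nBefore answering, describe the image, summarize the relevant "
        "medical knowledge, and reason step by step. End with the single "
        "best choice."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model; the study used GPT-4V
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    )
    return response.choices[0].message.content
```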

Nine physicians from various specialties participated, first answering questions without external resources (“closed-book”) and then with resources (“open-book”). They then reviewed the correct answers, the AI’s answers, and its rationales, scoring the AI on its image description, medical knowledge summary, and reasoning.

AI Outperforms Physicians in Closed-Book Tests

The study found that both the AI model and the physicians scored highly in selecting the correct diagnosis. The AI performed better than physicians in the closed-book setting, but physicians with open-book tools outperformed the AI, especially on the most challenging questions. Even when its final choice was correct, the AI often erred in describing images and explaining its reasoning, as in one case where it misinterpreted lesions on a patient’s arm.

The researchers emphasize the need for further evaluation of multimodal AI technology before clinical implementation. “This technology could augment clinicians’ capabilities with data-driven insights,” says NLM Senior Investigator Zhiyong Lu, Ph.D. “Understanding the risks and limitations is essential to harnessing its potential in medicine.”

The study used GPT-4V, a multimodal AI model that processes both text and images. Although the study is small, it highlights the potential of multimodal AI to aid physicians’ decision-making and underscores the need for further research comparing the diagnostic abilities of AI and physicians.