Summary: MIT researchers found that AI models used for medical diagnosis, especially in radiology, perform unevenly across demographic groups, producing less accurate diagnoses for women and people of color despite high overall accuracy.

Key Takeaways:

  1. Performance Discrepancies in AI Models: MIT researchers identified significant performance discrepancies in AI models used for medical diagnoses, with reduced accuracy for women and people of color, highlighting the need for fairness in AI applications as their use in healthcare grows.
  2. Demographic Shortcuts and Fairness Gaps: The study revealed that AI models with high accuracy in predicting demographic information, such as race and gender, exhibited the largest fairness gaps in diagnostic accuracy, indicating the use of “demographic shortcuts” that lead to biased outcomes.
  3. Importance of Local Evaluation and Training: The researchers emphasized the necessity for hospitals to evaluate AI models on their own patient populations and, whenever possible, to train models on local data to ensure fair and accurate medical diagnoses across all demographic groups.

————————————————————————————————————————————————————————

Researchers at the Massachusetts Institute of Technology (MIT) have identified significant performance discrepancies in artificial intelligence (AI) models used for medical diagnoses: the models’ accuracy varies across demographic groups. These findings are crucial as AI’s role in healthcare continues to grow; by May 2024 the U.S. FDA had approved 882 AI-enabled medical devices, 671 of them for radiology.

Artificial intelligence models have shown remarkable capabilities in medical imaging, often outperforming human experts in specific tasks. For example, a 2022 study by MIT researchers demonstrated that AI could predict a patient’s race from chest X-rays with high accuracy, a task even skilled radiologists cannot perform. However, these same models can also exhibit biases that result in less accurate diagnoses for women and people of color.

AI Fairness Gaps in Medical Diagnostics

A new study led by MIT’s Marzyeh Ghassemi, associate professor of electrical engineering and computer science, found that the AI models most accurate at predicting demographic information also exhibit the largest fairness gaps. These gaps represent discrepancies in diagnostic accuracy between different races and genders.
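To make the metric concrete, a fairness gap can be computed as the spread in per-group accuracy: evaluate the model separately on each demographic subgroup and take the difference between the best- and worst-served groups. The sketch below is a minimal illustration with placeholder data, not the study’s actual evaluation code.

```python
# Minimal sketch of a "fairness gap": the spread in diagnostic accuracy
# across demographic subgroups. Field names and data are illustrative,
# not from the MIT study.
import numpy as np

def fairness_gap(y_true, y_pred, group):
    """Return (gap, per-group accuracies), where gap is the difference
    between the best and worst subgroup accuracy."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    accuracies = {
        g: float((y_pred[group == g] == y_true[group == g]).mean())
        for g in np.unique(group)
    }
    return max(accuracies.values()) - min(accuracies.values()), accuracies

# Example: predictions for a binary finding, with self-reported group labels.
gap, per_group = fairness_gap(
    y_true=[1, 0, 1, 1, 0, 1, 0, 0],
    y_pred=[1, 0, 1, 0, 0, 0, 0, 1],
    group=["A", "A", "A", "A", "B", "B", "B", "B"],
)
print(f"per-group accuracy: {per_group}, gap: {gap:.2f}")
```

A model can score well on average while this gap remains large, which is exactly the failure mode the study highlights.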

“It’s well-established that high-capacity machine-learning models are good predictors of human demographics such as self-reported race or sex or age. This paper re-demonstrates that capacity, and then links that capacity to the lack of performance across different groups, which has never been done,” Ghassemi says.

The research indicates that these models may rely on “demographic shortcuts” that lead to incorrect diagnoses for women, Black patients, and other groups. The researchers did find ways to retrain the models to improve fairness, but the success of these debiasing techniques depended on the data used: models retrained on data from the same hospital as the test data showed improved fairness, yet when those models were applied to patients from other hospitals, the fairness gaps reappeared.

Evaluate AI Models Locally to Ensure Fairness

“I think the main takeaways are, first, you should thoroughly evaluate any external models on your own data because any fairness guarantees that model developers provide on their training data may not transfer to your population. Second, whenever sufficient data is available, you should train models on your own data,” says Haoran Zhang, an MIT graduate student and one of the study’s lead authors. Yuzhe Yang, the other lead author, echoes this view; Judy Gichoya of Emory University and MIT’s Dina Katabi also contributed to the study.

The study used publicly available chest X-ray datasets from Beth Israel Deaconess Medical Center to train models to predict three medical conditions. While overall performance was high, fairness gaps were evident: the models predicted more accurately for some demographic groups than for others. The researchers tested two debiasing strategies, described below: optimizing for subgroup robustness and removing demographic information using group adversarial methods. Both approaches showed promise within the same dataset but faltered when applied to external datasets.
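The article names the first strategy only at a high level. A common way to optimize for subgroup robustness is to minimize the loss of the worst-performing group during training, in the spirit of group distributionally robust optimization (group DRO). The PyTorch sketch below illustrates that idea under that assumption; it is not necessarily the authors’ exact method.

```python
# Sketch of a worst-group training objective for subgroup robustness
# (in the spirit of group DRO). Illustrative only; not necessarily the
# exact objective used in the study.
import torch

def worst_group_loss(logits, labels, groups, num_groups):
    """Return the highest mean cross-entropy loss over subgroups.

    Minimizing this pushes the model to perform well on its
    worst-served demographic group, not just on average.
    """
    per_sample = torch.nn.functional.cross_entropy(
        logits, labels, reduction="none"
    )
    group_losses = []
    for g in range(num_groups):
        mask = groups == g
        if mask.any():  # skip groups absent from this batch
            group_losses.append(per_sample[mask].mean())
    return torch.stack(group_losses).max()

# Expected shapes: logits (N, C), labels (N,), groups (N,) with ids in
# [0, num_groups). Use in place of a plain mean cross-entropy loss.
```

The design choice is simple: averaging hides underperforming subgroups, while taking the maximum makes the worst group’s error the quantity being optimized.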

Addressing Fairness in AI Medical Diagnoses

“Many popular machine learning models have superhuman demographic prediction capacity—radiologists cannot detect self-reported race from a chest X-ray,” Ghassemi says. “These are models that are good at predicting disease, but during training are learning to predict other things that may not be desirable.”
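One way to counteract the demographic signal Ghassemi describes is the study’s second debiasing strategy: removing demographic information with group adversarial methods. A common implementation of this family of methods uses a gradient reversal layer, where an auxiliary head learns to predict the demographic attribute while reversed gradients push the shared encoder to discard that signal. The sketch below is a generic illustration, not the study’s architecture; `encoder`, `feat_dim`, and the head names are placeholders.

```python
# Sketch of group-adversarial debiasing via gradient reversal. An
# auxiliary head tries to predict the demographic attribute; reversing
# its gradients pushes the shared encoder to discard that information.
# Generic illustration only; not the study's actual architecture.
import torch
from torch import nn

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Flip and scale the gradient flowing back into the encoder.
        return -ctx.lam * grad_output, None

class DebiasedClassifier(nn.Module):
    def __init__(self, encoder, feat_dim, num_classes, num_groups, lam=1.0):
        super().__init__()
        self.encoder = encoder  # placeholder: e.g., a CNN backbone
        self.disease_head = nn.Linear(feat_dim, num_classes)
        self.group_head = nn.Linear(feat_dim, num_groups)
        self.lam = lam

    def forward(self, x):
        z = self.encoder(x)
        disease_logits = self.disease_head(z)
        # Gradients from the group head are reversed before reaching z.
        group_logits = self.group_head(GradientReversal.apply(z, self.lam))
        return disease_logits, group_logits

# Training minimizes disease loss + group loss; the reversal makes the
# encoder effectively *maximize* group loss, stripping demographic cues.
```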

The findings underscore the importance of evaluating AI models on local patient populations before deployment to ensure accurate and fair medical diagnoses across all demographic groups. As hospitals increasingly adopt AI technologies, it is crucial to address these fairness issues to avoid perpetuating healthcare disparities.
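As a closing illustration of that advice, a hospital could run an externally trained model over its own labeled data and inspect per-group performance before deployment. Everything in the sketch below (`external_model`, `local_loader`, the batch layout) is a placeholder for a hospital’s own artifacts, not part of the study.

```python
# Hypothetical local-evaluation loop: run an externally trained model
# over a hospital's own labeled data and report per-group accuracy.
# `external_model` and `local_loader` are placeholders.
import torch

@torch.no_grad()
def evaluate_locally(external_model, local_loader):
    external_model.eval()
    preds, labels, groups = [], [], []
    for x, y, g in local_loader:  # images, diagnoses, group labels
        preds.append(external_model(x).argmax(dim=1))
        labels.append(y)
        groups.append(g)
    preds, labels, groups = map(torch.cat, (preds, labels, groups))
    for g in groups.unique():
        mask = groups == g
        acc = (preds[mask] == labels[mask]).float().mean().item()
        print(f"group {int(g)}: accuracy {acc:.3f} (n={int(mask.sum())})")
```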