An AI chest X-ray foundation model for disease detection demonstrated racial and sex-related bias, leading to uneven performance across patient subgroups, and may be unsafe for clinical applications, according to a study published in Radiology: Artificial Intelligence, a journal of the Radiological Society of North America (RSNA). The study aims to highlight the potential risks of using foundation models in the development of medical imaging artificial intelligence.

“There’s been a lot of work developing AI models to help doctors detect disease in medical scans,” says lead researcher Ben Glocker, PhD, professor of machine learning for imaging at Imperial College London in the U.K. “However, it can be quite difficult to get enough training data for a specific disease that is representative of all patient groups.”

Due to the difficulty of collecting large volumes of high-quality training data, the AI field has moved toward using deep-learning foundation models that have been trained for other purposes. Foundation models are neural networks trained on large, often unlabeled datasets; they can handle tasks ranging from translating text to analyzing medical images.

“Despite their increasing popularity, we know little about potential biases in foundation models that could affect downstream uses,” Glocker says. Glocker’s research team compared the performance of a recently published chest X-ray foundation model with that of a reference model built by the team, evaluating both on 127,118 chest X-rays with associated diagnostic labels. The pre-trained foundation model was built with more than 800,000 chest X-rays from India and the U.S.

The researchers completed a comprehensive performance analysis to determine how well the models performed for individual subgroups. The 42,884 patients (mean age, 63; 23,623 male) in the study group included Asian, Black, and white patients. Bias analysis showed significant differences in the features related to disease detection across biological sex and race.

“Our bias analysis showed that the foundation model consistently underperformed compared to the reference model,” Glocker says. “We observed a decline in disease classification performance and specific disparities in protected subgroups.”

Significant differences in the features related to disease detection were found between male and female patients and between Asian and Black patients. Compared with the average model performance across all subgroups, classification performance on the ‘no finding’ label dropped by between 6.8% and 7.8% for female patients, and performance in detecting ‘pleural effusion’ (a buildup of fluid around the lungs) dropped by between 10.7% and 11.6% for Black patients.
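For readers who want to see what a subgroup comparison of this kind involves, the following is a minimal sketch, not the study's actual method. It assumes the metric is sensitivity (the share of true cases the model detects) and that the gap is reported as a relative change against the unweighted mean across subgroups; the article specifies neither. The function name `subgroup_gaps` and the record layout are hypothetical.

```python
from collections import defaultdict


def subgroup_gaps(records):
    """Compare per-subgroup sensitivity with the mean across subgroups.

    records: iterable of (subgroup, true_label, predicted_label) tuples,
             with labels encoded as 0 or 1 (hypothetical layout).
    Returns {subgroup: (sensitivity, relative_change_percent)}.
    Subgroups with no positive cases are skipped.
    """
    positives = defaultdict(int)       # true cases per subgroup
    true_positives = defaultdict(int)  # correctly detected cases per subgroup

    for group, y_true, y_pred in records:
        if y_true == 1:
            positives[group] += 1
            if y_pred == 1:
                true_positives[group] += 1

    sensitivity = {g: true_positives[g] / positives[g] for g in positives}
    if not sensitivity:
        return {}

    # Assumption: the baseline is the unweighted mean over subgroups.
    mean = sum(sensitivity.values()) / len(sensitivity)
    if mean == 0:
        return {g: (s, 0.0) for g, s in sensitivity.items()}

    return {g: (s, 100.0 * (s - mean) / mean) for g, s in sensitivity.items()}
```

A negative relative change for a subgroup means the model detects the condition less often in that group than it does on average, which is the kind of disparity the researchers report.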

“Dataset size alone does not guarantee a better or fairer model,” Glocker says. “We need to be very careful about data collection to ensure diversity and representativeness.” He notes that it’s important that foundation models are published and shared.

“To minimize the risk of bias associated with the use of foundation models for clinical decision-making, these models need to be fully accessible and transparent,” he says. Glocker is an advocate for comprehensive bias analysis as an integral part of the development and auditing of foundation models.

“AI is often seen as a black box, but that’s not entirely true,” he says. “We can open the box and inspect the features. Model inspection is one way of continuously monitoring and flagging issues that need a second look.”

The work doesn’t start with the AI model; it starts with the data used to build it, Glocker notes. “As we collect the next dataset, we need to, from day one, make sure AI is being used in a way that will benefit everyone,” he says.