MLCommons, a global engineering consortium focused on advancing machine learning, has achieved a significant milestone in the field of medical artificial intelligence (AI). The consortium announced the publication of “Federated Benchmarking of Medical Artificial Intelligence with MedPerf” in Nature Machine Intelligence. MedPerf is an open benchmarking platform that enables the evaluation of AI models using diverse real-world medical data while prioritizing patient privacy and addressing legal and regulatory concerns.
The development of MedPerf is the result of a two-year collaboration involving experts from over 20 companies, 20 academic institutions, and nine hospitals across 13 countries, spearheaded by the MLCommons Medical Working Group. The platform aims to enhance medical AI by evaluating models on large and diverse datasets, improving effectiveness, reducing bias, building public trust, and supporting regulatory compliance.
A key challenge in developing medical AI models is poor generalizability: models trained on limited data from specific clinical settings can carry unintended bias and deliver reduced real-world impact. MedPerf addresses this issue by providing AI researchers with access to diverse medical data from around the world, promoting improved generalizability and clinical efficacy.
Moreover, MedPerf uses federated evaluation, enabling healthcare organizations to assess and validate AI models without compromising patient data privacy. Models are deployed and evaluated remotely on the data providers' own premises, building trust among stakeholders and fostering collaboration.
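The core idea of federated evaluation can be illustrated with a minimal sketch: the model travels to each site, evaluation runs locally, and only aggregate metrics (never raw patient records) leave the premises. This is a conceptual illustration only; the `Site` and `federated_benchmark` names are hypothetical and do not reflect MedPerf's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Site:
    """A hypothetical data provider holding private (inputs, labels) pairs on premises."""
    name: str
    inputs: List[float]
    labels: List[int]

    def evaluate(self, model: Callable[[float], int]) -> dict:
        # Evaluation happens locally; only the accuracy summary leaves the site.
        correct = sum(model(x) == y for x, y in zip(self.inputs, self.labels))
        return {"site": self.name, "n": len(self.labels),
                "accuracy": correct / len(self.labels)}

def federated_benchmark(model: Callable[[float], int], sites: List[Site]):
    """Collect per-site metrics and a sample-weighted overall score."""
    reports = [s.evaluate(model) for s in sites]
    total = sum(r["n"] for r in reports)
    overall = sum(r["accuracy"] * r["n"] for r in reports) / total
    return reports, overall

# Toy stand-in for a trained model: classify positive values as 1.
model = lambda x: 1 if x > 0 else 0
sites = [
    Site("hospital_a", [0.5, -1.0, 2.0], [1, 0, 1]),
    Site("hospital_b", [-0.2, 0.1], [0, 0]),
]
reports, overall = federated_benchmark(model, sites)
```

The benchmark owner sees only the per-site summaries in `reports` and the weighted `overall` score; the raw `inputs` and `labels` never leave each `Site` object, which is the privacy property the federated design is built around.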
The platform’s orchestration and workflow-automation capabilities significantly accelerate federated learning studies, officials say, shortening research timelines from months to hours. This efficiency was demonstrated in the Federated Tumor Segmentation Challenge, where MedPerf successfully benchmarked 41 different models across 32 sites globally.
“Our goal is to use benchmarking as a tool to enhance medical AI,” says Alex Karargyris, PhD, MLCommons Medical co-chair. “Neutral and scientific testing of models on large and diverse datasets can improve effectiveness, reduce bias, build public trust, and support regulatory compliance.”