Lisa C. Adams, MD, a radiologist at the Technical University of Munich, contributed to a study comparing Meta Llama 3, an open-source language model, with proprietary models like GPT-4. The study, published in Radiology, highlighted Meta Llama 3’s strong performance on radiology board-style exam questions, showcasing the potential of open-source AI in healthcare. Below, Adams discusses the findings, challenges, and future implications of this research for the field of radiology.
AXIS Imaging News: What were the most surprising findings in the comparison between Meta Llama 3 and larger proprietary models like GPT-4 and Claude-Opus?
Lisa C. Adams, MD: The most surprising finding was that Llama 3 70B matched leading proprietary models such as GPT-4 Turbo and Claude 3 Opus on radiology board-style examination questions despite having far fewer parameters. It achieved 80% accuracy on board-style questions and 74% on ACR in-training test questions, comparable to its proprietary counterparts.
AXIS: How do you think the open-source nature of Meta Llama 3 will influence its adoption in healthcare settings, especially given the concerns about privacy and stability with proprietary models?
Adams: The open-source nature of Llama 3 could significantly increase its adoption in healthcare. It allows for local operation within hospitals, addresses privacy concerns, and provides greater stability through controlled updates. The potential for customization enables the creation of specialized clinical models. Lower operating costs due to fewer parameters and optimization techniques make it an attractive option for healthcare institutions seeking AI solutions with data control.
Open models also allow for greater explainability. Researchers can explore the weights, test different model outputs without being constrained by an API, or even customize the model. Over time, this is likely to make open-source models more trustworthy, as everything can be reproduced and is fully traceable.
AXIS: Could you discuss any specific challenges or limitations the researchers faced when evaluating Meta Llama 3 on radiology board-style exam questions?
Adams: The multiple-choice format limited assessment of broader clinical reasoning. More nuanced benchmarks are needed to evaluate disease knowledge and guideline adherence. There was a risk of data contamination for publicly available questions. Crucially, the study didn’t assess image interpretation abilities, a critical skill in radiology.
AXIS: How do you think Meta Llama 3’s performance on these exam questions might translate to real-world clinical applications in radiology?
Adams: Llama 3’s exam performance suggests potential for clinical applications such as decision support or report generation. Real-world translation is limited, however, because exam questions don’t capture the complexity and ambiguity of decision-making in real clinical cases, and the lack of image interpretation assessment is a significant limitation for radiology applications. Still, Llama 3 shows that the model holds a wealth of knowledge that may be useful in clinical practice if applied correctly and with the right safeguards.
AXIS: What implications do these findings have for the future development of open-source LLMs in healthcare, particularly in specialized fields like radiology?
Adams: Our findings demonstrate the viability of open-source models in healthcare, which may encourage further investment and development. The potential for customization could lead to deeper integration of models into clinical workflows, perhaps even allowing for fine-tuned models tailored to specific clinical use cases. However, future research should focus on developing clinically relevant assessment methods that go beyond multiple-choice questions and combine real clinical cases with image data and other clinical data.
AXIS: Were there any types of radiology questions where Meta Llama 3 struggled or outperformed other models? How might this influence future model training?
Adams: As this was a proof-of-concept study, we did not assess performance on specific question types; instead, we focused on overall performance across a range of radiology board-style examination questions rather than breaking results down by question type or subject area.
AXIS: How do you anticipate the competition between open-source and proprietary LLMs will evolve in healthcare, considering Meta Llama 3’s performance in this study?
Adams: Competition between open-source and proprietary LLMs in healthcare is likely to intensify. Proprietary models have more funding, which gives them an advantage, but the open-source community is strong and knowledgeable and is steadily closing the gap with closed-source models, in medicine and beyond. Models like Llama 3 also allow smaller companies to develop competitive models quickly and without much funding, fostering competition in healthcare that is likely to benefit everyone.
Open-source models may leverage their advantages in privacy and customization, while proprietary models may focus on unique features or superior performance; both are likely to develop specialized healthcare versions. The newly released larger version of Llama 3 may further change the landscape. Regulatory frameworks may also evolve to address the challenges posed by both types of models in healthcare. Current regulations already facilitate open-source efforts in ways that may benefit model development, though not necessarily certification.