Keith J. Dreyer, DO, PhD

Although data mining did not come into wide use until the Internet became prevalent, it is steadily finding its niche in health care and, more specifically, in radiology. Radiology data mining is an information-distilling process whose goal is to extract usable knowledge from radiology images and reports; it is the cornerstone of better workflow optimization, outcomes analysis, and even the discovery of new diseases. The Massachusetts General Hospital Department of Radiology has recently developed a knowledge extraction technology known as LEXIMER (Lexicon Mediated Entropy Reduction), designed to optimize radiology data mining.

A modern radiological examination can generate thousands of images, a wealth of detailed ordering and demographic data, and a lengthy text report from the radiologist. Embedded within this report may be numerous positive findings and recommendations. This is the data in its rawest form.

Depending on the size of the institution, a radiology department can perform between 100 and 5,000 examinations daily, generating a myriad of images, patient data, report text, findings, and recommendations. Massachusetts General Hospital’s radiology department has produced and archived billions of raw images and words. The sheer volume of data generated makes the task of distilling any knowledge a considerable computational challenge.

Here is a simple example of converting data to knowledge. There are thousands of ways to describe an infarction of the brain within a radiology report. Size, shape, chronicity, severity, location, and hemorrhage are just a few of the categorization axes used to organize the thousands of defining terms. Suppose one wanted to find all patients who ever presented with acute hemorrhagic infarcts (say, to determine a deleterious side effect of a medication dosage). Without a data mining system, the task would require meticulous chart review of millions of examinations. A radiology-optimized data mining solution, by contrast, would resolve the thousands of expressive terms to simple categories such as Infarction, Acute, and Hemorrhagic, and then provide accurate patient lists within seconds. This is how raw data is converted into information for the purpose of knowledge derivation.
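The term-to-category resolution described above can be sketched in a few lines of Python. This is a minimal illustration only, not LEXIMER's method, which the article does not detail. The lexicon entries, the function names `categorize` and `find_patients`, and the sample reports are all assumptions made up for the example; a real lexicon would hold thousands of terms.

```python
import re

# Hypothetical mini-lexicon: each category maps to a few of the many
# phrases a radiologist might use. Invented for illustration.
LEXICON = {
    "Infarction": ["infarct", "infarction", "ischemic stroke"],
    "Acute": ["acute", "recent", "new"],
    "Hemorrhagic": ["hemorrhagic", "hemorrhage", "bleed"],
}

def categorize(report_text):
    """Return the set of lexicon categories whose terms appear in the report."""
    text = report_text.lower()
    found = set()
    for category, terms in LEXICON.items():
        for term in terms:
            # Whole-word match so "new" does not fire inside "renewed".
            if re.search(r"\b" + re.escape(term) + r"\b", text):
                found.add(category)
                break
    return found

def find_patients(reports, required):
    """Return sorted patient IDs whose report carries every required category."""
    return sorted({pid for pid, text in reports if required <= categorize(text)})

# Made-up reports for illustration only.
reports = [
    ("P001", "Recent infarct in the left MCA territory with hemorrhagic conversion."),
    ("P002", "Chronic lacunar infarction, unchanged."),
    ("P003", "Acute ischemic stroke with associated bleed."),
]
acute_hemorrhagic = find_patients(reports, {"Infarction", "Acute", "Hemorrhagic"})
print(acute_hemorrhagic)
```

Note that this sketch matches words only. It does not handle negation ("no hemorrhage") or degrees of certainty, both of which a production system would need to address.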

Significant barriers must be overcome in order to extract meaning from data in a systematic way. Patient demographic, examination, image, text, and ordering data are typically located on disparate systems and can be difficult to extract, normalize, and integrate. Beyond this, image data presents unique obstacles. Compared with text, radiology images are enormous in size and highly variable over time. The manner in which CT was performed in 1985 differs greatly from how it is done today, and data mining applications must take these differences into account to be most effective. Another challenge is that the data itself is contained within busy clinical systems. A typical PACS is in constant clinical use, which makes it quite difficult to mine for image data unless the data has been classified and extracted a priori.

Text report data is also resistant to effective data mining. Most radiology reports are minimally structured: a simple text field associated with an examination and a patient. In addition, the reporting process is highly subjective. If two radiologists were to produce reports based on the interpretation of identical image data, they would almost certainly not use the exact same words, even though they would most likely convey the same information. Radiologists can also vary considerably in their confidence of assertion, that is, their certainty. This is particularly true when the interpreted images do not reveal a classic presentation or conclusive evidence of disease.

THE FIRST STEP

In general, radiology data is well organized but poorly structured, and structuring this data prior to knowledge extraction is an essential first step in the successful mining of radiological data. The structuring of image data can be achieved through image preprocessing and feature extraction, techniques often utilized in computer-assisted diagnosis (CAD). If a data preprocessing application can detect and extract the density of bone on CT or the thickness of myocardium on MRI, that information can then be stored and later mined endlessly by institutional researchers.
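As a rough sketch of what storing an extracted image feature might look like, the Python below reduces a list of CT voxel values to one structured record. The function name `bone_density_feature`, the record fields, the 300 HU bone threshold, and the sample values are all assumptions chosen for illustration; real preprocessing would work on segmented image volumes rather than a flat list.

```python
def bone_density_feature(exam_id, hu_values, bone_threshold_hu=300):
    """Summarize voxels at or above an assumed bone threshold as one record.

    hu_values is a flat list of Hounsfield unit samples. The threshold is
    an illustrative assumption, not a clinical standard.
    """
    bone = [hu for hu in hu_values if hu >= bone_threshold_hu]
    mean_hu = sum(bone) / len(bone) if bone else None
    return {
        "exam_id": exam_id,
        "bone_voxel_count": len(bone),
        "mean_bone_hu": mean_hu,
    }

# Made-up samples: air, soft tissue, and bone-like values.
sample_voxels = [-1000, 40, 55, 320, 700, 1180]
record = bone_density_feature("EXAM-0001", sample_voxels)
print(record)
```

Once a record like this is stored alongside the examination, later queries can run against the compact feature instead of the full image data.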

Pertinent text in a radiology report can be extracted in much the same way. Using a process called Natural Language Understanding, it is now possible to extract the meaning of a sentence in a report and link it to standard nomenclatures such as the Unified Medical Language System (UMLS), the Systematized Nomenclature of Medicine (SNOMED), or RadLex (see article “Straight Talk”). Various terms can be identified, classified, and consolidated to provide a common meaning. The use of Natural Language Understanding with subsequent mapping to standard taxonomies ultimately overcomes one of the major challenges of data mining: ambiguity.

LEXIMER enables the transformation of poorly structured radiology reports into manageable vehicles of information. Through simple queries, a facility can instantly extract, quantify, and display any radiological presentation of disease reported over the past 20 years, deriving a wealth of knowledge regarding the health of its patient population as well as the quality of care it provides.

Massachusetts General Hospital mined 10 years of radiology reports using LEXIMER and was able to demonstrate that, even with moderate increases in annual examination volume, positive-finding rates and recommendation rates remained consistent. Specific modalities can also be tracked and organized based on positive findings and recommendations. Using this process, a health care system can begin to determine which examinations are ordered too often and by whom, as well as those that may need to be ordered more frequently.

OPTIMIZING WORKFLOW

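The ability of technologies such as LEXIMER to extract meaning from data also has workflow implications for radiologists and their referrers. The possibilities include automated access to teaching files, triage of cases by severity, review of positive priors only, and assessment of follow-up.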

Data Mining Uses

The following scenarios represent practical uses for data mining software in radiology.

  • A physician is treating a patient with symptoms that include headaches and coordination changes. The initial decision is to order a CT of the brain. But by utilizing a special computer application, the physician discovers an MRI would be a more effective procedure.
  • Using speech recognition, a radiologist dictates a report of an MRI examination of the neck. He notices a mass lesion in an anatomical region he is not completely familiar with. With a simple voice command, the radiologist requests the speech recognition application to instantaneously display an MRI atlas of the neck. The radiologist now identifies the mass as being contained within the levator veli palatini muscle and can even ask for further information on the lesion, including a complete list of potential differential diagnoses.
  • A director of radiology, challenged by a payor, needs to know the aggregate number of positive findings of her radiology department in the last 5 years as well as a detailed breakdown of each radiologist within the group. Within seconds, she searches through millions of reports and finds the exact number of positive findings for each radiologist, modality, and clinical indication.

Automated access to teaching files. If a physician asks to see all cases of multiple ependymomas of the brain, Render, the intelligent query system at Massachusetts General Hospital that utilizes LEXIMER, returns all pertinent studies and reports almost instantaneously.

Triage cases based on severity. If a radiologist has 200 unread cases in their workflow queue, LEXIMER can classify the queue by predicted severity of findings.

Review only positive priors. If a patient has numerous previous reports on file, a radiologist may prefer to find and display only those priors with relevant positive findings using LEXIMER.

Assess follow-up. Using LEXIMER, all reports can be displayed for which recommendations were made for follow-up studies but never acted on, an application both radiologists and referring physicians would find useful. Physicians can also be notified by the system when a patient has not received a recommended examination in the prescribed time interval, such as a follow-up CT of the chest for a newly identified lung nodule.
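The follow-up check described above can be sketched as a comparison between recommendations and completed examinations. This is a minimal illustration under stated assumptions: the record layout, the function name `overdue_followups`, and the sample data are hypothetical, and the recommendations are assumed to have already been extracted from report text into structured form.

```python
from datetime import date, timedelta

def overdue_followups(recommendations, completed_exams, today):
    """Return patient IDs whose recommended exam is past due and not performed.

    recommendations: dicts with patient_id, exam_type, made_on, due_within_days.
    completed_exams: tuples of (patient_id, exam_type, performed_on).
    """
    overdue = []
    for rec in recommendations:
        deadline = rec["made_on"] + timedelta(days=rec["due_within_days"])
        # A recommendation counts as acted on if a matching exam
        # was performed any time after it was made.
        acted_on = any(
            pid == rec["patient_id"]
            and exam == rec["exam_type"]
            and performed > rec["made_on"]
            for pid, exam, performed in completed_exams
        )
        if not acted_on and today > deadline:
            overdue.append(rec["patient_id"])
    return overdue

# Made-up data for illustration only.
recommendations = [
    {"patient_id": "P1", "exam_type": "CT chest", "made_on": date(2024, 1, 10), "due_within_days": 90},
    {"patient_id": "P2", "exam_type": "CT chest", "made_on": date(2024, 1, 15), "due_within_days": 90},
    {"patient_id": "P3", "exam_type": "CT chest", "made_on": date(2024, 5, 1), "due_within_days": 90},
]
completed_exams = [("P2", "CT chest", date(2024, 3, 20))]
needs_notice = overdue_followups(recommendations, completed_exams, date(2024, 6, 1))
print(needs_notice)
```

Here P2 was followed up and P3 is not yet due, so only P1 would trigger a notification to the referring physician.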

Data mining can also be used as a quality control tool. Using LEXIMER, a health care facility could extract an individual radiologist’s history of positive findings and recommendations and compare it with that of his or her peers. Referring physicians can be evaluated on their ordering practices, in particular on how often the examinations they order yield positive rather than negative findings. By mining large, historical banks of data, a system can help identify inappropriate ordering of high-cost imaging, thereby reducing a health care facility’s operational costs while improving consistency of care.
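A peer comparison of this kind might be sketched as follows. The code assumes each report has already been classified as positive or negative, and it compares each radiologist's positive-finding rate with the pooled group rate. The function name `peer_comparison`, the 0.25 tolerance, and the sample reads are assumptions for illustration. Raw rate differences also ignore case mix, so a flag here is a prompt for review rather than a conclusion.

```python
from collections import defaultdict

def peer_comparison(reads, tolerance=0.25):
    """Compare each radiologist's positive-finding rate with the group rate.

    reads: non-empty iterable of (radiologist, is_positive) pairs.
    tolerance: assumed maximum acceptable gap from the pooled rate.
    """
    counts = defaultdict(lambda: [0, 0])  # radiologist -> [reads, positives]
    for radiologist, is_positive in reads:
        counts[radiologist][0] += 1
        counts[radiologist][1] += int(is_positive)

    total_reads = sum(n for n, _ in counts.values())
    total_positive = sum(p for _, p in counts.values())
    group_rate = total_positive / total_reads

    summary = {}
    for radiologist, (n, p) in counts.items():
        rate = p / n
        summary[radiologist] = {
            "rate": rate,
            "delta": rate - group_rate,
            "flagged": abs(rate - group_rate) > tolerance,
        }
    return group_rate, summary

# Made-up reads: A has 4 of 10 positive, B has 5 of 10, C has 9 of 10.
reads = (
    [("A", True)] * 4 + [("A", False)] * 6
    + [("B", True)] * 5 + [("B", False)] * 5
    + [("C", True)] * 9 + [("C", False)] * 1
)
group_rate, summary = peer_comparison(reads)
flagged = sorted(name for name, row in summary.items() if row["flagged"])
print(group_rate, flagged)
```

The same tally, keyed by referring physician instead of radiologist, would give the yield of positive findings per ordering physician.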

RESEARCH APPLICATIONS

Quality assurance and quality control can be similarly enhanced through the use of data mining. Commercial systems using LEXIMER can analyze interphysician reliability and consistency among radiologists and referring physicians, drawing on the same peer comparisons of positive findings and recommendations and the same evaluation of ordering practices. Inappropriate ordering of high-cost imaging procedures, once identified and corrected, can ultimately reduce a health care facility’s operational costs while improving the consistency of its care process.

In addition to image data, health care facilities are rapidly gathering genetic data on their patient populations. A genotype is an individual’s DNA composition, while a phenotype comprises the features of health and disease expressed throughout life. Phenotypes can therefore be identified, in large part, by medical imaging. From a single patient’s genotype and phenotype, it is difficult to determine which diseases correspond to which DNA variations. But through Bayesian analysis of large populations, such information can be mined for consistent correlations. This is one area of research being conducted by the Harvard Partners Center for Genetics and Genomics, which is using the LEXIMER data extraction tool. Imagine a computational analysis of billions of bits of information that uncovers a high correlation between an image-based representation of disease and its specific DNA markers. Identifying such a correlation would enable us to search for and treat patients with those DNA markers even before the image-based presentation of their illness. Personalized therapy and molecular imaging protocols could be defined to track the progression of a disease. Disease discovery and treatment could be enhanced in ways never before imagined. And at the heart of this potential breakthrough is data mining.
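To make the population-level reasoning concrete, the sketch below applies Bayes' theorem to a two-by-two table of counts: patients with or without a DNA marker, and with or without an imaging finding. It is a textbook calculation on invented numbers, not the analysis performed by the center named above. The function name `finding_given_marker` and all counts are assumptions.

```python
def finding_given_marker(both, marker_only, finding_only, neither):
    """Estimate P(finding | marker) from population counts via Bayes' theorem.

    Returns (prior, posterior), where prior is P(finding) across the whole
    population and posterior is P(finding | marker).
    """
    total = both + marker_only + finding_only + neither
    prior = (both + finding_only) / total         # P(finding)
    likelihood = both / (both + finding_only)     # P(marker | finding)
    evidence = (both + marker_only) / total       # P(marker)
    posterior = likelihood * prior / evidence     # P(finding | marker)
    return prior, posterior

# Invented population of 10,000 patients.
prior, posterior = finding_given_marker(
    both=40, marker_only=60, finding_only=60, neither=9840
)
lift = posterior / prior
print(prior, posterior, lift)
```

In this made-up population the finding appears in 1 percent of patients overall but in about 40 percent of those carrying the marker, the kind of consistent correlation that only becomes visible when many patients are analyzed together.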

While radiology data has been poorly mined in the past, proper data mining is providing valuable new information. Applications of this new information are destined to be an important force in medical imaging.

Keith J. Dreyer, DO, PhD, is vice chairman, Department of Radiology, Massachusetts General Hospital, Boston.