Should pay-for-performance initiatives be based on outcomes research?

When asked this question, radiologist Rebecca Smith-Bindman, MD, associate professor in epidemiology and biostatistics at the University of California, San Francisco (UCSF), responds simply, “How else?”

But as Smith-Bindman herself admits, deciding which radiological outcomes to agree on, how to measure them, and how effectively they can be measured is a long road on which only the first steps have been taken. “For most of radiology, we just have not yet done the studies to establish good process-of-care measures,” she says.

But now, because of a Congressional mandate agreed to by the American Medical Association, not just radiologists but also most physicians, physician groups, hospitals, and health care providers must prepare metrics—measurable processes, including some sort of outcome—so that Medicare and other payors can reward those practicing quality care.

Of course, defining what quality care is and how it is to be rewarded is the thorny little problem that remains to be solved. The footprints down this path are plain to see, however, and it would be a mistake, according to those interviewed for this story, to fall back on complacency. The watchword is attentiveness. Get ready for pay for performance—now being shorthanded in journals as P4P—because it is coming, and probably sooner than you think.

Michael J. Pentecost, MD, is director of the Mid-Atlantic Medical Group radiology practice for Kaiser Permanente’s facilities in Maryland, Virginia, and Washington, DC. He says his job includes managing 31 sites that produce about a half million imaging examinations annually. Pentecost is a long-time activist in the American College of Radiology (ACR), Reston, Va. He is a former member of the Board of Chancellors, and last year, he was named chair of the ACR’s Institute for Health Policy in Radiology. He is also a member of an informal collection of physicians dubbed the Sun Valley Group, which has been meeting yearly in Sun Valley, Idaho, to discuss performance measurement practices for radiologists.

According to Pentecost, three major medical entities are spearheading P4P for all physicians, radiologists included. The first is the AMA’s Physician Consortium for Performance Improvement. The Consortium, Pentecost says, has been “charged by Congress to develop about 140 P4P performance measurements, and they’ve done about 100 so far. The Consortium,” of which he is a member, Pentecost adds, “is the group that creates the metrics.”

Once the metrics—performance measurements—have been created, they are then sent to the National Quality Forum (NQF), a nonprofit health care policy group. At the NQF, the metrics of the Consortium are approved—not changed but screened for approval, Pentecost says. From the NQF, the metrics are sent to the Ambulatory Care Quality Alliance (AQA). The AQA is also nonprofit and includes Medicare and insurance and health care providers like Kaiser. The AQA “takes the last step and recommends metrics for implementation,” Pentecost says.

After that comes what Pentecost describes as a “long process” between physicians, insurance companies, and other industry players to iron out the metrics and how they are to be used.

“We anticipate the Centers for Medicare and Medicaid Services increasingly using these metrics,” Pentecost says. “For instance, they may not pay unless a [vascular] study includes a measurement of the carotid artery. That [type of application of a metric] is beginning now, and that’s P4P in medicine.”

Pentecost says that Medicare has targeted about 5.7% of its funds to be used for P4P, including rewards to doctors for quality care as P4P is rolled out. “That’s big bucks,” Pentecost says. He adds that eventually as much as 30% of Medicare funding may fall into the P4P category.

The stakes are high. The ACR is working with numerous physician organizations to design workable metrics, and is now recruiting for a full-time P4P specialist, Pentecost says.

He says that the key factor for radiologists is not radiology’s own metrics so much as whether the metrics that ultimately are approved exclude certain radiological tests.

He uses the example of a virtual colonoscopy (VC). If it is not included in the metrics developed to screen for colon cancer, and a colonoscopy is the metric instead, then primary care physicians (PCPs) will not be referring for VC.

“That’s where the rubber meets the road,” Pentecost says. If the PCP, to get a P4P bonus, needs to screen a certain percentage of patients for colon cancer, and the VC is not in the screening metrics, radiologists will not get referrals. For that test at least, they will be out in the cold.
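Pentecost's point about metric definitions can be made concrete with a small calculation. The sketch below is illustrative only: the ten-patient panel, the 70% bonus target, and the function name are assumptions invented for the example, not figures from any payor or from the Consortium's metrics.

```python
COLONOSCOPY = "colonoscopy"
VIRTUAL_COLONOSCOPY = "virtual_colonoscopy"


def screening_rate(patients, counted_tests):
    """Return the fraction of patients whose test counts toward the metric."""
    if not patients:
        raise ValueError("patient panel must not be empty")
    screened = sum(1 for test in patients if test in counted_tests)
    return screened / len(patients)


# Hypothetical panel: the screening test each patient received (None = unscreened).
panel = [COLONOSCOPY] * 5 + [VIRTUAL_COLONOSCOPY] * 3 + [None] * 2
TARGET = 0.70  # assumed bonus threshold, for illustration only

# Metric that recognizes only conventional colonoscopy.
narrow = screening_rate(panel, {COLONOSCOPY})
# Metric that also recognizes virtual colonoscopy.
broad = screening_rate(panel, {COLONOSCOPY, VIRTUAL_COLONOSCOPY})

print(f"colonoscopy only: {narrow:.0%}, bonus earned: {narrow >= TARGET}")
print(f"with VC counted:  {broad:.0%}, bonus earned: {broad >= TARGET}")
```

With these made-up numbers, the same physician misses the target when VC is left out of the metric and clears it when VC is counted, which is the referral incentive Pentecost describes.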

Outcomes Critical

To make the case that radiological tests and other processes and procedures should be included in the P4P metrics, radiologists, like other providers, will have to focus on outcomes. What happens with the patient? Just what are the characteristics of a radiologic test that are to be rewarded under P4P? What is a radiologic outcome?

Pentecost, like most, distinguishes between different classes of outcomes. He divides them into “process” measures (how quickly or correctly is the patient imaged and a report produced, and does a facility have a backlog of untranscribed examinations) and “clinical” outcomes (does the imaging and its interpretation result in a changed clinical course or better health for the patient).

Pentecost and others interviewed agree that for radiology, the process outcomes are much easier to measure. Technology, particularly the radiology information system (RIS), has made the compilation of process outcomes easier to audit, Pentecost says.

“Many radiologists think process measures can be massaged to improve practices,” he says. “Another group says that interpreting an exam quickly is good but that it’s the accuracy of the interpretation that counts. They say that not enough energy is being expended on the true clinical reporting of outcomes.”

UCSF’s Smith-Bindman agrees that the easier focus when seeking quality markers is on process. “When [clinical] outcomes are sparse, practice guidelines and process measures often are used. However, there are no practice guidelines that are evidence based, making this approach tricky. …In radiology, we know so little about what is good that it makes the use of process measures difficult.”

There are exceptions, such as the documented life-saving benefits of routine screening mammography, Smith-Bindman adds. But even finding cancers is not always beneficial to the patient, she says, and so some screening tests may not conform to quality health care. She uses the example of screening for lung cancer with CT. “Whether or not finding the cancer is helping the patients is not known. … There is so much cancer out there that never hurts anybody, and that’s why we need randomized trials. … We need to train our residents and fellows to do these kinds of studies and provide the data, because if we don’t, the rise in the cost of imaging will preclude a lot of these tests.”

Smith-Bindman calls P4P “a great incentive” to identify patient-beneficial imaging outcomes, but she says the industry still remains focused on creating new ways to image. “We need to move our scientists from developing toys to helping patients.”

In the meantime, while clinical outcomes are being developed, Smith-Bindman says the quality metrics for which radiologists are likely to be rewarded under P4P will be process assessments and measures like physician experience. “For some of the more expensive tests, [payors] might say that doctors have to do a certain number to be reimbursed. Volume may be used as a surrogate marker for quality.”

Advantage Radiology

Although tracing the clinical impact of a radiological examination can be difficult because of the many other physicians and caregivers involved in the treatment process, radiologists do have some advantages when it comes to defining quality care metrics that might end up in P4P schemes.

Carl D’Orsi, MD, is a professor of radiology at the Emory University School of Medicine, Atlanta, and an expert in breast imaging. Over a decade, D’Orsi helped develop the Breast Imaging Reporting and Data System (BI-RADS), a template for conducting and reporting mammograms. BI-RADS details six categories of mammographic assessment, each with a specific recommended course of action, from routine follow-up examinations to immediate biopsy or calling in a cancer specialist.

D’Orsi says an essential function of BI-RADS is to act as a lexicon for breast imaging, so that the same terms are used in the creation of reports.

“If you don’t know what your colleagues are saying or listing in their reports, then you are dead to begin with, because we don’t know what each other is talking about,” D’Orsi says. But BI-RADS is more than a lexicon, he adds.

“Because of the uniqueness of mammography, we’ve gone further and produced report cards for ourselves. We can now measure outcomes, such as how many cancers we detect, how many patients we call back, and what is the standard for this. We are acting as physicians as well as epidemiologists. We’re aiming to screen approximately half the people in this country, and this is not matched by anything in imaging.”
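The report card D'Orsi describes comes down to a few simple ratios. As a minimal sketch, assuming hypothetical counts for one practice, two of the measures he mentions (how many patients are called back and how many cancers are detected) could be computed as follows. The function name and the numbers are illustrative and are not taken from BI-RADS itself.

```python
def audit_summary(screened, recalled, cancers_detected):
    """Return the recall rate (percent) and cancers detected per 1,000 screened."""
    if screened <= 0:
        raise ValueError("screened must be a positive count")
    return {
        "recall_rate_pct": 100.0 * recalled / screened,
        "cancers_per_1000": 1000.0 * cancers_detected / screened,
    }


# Hypothetical one-year totals for a single practice.
summary = audit_summary(screened=10_000, recalled=900, cancers_detected=45)
print(summary)
```

Comparing such figures against an agreed standard is what turns them into a performance measure; the standard itself is the part that has to come from the profession.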

D’Orsi says BI-RADS can and undoubtedly will be adopted for use in P4P initiatives. “It’s right there and ready to be used. The problem is that it will only add to the paperwork for the radiologists.” He wonders, he says, if radiologists will be paid for documentation time under a P4P scheme. “It may be ‘pay’ in small letters and ‘performance’ in capital letters,” he says.

D’Orsi says establishing metrics similar to BI-RADS for other areas of imaging may not be as workable as mammography has shown itself to be.

“Say you’re doing an image of someone with belly pain—what kind of metric are you going to use? Did I find what was expected? Did I change clinical management? I think changing clinical management is one of those solid metrics, but then what is the standard? Did I change 10% or everyone? I hope the government doesn’t get involved in putting out some halfway process. This has to be thought out by experts so that the results are meaningful.”

D’Orsi says BI-RADS may function as “an example of defining the parameters of the various things we do in radiology, which are massive.” He also says that as a step toward metrics, the Appropriateness Criteria developed by the ACR is “a superb beginning for what has to be accomplished, and I’m sure no other area of medicine has the ability to do this.”

Overall, D’Orsi says, radiological outcomes “should be patient based—did we change clinical management, did we diminish unwanted experiences—the best practices thing, I’m leery of.”

The Role of Lexicon

Curtis P. Langlotz, MD, PhD, associate chair for informatics in the department of radiology at the University of Pennsylvania Health System, Philadelphia, is heading a Radiological Society of North America (RSNA) initiative to develop an extensive lexicon that will lend itself not only to consistent reports but also to the ability to cross-reference and data mine across radiology.

Called RadLex, the uniform vocabulary eventually will allow indexing and retrieval of radiological data across institutions and practices. Beginning with anatomy and pathology, RadLex will also describe such factors as modalities, techniques, and visual image features in a consistent way using standard terms. The first version of RadLex was released at the RSNA annual meeting this year.

Langlotz calls RadLex “a first step toward creating data that could be used to measure the accuracy and outcome of diagnostic imaging interpretations. If these terms could be associated with clinical reports, we could more easily measure how particular radiology readings correlate to outcomes. It would improve our ability to compare results of disparate research studies and measure the quality of our radiology reports.”

Langlotz says it is not yet clear how a tool like RadLex will impact P4P initiatives. “I think we have a responsibility to provide high-quality care, and measuring our performance is a first step toward improving it. The optimal types of incentives that might be put in place to encourage high quality are yet to be determined. Financial incentives are not the only way to encourage quality.

“The next major step toward data mining,” Langlotz says, “is to create software to associate these standard [RadLex] terms with clinical reports, without slowing down the radiologist.”

Grand Design

Researchers and developers at Massachusetts General Hospital (MGH) and Harvard Medical School, Boston, say they already have moved in that direction, drawing on SNOMED (the Systematized Nomenclature of Medicine, a multiaxial nomenclature for indexing medical records) as well as BI-RADS and RadLex to create software that can extract data from ordinary unstructured radiological reports.

Keith Dreyer, MD, PhD, is vice chair of radiology informatics at MGH and an assistant professor at Harvard. Dreyer and his colleagues have designed a system called LEXIMER, which they have licensed commercially to a radiology software vendor.

Dreyer is communicating with health networks about LEXIMER and a companion software program, RadCube, that allows radiology providers to chart data and trends, such as productivity and utilization management. He says that the interest from health care providers has been strong and that many have installed the products already.

LEXIMER, an acronym for lexicon mediated entropy reduction, is a form of natural language processing used specifically to structure unstructured radiology reports. It can analyze any radiology report for findings, pathology, recommendations, and other elements, Dreyer adds. “The whole premise is that if we are to maximize quality and efficiency by decreasing variability, we need a definitive way to measure outcomes of the radiology process,” he says. “LEXIMER provides us with those essential quantitative endpoints.

“If I dictate a report that says there’s a 2-cm mass in the liver, that it’s suspicious for carcinoma, and that the kidneys and pancreas are essentially unremarkable, and I recommend a follow-up with MRI,” he continues, “LEXIMER extracts the findings and recommendations from the other statements and records ‘2-cm mass, carcinoma probable, MRI recommended.’ It throws out the noise and structures the signal.”

Although structured reports are essential for quality and outcome analysis, Dreyer says that conventional structured reporting systems have proven too cumbersome, requiring radiologists to select every element in a report by pulling down menus and clicking as they go along. LEXIMER essentially does this after the fact.
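The general idea of structuring a report after the fact can be illustrated with a toy example. The sketch below is not LEXIMER and makes no claim about how LEXIMER works; it is a crude keyword matcher with an invented three-word lexicon, and every name in it is hypothetical. It simply shows what it means to keep the findings and recommendations and discard the rest.

```python
import re

# Hypothetical mini-lexicon; real vocabularies such as RadLex or SNOMED are far larger.
FINDING_PATTERN = re.compile(r"\b(mass|nodule|lesion)\b")
SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*-?\s*(cm|mm)\b")
MODALITY_PATTERN = re.compile(r"\b(MRI|CT|ultrasound)\b", re.IGNORECASE)
NORMAL_PHRASES = ("unremarkable", "no evidence of", "within normal limits")


def structure_report(text):
    """Keep findings and recommendations from a free-text report; drop the rest."""
    findings, recommendations = [], []
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        lower = sentence.lower()
        if "recommend" in lower:
            modality = MODALITY_PATTERN.search(sentence)
            if modality:
                recommendations.append(modality.group(1))
        elif any(phrase in lower for phrase in NORMAL_PHRASES):
            continue  # normal statement: treated as noise
        else:
            finding = FINDING_PATTERN.search(lower)
            if finding:
                size = SIZE_PATTERN.search(lower)
                findings.append({
                    "term": finding.group(1),
                    "size": f"{size.group(1)} {size.group(2)}" if size else None,
                    "suspicious": "suspicious" in lower,
                })
    return {"findings": findings, "recommendations": recommendations}


REPORT = (
    "There is a 2-cm mass in the liver that is suspicious for carcinoma. "
    "The kidneys and pancreas are essentially unremarkable. "
    "Recommend follow-up with MRI."
)
print(structure_report(REPORT))
```

A production system would need to handle negation, synonyms, and context far more carefully than this; the point is only that the radiologist dictates normally and the structure is derived afterward.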

LEXIMER has been combined at MGH with another internally developed software application called ROE DS, based partly on the ACR Appropriateness Criteria, to guide and track when and how referring physicians are ordering radiological tests.

“Now, we can start to look at quality and performance throughout the entire process from a variability perspective,” Dreyer says. Radiologists can be tracked to see if their reporting is consistent with that of their peers, and referrers can be guided and tracked to ensure that they are ordering the proper examinations given their patients’ symptoms.

Since installing the system at MGH, Dreyer says, the ordering of inappropriate examinations has fallen from 15% to less than 2% of total examinations. Just how ROE DS and LEXIMER might play out as a nationwide P4P methodology is as yet unknown, but Dreyer says that payors are eager to improve quality by reducing unnecessary utilization now.

“The major payors in eastern Massachusetts already have approved this system as a method for preauthorizing exams,” he says, “and it has started to grow beyond that catchment’s area.”

Will P4P Work?

However rosy the prospects for marshaling radiological outcomes for P4P may be, questions about the efficacy and fairness of any P4P initiative are raising concerns among radiologists.

Kimberly Applegate, MD, is a pediatric radiologist at the Riley Hospital for Children, Indianapolis; she is also a member of the ACR Metrics Committee. Applegate says P4P could be a “very good step” if properly taken, but she says she worries that doing it right may take time.

“No matter what we do, we always think we’re going to get it right, but we might not the first time,” she says. “HMOs, for instance, overall have failed to live up to what people thought they should be. If you concentrate on only 10 measures, you might prioritize those at the expense of common sense. For example, if a metric is to treat pneumonia within so many hours in the ER, we may end up doing that even before we decide if pneumonia is what the patient really has. That’s one of my main messages.”

Applegate says the ACR will comply with the Congressional P4P mandate and will have a dozen or so radiologic metrics ready by the end of this year. Some of them will deal with communication, such as how nonroutine findings are reported, she adds.

But, like UCSF’s Smith-Bindman, Applegate thinks a lot more research is needed to determine just which radiological tests are beneficial. “We don’t have a lot of research data to determine what quality indicators we should use for outcomes,” Applegate says, “but there are quite a few for structure and process.” It is those areas where P4P will first focus, she adds.

It might be good to ask private practices to verify such things as whether they use autoexposure controls to vary CT radiation doses for patients of different sizes; however, Applegate says she is against singling out individual practitioners, or even individual radiology groups, for stricture under any P4P scheme.

“My main opinion about P4P is that it’s best implemented at the health care system or health insurance level. Health care is so complex that it’s hard to tease out the impact of any one individual relative to the rest of the system,” she says. “I’m not big on withholding pay if people make a mistake. That’s very much against the whole idea of quality improvement.”

Kaiser’s Pentecost says there is much yet to think through about P4P, including its fundamental fairness. “Indigent hospitals always reap less when there is any reward,” he says, asking what if a hospital is running in the red and cannot afford a RIS to document performance measures. “The indigent always do worse on performance measures. All these [metrics] are communicational and language things with huge socioeconomic issues. Hypertension control is always worse among indigent patients, the screening rates for breast cancer are substantially different based on race. This P4P system won’t address a lot of baseline issues. Will it just be paying the well-to-do more money?”

Pentecost says it is also unclear how P4P funds will be distributed among various practitioners. “How do you handle it when somebody is covering for you on weekends? The way we break out the money is a huge problem.” If a patient is being treated jointly for both heart and lung disease, who gets the bonus, the cardiologist or the pulmonologist, he asks.

“There is,” Pentecost says, “a lot of suspicion by physician groups that P4P is an insurance strategy not to pay you at all if you don’t use the metrics. The physician groups and the insurance companies view each other with great suspicion.”

Despite concerns like these, as Applegate emphasizes, P4P is in play and cannot be ignored by radiologists. “We all have to be involved in developing these metrics, because if we aren’t, someone else, who is not a radiologist, will do it for us.”

George Wiley is a contributing writer for Axis Imaging News.