Picture archiving and communications systems (PACS) have solved many problems for diagnostic imaging. Although PACS have been most prevalent in radiology, other medical specialties such as cardiology and pathology are now beginning to see the benefits of electronic image review, distribution, and archiving.

The electronic PACS archive, like the film library archive, is the cornerstone of the diagnostic electronic-enabled radiology department. It is not surprising that electronic archiving is often overlooked in the early PACS procurement cycle. Like the film library, the PACS archive is often relegated to the basement, out of sight and out of mind.

The PACS archive as traditionally supplied by major vendors has been an afterthought add-on with little real consideration of the clinical needs of the medical profession. As commonly embodied, the PACS archive is a single optical jukebox that soon becomes full. The media are expensive and have failure rates of 3% or more. Many systems are single-threaded, which prevents concurrent archive and retrieval or efficient servicing of multiple concurrent users. How to optimize the electronic image archive-retrieval process has been an ongoing debate with many marginally successful iterations. It can be argued that a whole industry has been created in an attempt to ameliorate the shortcomings of most image server-archive systems. Concepts such as prefetching, mini-PACS, autorouting, compression, image management, and workflow management have all been proposed as means to mask poorly designed image and archive servers.

Few of the potential benefits of PACS can be fully realized without appropriate electronic archiving and efficient retrieval. Against this paucity of efficient archive and retrieval solutions, the Cleveland Clinic Foundation (CCF) prepared to begin a major expansion of its radiology-based PACS in 1996. A key realization by the CCF was the identification of the need for a data warehouse rather than an archive. The implied differences between these two types of devices have become crucial to the overall success of the CCF system.

More Than an Archive

There are four fundamental concepts that are critical to the long-term successful electronic image warehouse. The first and most fundamental of these is that the warehouse is much more than an archive. The reliable storage of information is an important but small part of the successful image data warehouse. The warehouse is optimized for data retrieval and distribution. It may include HSM (hierarchical storage management software), automated media management, and intelligent multi-threaded data retrieval.

The CCF Division of Radiology began storing MR and CT cases on MO (magneto-optical disk) in 1995. More than 2,000 MO media are archived offline on shelves. In May 1997, there were typically three to six retrieve requests each day into this data archive for comparison cases. The typical retrieval time was 30-60 minutes. In June 1997, archival to MO was augmented by an automated data warehouse with typical retrieval times under 5 minutes. By September 1997, archive to MO for both CT and MR was terminated. Imaging studies were being sent only to the warehouse. Of major note is the change in prior examination retrieval rates. By July 1997, more than 40 cases were being retrieved daily. By September 1997, more than 200 cases were being query retrieved daily. Subsequent deployments to vascular medicine, radiology/ultrasound, nuclear, and the cardiology catheterization laboratory have shown the same pattern of increased retrieval rates.

Readily available imaging has been a CCF cornerstone for providing uniform quality and timeliness of care across an increasingly diverse practice. CCF facilities include multiple sites in Ohio and Florida. The main campus comprises more than 30 square city blocks and performs more than 500,000 examinations yearly. Uniform medical care and efficiency must be delivered during both prime and on-call hours for both patients and physicians. The CCF image warehouse has contributed to an estimated 30% improvement in the efficiency of radiology professional staff as well as improved diagnostic certainty evidenced by the increased use of prior examinations for comparisons.

An important observation during this transition to automated data warehousing is shown in Table 1. At CCF, 40% of all new examinations require that comparison studies be pulled. Within the next 14 days, an examination has a 15% chance of being pulled. If the frequency of examination retrieval is weighted by the length of time the examination has been stored, the probability of time-weighted examination retrieval from the warehouse is nearly a constant. This is seemingly at variance with many prior conceptions that the importance of data decreases significantly as it gets older. The insight provided by Table 1 shows that the importance of a given piece of data does indeed become vanishingly small as it ages. But when aged data are considered as a group, it becomes just as likely that a physician will request an old study as a more recent one. This sheds new light on the question of data retention and warehouse design. Large, fast magnetic disk arrays (RAID) are often placed in front of a data warehouse to speed access for frequently (usually interpreted as recently) recalled cases. Table 1 indicates that no amount of disk space will be adequate to maintain rapid access over time to an inadequately designed data warehouse.

There is another reason to consider long retention periods in the data warehouse. It is relatively difficult in most PACS to segregate storage of patient studies by required legal retention period or medical need, for example, the legally required 21 years for a pediatric case versus the medical need for a bone fracture in a geriatric case. Table 1 clearly shows that old studies are of perceived value to diagnostic radiologists and will often be reviewed if readily available. The CCF has opted for lifelong retention.

Endless Capacity

The warehouse capacity must be appropriately sized. We found that 25MB per examination for radiology and 250MB per examination for cardiology was an accurate estimate of required image warehouse capacity. A brief note on compression is appropriate here. The CCF has a propensity to use reversible compression. This assures image quality at some additional expense in the capacity of the data warehouse. Compression has been a topic of research in medical imaging for more than 2 decades and is likely to remain so for another 2 decades. Figure 1 shows the cost of a typical fully configured warehouse for increasing storage capacity. The initial buy-in cost of an archive is relatively high but drops precipitously with increasing capacity for all technologies. When purchased up-front, incremental capacity is relatively inexpensive. This further obviates the argument for lossy compression in the archive.
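The per-examination sizing rule above reduces to simple arithmetic. The following is a minimal sketch in Python, assuming decimal units (1TB = 1,000,000MB). The function names are illustrative, and the cardiology volume of 50,000 examinations per year is a hypothetical placeholder, not a CCF figure.

```python
MB_PER_TB = 1_000_000  # decimal units, an assumption of this sketch


def yearly_storage_tb(exams_per_year: int, mb_per_exam: float) -> float:
    """Storage consumed in one year by one specialty, in TB."""
    return exams_per_year * mb_per_exam / MB_PER_TB


def warehouse_capacity_tb(specialties, years: int) -> float:
    """Total capacity for several specialties over a planning horizon.

    `specialties` is a list of (exams_per_year, mb_per_exam) pairs.
    """
    per_year = sum(yearly_storage_tb(n, mb) for n, mb in specialties)
    return years * per_year


if __name__ == "__main__":
    radiology = (500_000, 25)    # 25MB per examination, from the text
    cardiology = (50_000, 250)   # 250MB per examination; volume is hypothetical
    print(yearly_storage_tb(*radiology))                      # 12.5
    print(warehouse_capacity_tb([radiology, cardiology], 5))  # 125.0
```

Under these assumptions, a modest cardiology volume consumes as much storage as a much larger radiology volume, which is why both specialties need to be included in the estimate.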

Planned Migration

A critical technology consideration in warehouse capacity is the evolution of the media and drive technology. All drive technologies have at best a 5-year useful life in today’s rapidly evolving economy. Migration of all the data in the warehouse to new technology must be planned at least once every 5 years. The penalty for failure to migrate is increased service costs on obsolete equipment and potential loss of availability of data as media and drive technology becomes unmaintainable. Another factor, which impacts the capacity decision, is the industry growth trend for magnetic storage technologies. These technologies have doubled in capacity at constant dollars every 18 months for more than 2 decades. They are expected to continue this trend for at least another 2 decades. A data warehouse can be designed with planned data migrations to new technology, which will ensure that it will never run out of space. A capacity of two to three times the 18-month trend can be a good choice to assure technology non-obsolescence.
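The effect of the 18-month doubling trend on a planned migration can be worked out directly. This is a small sketch in Python, assuming the trend holds exactly; the function name and the 100-slot, 50GB cartridge example are hypothetical.

```python
def capacity_growth_factor(months: float, doubling_months: float = 18.0) -> float:
    """Factor by which capacity per constant dollar grows over `months`,
    assuming a steady doubling every `doubling_months`."""
    return 2.0 ** (months / doubling_months)


def capacity_after_migration(slots: int, gb_per_cartridge: float,
                             months: float) -> float:
    """Capacity in GB of the same robotic slots once refilled with media
    bought `months` later at the trend rate."""
    return slots * gb_per_cartridge * capacity_growth_factor(months)


if __name__ == "__main__":
    print(capacity_growth_factor(36))             # two doublings
    print(round(capacity_growth_factor(60), 1))   # one 5-year migration cycle
    print(capacity_after_migration(100, 50, 36))  # hypothetical 100 slots of 50GB
```

At this rate a single 5-year migration cycle corresponds to roughly a tenfold increase in capacity for the same number of media slots, which is what allows a warehouse with planned migrations to stay ahead of its own growth.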

Our initial archive size for the CCF was selected to hold 5 years of reversibly compressed cardiology and radiology examinations. This allows the warehouse manager to acquire newer, higher-capacity technology at reasonable cost after it has matured. Migration of data in the warehouse to this new technology doubles the capacity of the archive every few years. HSM software migrates data automatically from the server’s disk to media in the robotic warehouse. This software will be crucial to your overall technology strategy. Most HSM software is licensed by the number of managed media slots in the robotics. Increasing the capacity of media does not require an additional software license even though capacity continually increases.

The migration of the data in the warehouse can be straightforward and occur transparently as a background task. A key consideration is the transfer rate at which data can be migrated. The migration must occur significantly faster than new data are acquired. For medium to large sites performing more than 100,000 radiological examinations per year, technologies such as MO, CD and DVD will not be appropriate technologies. The read and write transfer rates of these devices are too slow to allow the migration of data for all but the smallest archives.
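The requirement that migration outpace acquisition can be checked with a rough calculation. The sketch below, in Python, assumes a single migration stream running 24 hours a day and, as a worst case, that newly acquired data join the pool still to be migrated. The function name, the 50TB archive, and the drive rates in the example are hypothetical.

```python
SECONDS_PER_DAY = 86_400
MB_PER_TB = 1_000_000  # decimal units, an assumption of this sketch


def days_to_migrate(stored_tb: float, drive_mb_per_s: float,
                    new_tb_per_year: float) -> float:
    """Days needed to migrate `stored_tb` while new data keep arriving.

    Returns infinity when the sustained migration rate does not exceed
    the acquisition rate, meaning the migration never finishes.
    """
    migrate_tb_per_day = drive_mb_per_s * SECONDS_PER_DAY / MB_PER_TB
    acquire_tb_per_day = new_tb_per_year / 365
    net_tb_per_day = migrate_tb_per_day - acquire_tb_per_day
    if net_tb_per_day <= 0:
        return float("inf")
    return stored_tb / net_tb_per_day


if __name__ == "__main__":
    # Hypothetical 50TB archive growing by 12.5TB per year.
    for rate in (0.3, 1.0, 10.0):  # sustained MB per second, illustrative only
        print(rate, days_to_migrate(50, rate, 12.5))
```

With these illustrative numbers, a device sustaining a few hundred kilobytes per second never catches up, while one sustaining 10MB per second finishes in about two months.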

The most likely archive medium for most warehouse applications is magnetic tape. Warehouse-quality tape media and drives are now available from a variety of vendors. These drive and media combinations exceed by two orders of magnitude the recording and retrieval reliability of MO and CD technologies. The tape technologies share several merits. They all have fast access times, as short as 10 seconds to the first byte. They all have high transfer rates, up to 11MB per second. They all store data on media that cost as little as $1 per gigabyte.

The CCF has just completed its first complete data migration. We migrated 4 years of data in approximately 6 months. A vendor-supplied background process ran almost unattended 24 hours a day concurrent with new data being acquired. Cost of the migration, including new media and drives, is typically 5-15% of the original system purchase price.

Retrieval Speed

The speed of retrieval of information from the warehouse is the single most important parameter to understand and optimize. Most processes that can be envisioned that add new data to the warehouse are non-time critical. Reliability of these tasks is the most important specification.

Selection and optimization of the image retrieval process for the data warehouse require significant consideration. A few of the most important issues are robotic access time, drive load and access time, and media transfer rate. All of these physical parameters must mesh efficiently with a piece of HSM software. Appropriate HSM should degrade raw media speed specifications by no more than 10%. This is a crucial requirement for the large data elements inherent in the medical image warehouse.

For medium to large facilities, it is relatively easy to predict many of the required characteristics of the warehouse. Smaller institutions may not require any read optimization of their warehouse. Generally, fewer special procedures, fewer surgeries, and fewer specialties mean fewer needs for the examination information after an initial diagnosis is rendered.

Experience at the CCF suggests that examination retrieval in 1 to 5 minutes is fast enough to be routinely and efficiently used by physicians. If the comparison case required is fetched prior to loading of the new examination, loading and initial review of the new examination typically take long enough that the comparison arrives in the background before or close to the time it is needed. The original warehouse examination retrieval specification used by the CCF was a 25MB examination available from the warehouse and transported over the network within 100 seconds. Few PACS systems today can use data as fast as such a warehouse can supply it. Subsequent advancements now make it easy to achieve 30 seconds or better. This specification includes all media loading, robotic movement, search to midpoint of the media, data transfer from media to intermediate disk storage, transfer of the 25MB study over a network at a minimum of 5MB per second, and finally replacing the media back on the shelf by the robot.
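The retrieval specification above is a sum of sequential steps, so it can be budgeted step by step. The following Python sketch assumes the steps run strictly one after another. Only the 25MB study size, the 5MB per second network minimum, and the 100-second target come from the text; the class name and every other timing in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RetrievalBudget:
    robot_fetch_s: float       # robot moves the media from shelf to drive
    drive_load_s: float        # drive loads and threads the media
    seek_to_midpoint_s: float  # search to the midpoint of the media
    media_mb_per_s: float      # transfer rate from media to intermediate disk
    network_mb_per_s: float    # transfer rate over the network
    robot_return_s: float      # robot replaces the media on the shelf

    def total_seconds(self, exam_mb: float = 25.0) -> float:
        """End-to-end time for one examination, all steps in sequence."""
        return (self.robot_fetch_s
                + self.drive_load_s
                + self.seek_to_midpoint_s
                + exam_mb / self.media_mb_per_s
                + exam_mb / self.network_mb_per_s
                + self.robot_return_s)


if __name__ == "__main__":
    budget = RetrievalBudget(robot_fetch_s=10, drive_load_s=15,
                             seek_to_midpoint_s=20, media_mb_per_s=10,
                             network_mb_per_s=5, robot_return_s=10)
    total = budget.total_seconds()
    print(total, "meets 100s spec" if total <= 100 else "misses 100s spec")
```

A budget of this kind makes it easy to see which step dominates. In this illustrative case the mechanical steps, not the data transfers, account for most of the time.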

Once the speed requirements for a single user are established, one must estimate the number of concurrent or bunched requests for data to be supported. The CCF with more than 500,000 examinations yearly can see more than 2,500 examinations in a typical 10-hour daily period. The CCF film library retrieves and distributes more than 2,000 cases daily. These estimates are also consistent with activity data from our radiology information system. The worst-case situation is at 2 PM, when it seems everyone wants everything stat. These figures and observations provide an estimated worst-case hourly retrieval rate by taking the peak retrieval hour as twice the average over a 10-hour day. This gives a peak range of 400-500 examinations per hour. This gross calculation makes no allowance for the optimizations possible with a multi-threaded system brought about by data sets residing on the same physical media. Many robotic systems are available that will do 100 or more actuations per hour. The system specified by CCF can sustain 400 retrievals per hour.
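The peak-hour estimate above is straightforward to reproduce. This short Python sketch uses the rule stated in the text, a peak hour equal to twice the average over a 10-hour day; the function name and parameter names are illustrative.

```python
def peak_retrievals_per_hour(daily_retrievals: float,
                             hours_per_day: float = 10.0,
                             peak_factor: float = 2.0) -> float:
    """Worst-case hourly retrieval rate: the daily average per hour
    multiplied by a peak factor."""
    return peak_factor * daily_retrievals / hours_per_day


if __name__ == "__main__":
    print(peak_retrievals_per_hour(2_000))  # 400.0, film library figure
    print(peak_retrievals_per_hour(2_500))  # 500.0, examination volume figure
```

The peak factor is the parameter most worth revisiting with local data, since it depends on how tightly requests bunch at a given site.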

The remaining factor to determine is the number of physical drives required to sustain your specified peak load. The most accurate way to do this is with a test program that cycles through randomly selected and positioned media reading 25MB of data. This will accurately provide the real number of accesses per hour for each drive. An example of this type of data is given in Figure 2. The CCF currently uses drives capable of processing a new examination every 22 seconds. This was calculated based on manufacturer specifications and confirmed during system testing before system acceptance. This drive is extremely fast and requires only three separate drives to achieve the full 400 actuations each hour. A practical system would add a fourth drive allocated to the write process to assure that a long data write does not slow the read performance of the system. Drives typically available on the market can vary widely. As the archive grows, it is highly probable that each access to the archive will come from a different piece of media. This behavior must be carefully understood to assure appropriate performance. The author has written a program that calculates the performance of a data warehouse based on manufacturers’ specifications and easily obtained usage information about the data. This program can be downloaded free from BRISERV.COM by selecting “downloads” and choosing LTAM. Individuals or institutions considering the purchase of a data warehouse may wish to calculate the performance and build a cost model for several different configurations using this program.
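The drive count follows from the per-examination service time and the specified peak load. The Python sketch below is a simplified calculation, assuming every retrieval needs its own media load and that drives work independently; it is not the LTAM program mentioned above, and the function names are hypothetical.

```python
import math


def read_drives_needed(peak_per_hour: float, seconds_per_exam: float) -> int:
    """Smallest number of drives whose combined throughput covers the peak,
    assuming one media load per retrieval and independent drives."""
    return math.ceil(peak_per_hour * seconds_per_exam / 3600)


def total_drives(peak_per_hour: float, seconds_per_exam: float,
                 write_drives: int = 1) -> int:
    """Read drives plus drives reserved for the write process."""
    return read_drives_needed(peak_per_hour, seconds_per_exam) + write_drives


if __name__ == "__main__":
    print(read_drives_needed(400, 22))  # 3
    print(total_drives(400, 22))        # 4
```

At 22 seconds per examination each drive handles about 163 retrievals per hour, so three drives cover the 400-per-hour peak and a fourth is held for writes, matching the configuration described in the text. Slower drives change the answer quickly: at 60 seconds per examination the same peak would need seven read drives.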

Reliability and Testing

Large 100- and even 1,000-terabyte robotic archives are common in the informatics industry with more than 10,000 installations worldwide. However, experience in the industry with large often-accessed medical warehouses is minimal. Few, if any, of these approach the 2:1 read-to-write ratio of medical imaging at the CCF. Even fewer require read optimization and the 24-hour-per-day availability required by the medical field.

Today’s magnetic media are miraculously reliable, with bit error rates of 1 in 10 to the 16th or one error in every 1,000,000 examinations. Modern software, meanwhile, continues to increase in functionality and complexity. It is now estimated and routinely observed in information services shops that eight of nine cases of data loss are caused by software. Only one in nine data corruptions is a media- or other hardware-related failure. Buying additional redundant hardware will not significantly improve system uptime, and making duplicate media copies will not significantly improve data reliability.

Many medical diagnostic applications can be designed to distribute examination data directly from the acquisition device to a co-located review station. This technique puts as few potential nodes of failure as possible between the acquisition of diagnostic information and the physician who must do a stat diagnosis. At the CCF, each modality that acquires diagnostic digital data forwards the data to a nearby workstation. This provides the simplest possible system to remain operational for stat and general system outage situations. Each review station automatically archives the data to the warehouse for general availability and potential redistribution. During an emergency or general system outage, the modalities can usually send directly to any of several different review stations.

The CCF adopted a rigorous test program prior to its initial data warehouse implementation. This program had predetermined requirements. Undegraded uptime was set at 99.5%. This corresponds to 4 hours per month of unplanned downtime. Data integrity was set at 1 in 10,000 examinations or 100 times more reliable than the best film libraries. Test scripts were developed in conjunction with the chosen vendor and equipment was tested exhaustively over a 90-day presale acceptance period. Prior to migration of the data to new media in late 1998, a similar but shorter test script was employed to assure both hardware and upgraded software reliability. Each new tape drive was exercised with 16,000 read-write cycles without failure over a 2-week period.
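The uptime requirement translates into an allowed downtime as follows. This is a minimal Python sketch assuming a 30-day month of 24-hour operation; the function name is illustrative.

```python
def allowed_downtime_hours(uptime_fraction: float,
                           period_hours: float = 30 * 24) -> float:
    """Hours of unplanned downtime permitted in a period at a given uptime."""
    return period_hours * (1.0 - uptime_fraction)


if __name__ == "__main__":
    print(round(allowed_downtime_hours(0.995), 1))            # per 30-day month
    print(round(allowed_downtime_hours(0.995, 365 * 24), 1))  # per year
```

For 99.5% uptime this gives about 3.6 hours in a 30-day month, in line with the approximately 4 hours quoted above, or a little under 44 hours over a full year.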

The CCF observes approximately one example of data loss each quarter. All but one occurrence to date have been directly the result of a software malfunction. This observed data integrity is two to three times better than the initial 1 in 10,000 specification. We have an ongoing availability and data reliability improvement program with our primary warehouse vendor. Highlights of this program include the following. To improve data reliability, 100% of the data on each piece of media will be verified after the media are marked full and read only. We believe that this procedure would have prevented all observed software data corruption and will improve our data integrity by a full order of magnitude. Maintaining system availability over 99.5% is straightforward but problematic. We have found it expensive to have a computer operator available on-site 24 hours per day to handle the occasional corrective actions required. A joint program with the vendor is investigating and will implement tools that will trap the most common system anomalies and automatically page or phone someone off-site. Recovery of corrupted media is also being addressed to improve data availability on-site. None of these improvements may be necessary for many sites. The CCF intends to continue to grow the number of users and capacity of its warehouse for years to come. Overall system reliability and data integrity must keep pace with this growth.

Putting It All Together

Several PACS and storage media companies offer turnkey data warehouse solutions for radiology and cardiology medical imaging. These vendors each have multiple sites installed and operational. Another major PACS vendor’s equipment has been interfaced to a data warehouse. There remains a great propensity among the medical vendors to continue to promote in-house private labeled optical disk, CD, and DVD future subsystems with which they are familiar. Prospective buyers of a data warehouse must be persistent and unwilling to settle for less than a complete solution. Since incremental capacity in the data warehouse is inexpensive, the economics of enterprise-wide or multi-specialty solutions can be improved greatly for an institution. Radiology and the cardiology catheterization laboratories will use nearly equivalent amounts of electronic data warehouse storage in a typical large medical institution. The cardiology echo laboratories run a strong third, followed by vascular medicine for electronic data storage needs.

Additional purchase considerations for a data warehouse are similar to other major medical equipment purchases. Availability of 24-hour-per-day service with on-site or in-city spare parts, software, documentation, and media needs to be specified early in the purchasing process for the data warehouse, as the warehouse soon becomes the heart of the PACS system. Physician and staff tolerance for delays in data availability decreases dramatically once the data warehouse is installed.

A written understanding of who will recover corrupt media, how long it will take, and how much it will cost should be obtained. This service is readily available free from major data warehouse manufacturers but can be a costly charge from medical vendors. Finance and leasing programs are available that allow acquisition of storage on a per-use basis. This option may be especially appealing to purchasers with limited capital funds or without the technical expertise or time to specify a directly purchased solution. Per-use acquisition has not been openly offered by any medical vendor, but it is readily available from warehouse manufacturers.

An electronic image clinical data warehouse has been developed to address the requirements established to provide real clinical imaging needs for the CCF. Many of the requirements and implementation details are common to many other health care organizations. Obsolescence and technology migration can be planned, controlled, and used to advantage if fully understood. Because software is one of the most critical components of this technology, medical data warehouse software and technology should be acquired from vendors with proven track records of supporting strategic software and hardware products over long periods of time.

NOTE: Suggested reading and sidebars on compression, prefetching, and threading will be available when this issue appears online at www.imagingeconomics.com.

Robert A. Cecil, PhD, is network director, Division of Radiology, Cleveland Clinic Foundation, Cleveland.