By Aine Cryts
Drinking two or more alcoholic drinks per day increases the risk of colon cancer among women. Taking estrogen plus progestins for more than five years increases women's risk of breast cancer. We have this guidance today because of the Nurses' Health Study, which has been capturing health-related data about women since 1976.
The Nurses' Health Study is part of Boston, Mass.-based Partners Healthcare, which is also home to the Physicians' Health Study. The Physicians' Health Study has captured data since 1982 and published 200 reports on the benefits and risks of taking aspirin and beta carotene for the primary prevention of cardiovascular disease and cancer. That's an enormous amount of data about a lot of study participants, says Brent Richter, associate director of information services operations at Partners Healthcare. His team is responsible for ensuring that the technology needs across Partners Healthcare's research organization are met.
One of the ways Richter's team plans to do that is with its "data lake," which will use storage and big data analytics technologies from Hopkinton, Mass.-based EMC and the company's partners, Pivotal Software and VMware, both of which are based in Palo Alto, Calif. "Being able to provide a single place to converge [our] data and also to develop analyses and analytics on top of it and create applications for patient care" is why Partners decided to pursue a data lake solution, says Richter.
Why a data lake?
Partners has many databases, says Richter. One of those is the Research Patient Data Registry (RPDR), which he describes as a centralized clinical data registry/warehouse that gathers information from hospital systems and stores it in one place. "The goal of the RPDR is to [bring] clinical information to a researcher's fingertips and [ensure] the security of patient information," according to Partners. Other databases – including PACS and pathology systems – store petabytes of data in separate, siloed systems, says Richter.
A data lake is different from a data warehouse in a few important ways. One of those is the way that a data warehouse is organized – namely, its "relational structure," says Richter. "Your data warehouse can be queried quickly to extract data, then you can provide that to other applications that are consuming that data or, in the case of the RPDR, where you're providing that to the investigators based on their research protocols," he says.
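To make the warehouse side of that contrast concrete, here is a minimal sketch of the kind of fast, structured extraction Richter describes. The table and field names are purely illustrative – the real RPDR schema is not public – and SQLite stands in for an enterprise warehouse.

```python
import sqlite3

# Hypothetical miniature of a clinical warehouse table; all names are
# illustrative, not the actual RPDR schema.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE lab_results (
    patient_id TEXT, test_name TEXT, value REAL, drawn_on TEXT)""")
conn.executemany(
    "INSERT INTO lab_results VALUES (?, ?, ?, ?)",
    [("p1", "HbA1c", 6.1, "2015-03-02"),
     ("p2", "HbA1c", 7.4, "2015-03-09"),
     ("p1", "LDL", 122.0, "2015-04-11")])

# Because the structure is fixed up front, extraction is a simple,
# fast query that downstream applications can consume directly.
rows = conn.execute(
    "SELECT patient_id, value FROM lab_results WHERE test_name = 'HbA1c'"
).fetchall()
print(rows)  # [('p1', 6.1), ('p2', 7.4)]
```

The point of the sketch is the trade-off: the query is quick precisely because the schema was decided before any data was loaded, which is what a data lake relaxes.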
By way of contrast, what Partners is trying to do with its data lake is bring all of its various databases together and then use that big data to improve patient care. Richter says the data lake is providing a place to combine the data held within the RPDR together with a lot of the other data it has access to – data that’s largely held in disparate systems – in order to “speed that time it takes for research to discovery to the clinic so that we can improve patient care and have better outcomes,” he says.
Relevant public data will be hosted within the data lake, whereas individual research centers will have their data hosted in their own data repositories, according to Partners. “Enterprise data will begin to flow into the [data] lake towards the end of this year and into year three as we work to incorporate access to all DICOM images and their metadata that’s now lying in the hospital’s PACS systems,” says Richter.
Thus, Richter describes Partners' data lake as "very much a learning platform and program." The project started last year, so Partners is now midway through the two-year effort. He expects to see it in production mode by the end of this year with the Massachusetts General Hospital (MGH) Cancer Center and the Center for Integrated Diagnostics at MGH, both of which are part of Partners. The pathology team at Partners is also very interested in working with the data lake.
Another key difference between a data warehouse and a data lake is the data you’re feeding into the data lake, says Richter. That’s largely because the data doesn’t need to be structured in a particular way to be “flowed” into the data lake. “A data lake will take what data’s in the data warehouse as found, and if you’re using some of the new technologies like Hadoop, you don’t have to structure the databases beforehand, like with your Star schema and fitting your data into that. You’re just putting your data into locations and then determining how you’re structuring your data once your data is in the system,” he says.
(Hadoop is an open-source software platform that enables organizations to manage big data. As Mike Gualtieri, principal analyst with Cambridge, Mass.-based Forrester Research, described it to Information Week,1 Hadoop makes it easier for an organization to store and process very large amounts of data.)
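Richter's "structure the data once it's in the system" point is what is often called schema-on-read. A minimal sketch of the idea, assuming raw records land as JSON lines (a list here stands in for files in a Hadoop-style store, and all field names are invented for illustration):

```python
import json

# Schema-on-read sketch: raw records from different source systems land
# in the "lake" as-is; no up-front star schema is required. The list
# below stands in for raw files, and every field name is hypothetical.
raw_landing_zone = [
    '{"source": "rpdr", "patient_id": "p1", "dx": "C18.9"}',
    '{"source": "pacs", "study_id": "s9", "modality": "CT"}',
    '{"source": "rpdr", "patient_id": "p2", "dx": "C50.9"}',
]

def read_with_schema(records, wanted_fields):
    """Impose structure at read time: keep only records that carry
    the fields this particular consumer cares about."""
    out = []
    for line in records:
        rec = json.loads(line)
        if all(f in rec for f in wanted_fields):
            out.append({f: rec[f] for f in wanted_fields})
    return out

# Two consumers project different schemas over the same raw store.
diagnoses = read_with_schema(raw_landing_zone, ["patient_id", "dx"])
imaging = read_with_schema(raw_landing_zone, ["study_id", "modality"])
```

Each research team can define its own projection over the same raw store, which is the flexibility Richter contrasts with fitting everything into a predefined warehouse schema first.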
Take the Nurses' Health Study, for example. "[These studies] have a rich body of knowledge," Richter says. "[Researchers] have been following these individual subjects over 20 years, and they've been following up every few years with questionnaires so they're collecting longitudinal data." With access to the petabytes of data, researchers will be able to tap into the RPDR and images currently stored within Partners' PACS and pathology systems.
(Click here to read part 2 of this article.)
1. Bertolucci, J. (2013, November 19). How to explain Hadoop to non-geeks. Information Week. Accessed August 20, 2015 via http://www.informationweek.com/big-data/software-platforms/how-to-explain-hadoop-to-non-geeks/d/d-id/899721.