Case Western Reserve University is one of three institutions nationwide to win federal “big data” grants focused on developing ways to ensure the integrity and comparability of the reams of information the US health care system collects every day. If successful, the work could create enormous new opportunities to glean insights that help physicians cure or even prevent illness and disease.

The potential of big data to improve treatment and outcomes has been well documented, as have the obstacles. Differences in systems, processes, categorization, and more make collecting and comparing electronic records feel like an overwhelming challenge. Even if necessary information can be gathered in a single place, scientists still have to be able to track it back to the original source—and also determine whether the figures reference the same condition, procedure, or result. Then consider the nearly unimaginable volume involved—researchers report that US health data totaled 150 billion gigabytes (GB) in 2011—and scientists’ estimates that about 80% of that figure is unstructured, yet clinically relevant.

The National Institutes of Health is striving to address some of these challenges with its Big Data to Knowledge (BD2K) initiative, which last month awarded nearly $7 million in grants across four major areas. Case Western Reserve’s award—which is just over $900,000—was in the category of data provenance; other recipients were the University of Pennsylvania and Duke University. In essence, data provenance seeks to establish where the information started and what changes, if any, were made between its origins and its current state. It also involves making sure the data is consistent and reproducible.

School of Medicine assistant professor Satya Sahoo, PhD, leads a team that includes researchers from Harvard and UCLA. Together they are developing a platform that collects and analyzes disparate clinical information from multiple sources and ensures that the datasets measure the same things the same way. This comparability is key to drawing conclusions that, in turn, can enhance approaches to care.

“Informatics will become an increasingly key component of health care and medical discovery in the United States for the foreseeable future,” says principal investigator Sahoo, a member of the faculty of the Department of Epidemiology and Biostatistics, in a release. “We hope that the provenance platform developed as a result of this grant will constitute an important part of the informatics infrastructure needed to realize the potential of data-driven research in health care.”

In the first phase of the project, Case Western Reserve computer scientists will develop the provenance engine via PROV, a new web technology standard developed by the World Wide Web Consortium (W3C) to facilitate interoperability. In the second phase, investigators will evaluate the performance of their provenance engine by using real-world biomedical big data from sleep, epilepsy, and lung cancer cases. The second phase will present an opportunity to identify problem areas or reveal new functionalities that could be incorporated into the provenance engine to make it even more effective.
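To make the idea concrete, here is a minimal sketch of the kind of bookkeeping a provenance record involves. It borrows three relation names from the W3C PROV data model (wasDerivedFrom, wasGeneratedBy, wasAttributedTo), but everything else is an assumption made for illustration: the `ProvenanceLog` class, the dataset names, and the processing steps are hypothetical and do not describe the engine Sahoo's team is building.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceLog:
    """Toy provenance store; relation names follow the W3C PROV data model."""
    was_derived_from: dict = field(default_factory=dict)   # entity -> source entity
    was_generated_by: dict = field(default_factory=dict)   # entity -> activity
    was_attributed_to: dict = field(default_factory=dict)  # entity -> agent

    def record(self, entity, source, activity, agent):
        """Note that `entity` was produced from `source` by `activity`, run by `agent`."""
        self.was_derived_from[entity] = source
        self.was_generated_by[entity] = activity
        self.was_attributed_to[entity] = agent

    def lineage(self, entity):
        """Follow wasDerivedFrom links back to the original source.

        Returns the original entity and the list of (entity, activity, agent)
        steps, most recent first.
        """
        steps, seen = [], set()
        while entity in self.was_derived_from and entity not in seen:
            seen.add(entity)  # guard against accidental cycles
            steps.append((entity,
                          self.was_generated_by[entity],
                          self.was_attributed_to[entity]))
            entity = self.was_derived_from[entity]
        return entity, steps


# Hypothetical example: an EEG recording that is de-identified, then resampled.
log = ProvenanceLog()
log.record("eeg_deidentified", source="eeg_raw",
           activity="deidentify", agent="site_a_pipeline")
log.record("eeg_resampled", source="eeg_deidentified",
           activity="resample", agent="analysis_team")

origin, steps = log.lineage("eeg_resampled")
print(origin)  # eeg_raw
for entity, activity, agent in steps:
    print(f"{entity} <- {activity} by {agent}")
```

Walking the chain in this way answers the two questions provenance is meant to address: where a dataset started, and what was done to it along the way.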

Working with de-identified patient case data, Sahoo’s team will collaborate with Samden Lhatoo, MD, professor of neurology, Case Western Reserve University School of Medicine, and director of the Epilepsy Center, University Hospitals Case Medical Center; Susan Redline, MD, professor of sleep medicine, Harvard Medical School, and director of the Sleep Medicine Epidemiology Program, Division of Sleep Medicine, Brigham and Women’s Hospital, Boston; and William Hsu, PhD, medical imaging informatics expert in lung cancer, and assistant professor, Department of Radiologic Sciences, University of California, Los Angeles.

Lhatoo and Redline collect and store different kinds of data, but neurological similarities exist between epilepsy and sleep disorders. Lhatoo’s particular expertise is sudden unexpected death in epilepsy (SUDEP).

“Epilepsy is often a nocturnal phenomenon, and there is close association between sleep and seizures,” Sahoo says. “It is known that sleep stages affect expression of epilepsy, and sleep has an impact on the poorly understood occurrence of SUDEP. So there are significant benefits to correlating epilepsy with sleep research data.”

The goal is to integrate datasets from previous and ongoing studies at Case Western Reserve and Harvard. Lhatoo and Redline are particularly interested in studying data from radiology images, signal data from electroencephalograms (EEGs), and patient discharge summaries. They can use the combined data to create their own study cohorts or to compare their own study results. By pooling these sources, the provenance platform will provide a sample large enough to support valid research or treatment conclusions.

From UCLA, Hsu will feed his radiologic imaging datasets of lung cancer patients to Sahoo’s team members, who in turn will incorporate the data into the provenance engine. UCLA radiation oncologists and other lung cancer researchers will then use the provenance engine to access quality data and evaluate their research.

“Data provenance is a key focus of my medical informatics and big data research,” Sahoo says. “Also, I had been collaborating with Dr. Lhatoo on computerization needs for SUDEP research at Case Western Reserve, so it made sense that I propose something for data provenance that would incorporate SUDEP.”

Sahoo credits Case Western Reserve’s exceptional computing capability as a major contributing force in landing the grant. The university’s High Performance Computing initiative (HPC) will enable storing unified data and running algorithms to test the performance of Sahoo’s provenance engine before making it available to the public.

The data provenance grant joins other federal funding to Case Western Reserve. Bioinformatics and computational biology researcher Mehmet Koyuturk, PhD, received a $1.3 million NIBIE grant for theoretical foundations and software infrastructure for biological network databases. Koyuturk is associate professor in the Department of Electrical Engineering and Computer Science at Case Western Reserve.