Big Data Overload
The quest for knowledge in an era flooded with information
Every step you take, every move you make… Science can learn from you.
The tech revolution that has put iPhones in our pockets and a world of Google-able data at our fingertips has also been ushering in a golden age of health research. Take, for example, work being done by Thomas Glass, PhD, and Ciprian Crainiceanu, PhD, and their teams. They recently clipped accelerometers—smaller than iPhones—onto the hips of elderly research subjects. The devices can record people’s motions in detail, for indefinite periods and in real time if needed. The immediate aim, says Crainiceanu, a Biostatistics associate professor, is to devise a truer method of recording the physical activity of the elderly. But it’s the kind of approach that could turbocharge a lot of other health-related science. No more questionnaires, no more biased recollections, no more droopy-lidded grad students analyzing hours of grainy video. Just the cold, hard facts, folks. Just the data.
“In principle, we could take inputs from a wide variety of sensors—say, heat sensors, or portable heart monitors sending data by Wi-Fi or cell phones,” Crainiceanu says. “Our imagination is the limit.” And it’s not just portable gadgets that are making this possible. Brain imaging technology is still big and expensive, but its use is becoming more routine, and it can now deliver information on neural activity, density and connectivity for brain volumes on the order of a cubic millimeter. Next-gen genomics technologies can catalog DNA and gene-expression levels rapidly and with base-pair precision. Medical records are migrating to the digital and Web realms and are coming to contain ever more numeric and image-based detail. This gold rush of data gathering represents “an opportunity not just in terms of improving public health but also within biostatistics, for it gives us this tremendous new set of problems to work with,” says Karen Bandeen-Roche, PhD, MS, the Frank Hurley and Catharine Dorrier Professor and Chair of Biostatistics.
And the problems can be considerable. It’s not unusual for a public health study dataset nowadays to require a storage capacity on the order of 10 trillion bytes (10 terabytes), the equivalent of tens of millions of 1970s-era floppy disks. Larger datasets are inherently better in the sense that they have greater statistical power to overcome random variation (noise) in the data, just as 1,000 flips of a fair coin will reveal its true 50/50 nature far better than five flips will. In practice, though, large health-related datasets often contain a grab bag of information that isn’t always relevant and is distorted (biased) by hidden factors that may confound the savviest statistician. Moreover, traditional data collection, storage and analysis techniques can’t always be straightforwardly scaled up to terabyte levels. “How to design data collection properly, how to avoid bias, how best to represent a population of interest—these sorts of challenges may be even greater for the ultra-large datasets than for the more manageable ones with which we’ve traditionally dealt,” says Bandeen-Roche.
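That coin-flip intuition is easy to check with a few lines of code. The sketch below is a toy illustration, not anything from the studies described here: it repeats the five-flip and the 1,000-flip experiments many times and reports how far the observed heads fraction can stray from the true 50 percent.

```python
# Toy illustration of statistical power: more flips means the observed
# heads fraction stays much closer to the true 0.50. Standard library only.
import random

random.seed(0)

def heads_fraction(n_flips):
    """Flip a fair coin n_flips times and return the observed fraction of heads."""
    return sum(random.random() < 0.5 for _ in range(n_flips)) / n_flips

for n in (5, 1_000):
    estimates = [heads_fraction(n) for _ in range(500)]
    worst_miss = max(abs(e - 0.5) for e in estimates)
    print(f"{n:>5} flips per experiment: worst miss from 0.50 "
          f"across 500 repeats = {worst_miss:.3f}")
```

With five flips the estimate can land anywhere from all heads to all tails; with 1,000 flips it rarely wanders more than a few percentage points from 50/50.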
For Crainiceanu and his team, the goal was to turn days of raw, wiggly, three-axis accelerometer voltage readouts into meaningful interpretations of human movements. Such a task essentially attempts to reproduce—with an artificial sensor system plus software processing—the ability of higher organisms like mice or people to recognize individual movements amid the vast, noisy streams of visual and somatosensory signals coming into their nervous systems. It’s a big-data-processing skill that took us mammals tens of millions of years to develop, and even in furry, small-brained ones it involves myriad wetware layers of filtering and logic.
Crainiceanu saw the parallels to neural processing right away, and chose speech perception as a guiding analogy. “Movement is essentially like speech,” he says. “It involves units like words, which combine into meaningful sequences that are like sentences and paragraphs. So we started by processing the accelerometer data into the smallest meaningful movement units, which we called movelets.”
Movelets represent short bursts of motion data, roughly analogous to the phonemes that make up words. Breaking down the voltage readouts into movelets made manageable what would otherwise have been an ocean of data. “We sample the accelerometer data 10 times per second, so for three axes we’re gathering on the order of 30 observations per second,” says Crainiceanu. “And let’s say we want to monitor hundreds or thousands of people for a week, or a month, with their data continually being uploaded via the Web, for example.” His team’s movement-recognition algorithm essentially can crunch all these data—terabytes’ worth, for a large study—into relatively compact histories of distinct motions (now sitting … now getting up … now walking …), just as a speech recognition algorithm can condense a storage-hogging raw audio recording into a few pages of text.
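To make the movelet idea concrete, here is a toy sketch in Python. It is not the team’s published algorithm: the one-second window, the Euclidean distance measure and the two reference movements are assumptions chosen purely for illustration. But it shows the basic move, collapsing a torrent of raw samples into a short, readable history of labeled motions.

```python
# Toy movelet-style labeling: chop a tri-axial accelerometer stream
# (10 samples per second per axis, so roughly 30 numbers per second) into
# short windows and label each window by its closest reference "movelet."
# The window length, reference library and distance measure are illustrative
# assumptions, not the published method.
import math

SAMPLES_PER_SECOND = 10        # per axis; three axes give ~30 values per second
WINDOW = SAMPLES_PER_SECOND    # one-second movelets (an assumption)

def distance(a, b):
    """Euclidean distance between two equal-length windows of (x, y, z) samples."""
    return math.sqrt(sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
                         for (ax, ay, az), (bx, by, bz) in zip(a, b)))

def label_stream(stream, movelet_library):
    """Reduce a long list of (x, y, z) samples to a compact history of activity labels."""
    history = []
    for start in range(0, len(stream) - WINDOW + 1, WINDOW):
        window = stream[start:start + WINDOW]
        label = min(movelet_library,
                    key=lambda name: distance(window, movelet_library[name]))
        if not history or history[-1] != label:   # collapse runs: sitting, walking, ...
            history.append(label)
    return history

# Hypothetical one-second reference movelets: flat = sitting, oscillating = walking.
library = {
    "sitting": [(0.0, 0.0, 1.0)] * WINDOW,
    "walking": [(math.sin(i), 0.0, 1.0) for i in range(WINDOW)],
}
stream = library["sitting"] * 3 + library["walking"] * 5 + library["sitting"] * 2
print(label_stream(stream, library))   # -> ['sitting', 'walking', 'sitting']
```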
Crainiceanu’s colleague Rafael Irizarry, PhD, a professor in Biostatistics, faces a similar challenge when he helps biologists sift through gene-sequencing data. “Modern gene sequencing technology is generating such enormous datasets now that biologists are having a hard time saving it on disks; NIH has even been having meetings with experts in the field to figure out how we’re going to store all these data or whether it would be more cost-effective just to generate it again whenever we need it.”
Genomic datasets also can be devilishly hard to analyze. Modern sequencing devices typically generate raw data that represent the color and intensity of fluorescent reporter molecules linked to short stretches of DNA; these intensity levels have to be interpreted into “reads” of the GATC genetic code. Each of these short, not necessarily error-free readouts of DNA then must be pattern-matched to the right location on a three-billion-base-pair reference genome—a bit like finding the right spot for a tiny piece in a football-field-sized jigsaw puzzle.
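To see why brute force bogs down, consider this deliberately naive sketch, written for illustration only: it slides each short, possibly error-containing read along a reference sequence and keeps the position with the fewest mismatches. Real aligners rely on clever indexing of the reference precisely because this kind of scan cannot cope with three billion base pairs and hundreds of millions of reads.

```python
# Naive read alignment for illustration: try every position in the reference
# and keep the best one, tolerating a small number of sequencing errors.
def mismatches(read, reference, pos):
    """Count mismatched bases when the read is placed at the given reference position."""
    return sum(r != g for r, g in zip(read, reference[pos:pos + len(read)]))

def align(read, reference, max_mismatches=1):
    """Return the best (position, mismatch_count) pair, or None if nothing fits."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        errs = mismatches(read, reference, pos)
        if errs <= max_mismatches and (best is None or errs < best[1]):
            best = (pos, errs)
    return best

reference = "GATTACAGATTTACACCGGTTA"   # a tiny stand-in for a 3-billion-base genome
print(align("TTTACA", reference))      # exact hit       -> (9, 0)
print(align("TTGACA", reference))      # one read error  -> (9, 1)
```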
“When I first got one of these datasets,” Irizarry says, “I wrote my own little software routine to handle it and I ran it and waited … and then realized that it was going to take six months to finish!” Irizarry soon hired a computer scientist, Ben Langmead, MS, who has expertise in solving this kind of problem quickly. Their group, working with Johns Hopkins Medicine geneticist Andrew Feinberg, MD, MPH ’81, has since been putting out a steady stream of high-profile papers on the genetics and epigenetics of tumor cells. (Epigenetics refers to reversible DNA modifications that silence some genes and let others be active; derangements of the normal epigenetic patterns in cells may be as important as genetic mutations in promoting cancers.)
And then there is the uncertain value of some ultra-large datasets. “They often come with lots of complications and biases that don’t exist in smaller datasets,” says Scott L. Zeger, PhD, the former chair of Biostatistics who is now the University’s vice provost for Research. “A large observational study could be much less informative about the effects of a treatment than a smaller dataset from a placebo-controlled clinical trial, for example,” he says. Even among clinical trials, he adds, the traditional single-center study tends to be less noisy than the multi-center studies that are increasingly the norm in many areas of health research.
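Zeger’s point about hidden biases can be shown with a small simulation; the numbers below are invented for illustration and come from no real study. The treatment has no effect at all, yet the observational comparison sees one, because sicker patients are more likely to be treated. Random assignment breaks that link.

```python
# Illustrative simulation of confounding: a hidden factor (sickness) drives both
# treatment choice and outcome in the observational data, so a naive comparison
# of treated vs. untreated subjects is biased. The true treatment effect is zero.
import random

random.seed(1)
N = 100_000

def outcome(sick, treated):
    # Sicker patients do worse; the treatment itself does nothing.
    return -2.0 * sick + 0.0 * treated + random.gauss(0, 1)

def simulate(randomized):
    data = []
    for _ in range(N):
        sick = random.random() < 0.5
        # Observational: sicker patients seek treatment more often. Trial: coin flip.
        p_treat = 0.5 if randomized else (0.8 if sick else 0.2)
        treated = random.random() < p_treat
        data.append((treated, outcome(sick, treated)))
    return data

def naive_effect(data):
    treated = [y for t, y in data if t]
    control = [y for t, y in data if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)

print(f"observational estimate: {naive_effect(simulate(randomized=False)):+.2f}  (biased)")
print(f"randomized estimate:    {naive_effect(simulate(randomized=True)):+.2f}  (near the true 0)")
```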
Even so, the promise of all that data now encourages researchers to go where they might have feared to go before.
Brian Caffo, PhD, associate professor of Biostatistics, recently led a Johns Hopkins team in a competition to use neuroimaging data to predict ADHD diagnoses. The organizers of the ADHD-200 Global Competition gave Caffo’s team, and 20 other academic teams, structural and functional MRI data on 700 children to use in training their image-data-crunching algorithms. Then the teams were asked to use their algorithms to determine which of 200 new subjects had been diagnosed with ADHD.
“With multiple images per subject and multiple processing stages, we ended up handling trillions of bytes of data,” Caffo says. “But the predictive value of the imaging data turned out to be weak.” (In fact, a slightly higher-scoring algorithm devised by a University of Alberta team relied entirely on the handful of non-imaging data given, such as IQ, gender and age, and was disqualified by the judges for failing to adhere to the spirit of the competition.)
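For readers curious what such a prediction exercise looks like in code, here is a hedged sketch. The data are random stand-ins, not ADHD-200 data, and plain logistic regression stands in for the Hopkins team’s actual pipeline; the point is only the shape of the comparison, cross-validated accuracy from a large set of imaging-style features versus a few non-imaging variables.

```python
# Sketch of the competition's basic task with synthetic data: compare how well
# (simulated) imaging features and a few non-imaging variables predict a label.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_subjects = 700

y = rng.integers(0, 2, n_subjects)                      # synthetic diagnosis labels
imaging = rng.normal(size=(n_subjects, 500))            # high-dimensional, mostly noise
imaging[:, 0] += 0.1 * y                                # a faint signal buried in it
non_imaging = np.column_stack([
    rng.normal(100 - 3 * y, 10),                        # an IQ-like score
    rng.integers(0, 2, n_subjects),                     # gender
    rng.normal(12, 2, n_subjects),                      # age
])

for name, X in [("imaging features", imaging), ("non-imaging only", non_imaging)]:
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    print(f"{name:>17}: cross-validated accuracy ~ {acc:.2f}")
```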
Knowing what to leave out is definitely a part of the challenge of big datasets, Caffo says.
Bandeen-Roche couldn’t agree more. “Sound statistical thinking is needed as much as ever, or even more, to assure that what comes out of these tremendous technological resources are really valuable, valid findings,” she says.
Also needed more than ever, as these big-data challenges increase, are biostatisticians themselves. “The demand these days is always greater than the supply,” says Caffo. “In fact, statistics is often rebranded as something else—sabermetrics [baseball stat analysis] and Web analytics are two examples—in part because our field doesn’t produce enough people to fill the need.”
The intense math training needed, and the esoteric lingo—“Granger causality,” “Markov models,” “Pearson’s chi-squared test” and so forth—probably have something to do with it. “We’re also poorly branded,” Caffo says. “Biostatistics is actually one of the most exciting fields to go into right now.”