The Science of Data Science

What’s a statistician to do with data from 4 million students?

By Rin-Rin Yu

Four million students. That’s the number of enrollments in the data science program taught by associate professor Jeff Leek, PhD, MS, alongside fellow Biostatistics professors Roger Peng, PhD, MS, and Brian Caffo, PhD, MS. They’ve been teaching the program since its 2014 inception on Coursera, a site that offers massive open online courses (MOOCs).

What’s a statistician to do with data from 4 million students? Analyze it, naturally. Leek and assistant scientist Leah Jager, PhD, MS, in collaboration with biostatistics PhD student Leslie Myint, started randomly assigning different quizzes to see if giving different sets of explanations or instructions would nudge students toward one interpretation of data over another. It did. In one experiment, students were asked if a study about smoking and lung cancer demonstrated correlation or causation; the correct answer was correlation. However, when the professors added a post-hoc explanation of the study results, more students incorrectly characterized it as a causal analysis. “People may misinterpret the data if you are not careful about the language you use,” Leek concludes.

With an NIH R01 grant to study data science education and practice, the researchers are building a platform to conduct larger experiments and expand beyond the pool of Coursera students. They’ll look at how people interact with data summaries, how to help people differentiate between good and bad data analysis, and how to improve data communication. They’ll also examine reproducibility, the idea that “you should be able to hand over your [programming] code to someone else and they can reproduce your paper,” Leek explains. However, “people have struggled with it,” he says, pointing to inconsistencies in how people document their steps and interpret results. The researchers plan to run experiments to evaluate streamlined ways to improve reproducibility.

The goal is to apply the same experimental rigor to data analysis that is already applied to data collection. “There’s a phrase: Data science is as much an art as it is a science,” Leek says. “Our goal is to make it less art, more science.”