Matthew Stephens, a professor of human genetics and statistics at the University, recently received a major award to develop a method for ensuring that the data behind scientific analyses is preserved, accessible, and open to further research. He is one of 14 investigators who received a five-year, $1.5 million Data-Driven Discovery grant from the Gordon and Betty Moore Foundation.
Stephens’s project focuses on dynamic statistical comparisons. Increasingly complex statistical methods are being used to reveal relationships within large data sets. One application in genetics is identifying relationships between gene variants and blood cholesterol, which could lead to new diagnostics or better treatment options. Because such comparisons apply across many fields, making them easier to run could speed up the pace of discovery.
“The idea is to make statistical comparisons easily extensible, easy to update,” Stephens said.
However, these relationships are often published and never revisited. Sometimes the data isn’t available or the results aren’t reproducible, or the statistical software applied to the data isn’t distributed. This poses hurdles to future scientific discovery, as it makes it more difficult to compare results across multiple studies.
“A lot of work is done making these comparisons, but all that work gets forgotten and reinvented again and again,” Stephens said. “If a student in my lab is running a comparison [on the same data] that someone’s run before, unless the first person documented it really carefully, they’re going to have to reinvent the infrastructure.”
His long-term goal is to implement a generalized method for structuring and distributing all manner of statistical comparisons, which would require sharing all code and data. This would make reproducibility transparent and let new advances build on earlier ones more quickly, lowering the hurdles to further discovery.
“I hope to utilize existing platforms, like GitHub [an online platform for hosting and sharing code]. The infrastructure, in terms of these platforms, is just about ready,” Stephens said.
Stephens’s framework could make reproducible analyses easier to achieve and hasten the development of quality science.
“My primary interest in reproducibility is to make research more efficient by making it easier to build on what other people are doing,” he said.
Among other biological data, Stephens works with genetic association studies that aim to untangle relationships between genetic variants and observable phenotypes, like traits or diseases. In biomedicine, a better understanding of the relationships between genetics and other biomarkers can provide actionable health-related information. Several “big data” biotech startups are accumulating the largest-ever troves of genomic, metabolomic, and microbiomic data, betting that analyzing them will revolutionize health care.
In contrast to these proprietary efforts, Stephens supports open access and publishing in open journals, though he believes publishing in high-impact journals is also important. Stephens envisions a future without off-campus paywalls, where everybody has access to, and the ability to contribute to, the continually updated archives of human knowledge.
“It’s tricky because when you have a student or post-doc . . . their next career step may depend on what journals they publish in. . . . [It] is a far from ideal situation,” he said.
Stephens is designing a new course for the upcoming winter quarter. He has previously taught two statistics classes and, several years ago, a data analysis class.
“The goal of the class is to look at some of these issues and maybe make some progress on the problem of developing dynamic statistical comparisons,” he said. While the course is intended for second-year statistics graduate students, Stephens wants to “attract a wider array of people . . . particularly people who can use GitHub and R [a programming language] and would like to contribute.”
He intends to outline these plans on his new blog, randomdeviation.blogspot.com. In his first post, he says he created the blog both to share information about his new project and to have a lighter way to express his thoughts than peer-reviewed papers.