Correlation does not equal causation. So is correlation enough?
In the Wired article “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete,” Chris Anderson discusses how big data is affecting the scientific method. The scientific method is based on forming hypotheses and designing experiments to support or refute them. With massive amounts of data available, do scientists still need to follow this process?
You still need to have some idea what sorts of questions you want to answer, but the designing of experiments may be giving way to mining massive amounts of data to see what we can learn. Anderson argues:
“Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot….With enough data, the numbers speak for themselves.
In the article “What’s to be Done about Big Data?”, Gil Press discusses the book Big Data: A Revolution that Will Transform How We Live, Work, and Think by Viktor Mayer-Schönberger and Kenneth Cukier. Press agrees that correlations are enough in many situations:
The authors correctly say, “For many everyday needs, knowing what not why is good enough.” The book is full of such examples from making better diagnostic decisions when caring for premature babies to which flavor Pop-Tarts to stock at the front of the Walmart store before a hurricane. Big data can help answer these questions, but they never required “knowing why.”
This indeed is one of the key themes in the book, that “society will need to shed some of its obsession for causality in exchange for simple correlations: not knowing why but only what. This overturns centuries of established practices and challenges our most basic understanding of how to make decisions and comprehend reality.”
The trick is to have sophisticated correlation metrics. The simple linear metrics offered by most “analytics” packages, such as Pearson’s correlation coefficient and covariance, are of limited use for scientific research, because many physical and biological systems are not linear.
In the diagram below, the distributions in the bottom row all have zero linear correlation, so linear correlation metrics will not identify any relationship in the data, even though the distributions are clearly not random.
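Here is a minimal Python sketch of the same point, using made-up data rather than the distributions in the diagram. The helper name `pearson` and the sample values are illustrative assumptions. In the first case y is completely determined by x, yet the linear coefficient comes out at essentially zero.

```python
import math

def pearson(xs, ys):
    """Pearson's linear correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative data: x is symmetric around zero.
xs = [i / 100 for i in range(-100, 101)]
parabola = [x * x for x in xs]      # nonlinear, but fully determined by x
line = [2 * x + 1 for x in xs]      # linear, for comparison

r_parabola = pearson(xs, parabola)  # close to 0: the dependence is invisible
r_line = pearson(xs, line)          # close to 1: the dependence is detected

print(f"y = x^2    r = {r_parabola:.3f}")
print(f"y = 2x + 1 r = {r_line:.3f}")
```

The parabola fools the metric because the positive and negative halves of x cancel each other out in the covariance sum.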
P-values are the most commonly used measure of statistical significance in scientific research. Non-linear measures of dependence, such as mutual information, are necessary if we want to discover complex relationships in the data. Such measures are generally available only in specialized statistical software packages.
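As a rough illustration of what mutual information measures, the sketch below uses a simple histogram-based estimate. This is one common way to approximate it, not the method of any particular package. The function name `mutual_information`, the bin count, the sample size, and the random seed are all assumptions chosen for the example.

```python
import math
import random
from collections import Counter

def mutual_information(xs, ys, bins=10):
    """Histogram-based estimate of mutual information, in bits."""
    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # avoid zero width for constant data
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = to_bins(xs), to_bins(ys)
    n = len(xs)
    joint = Counter(zip(bx, by))  # counts per (x-bin, y-bin) cell
    px, py = Counter(bx), Counter(by)

    mi = 0.0
    for (i, j), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written in terms of raw counts
        mi += p_xy * math.log2(count * n / (px[i] * py[j]))
    return mi

random.seed(0)  # fixed seed so the illustration is repeatable
xs = [random.uniform(-1, 1) for _ in range(5000)]
dependent = [x * x for x in xs]                           # y = x^2
independent = [random.uniform(-1, 1) for _ in range(5000)]  # unrelated to x

mi_dependent = mutual_information(xs, dependent)
mi_independent = mutual_information(xs, independent)

print(f"x vs x^2:   {mi_dependent:.3f} bits")
print(f"x vs noise: {mi_independent:.3f} bits")
```

The estimate for the parabola should be well above the one for unrelated noise, which stays near zero apart from a small positive bias from binning. Unlike a linear coefficient, the measure responds to any kind of dependence, not just a straight-line trend.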
Research that lets the data do the talking has been incredibly expensive because the available statistical packages are based on 30-year-old code, developed at a time when massively parallel processing wasn’t possible. To do scientific analysis with these packages, supercomputers or clusters with thousands of nodes are required.
Statistical correlation metrics can drive innovations in every industry, once freed from the constraints of multi-million-dollar investments, the need for data scientists and reductionist approaches, and analysis times that can take weeks. As Chris Anderson says, the opportunity is great:
Learning to use a “computer” of this scale may be challenging. But the opportunity is great: The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.
Don’t have a supercomputer for your research? Simularity can let your data do the talking without the multi-million-dollar investment.