The spurious correlations site has a lot of interesting charts showing arbitrary pairs of trends that correlate strongly yet have no rational basis to suggest causation. The site also has a nice feature for exploring other correlations: the hyperlinks on the chart titles lead to other trends that correlate with that topic. Hidden at the bottom of the main page is a link to an entertaining video that nicely discusses how correlation differs from causation. Some of my discussion concerns the points he makes in that video. The video expresses an optimism that humans will always be in the loop to insert sanity after just a brief moment of believing that there could be a causal relationship behind such compelling correlations, both in graphic form and in statistical values. The evidence I see, illustrated by the popularity of predictive analytics, is that this optimism is misplaced.
Predictive analytics provides technologies that use historical data to discover interrelated variables that can suggest predictions about the future. Consider the spurious correlation of the divorce rate in Maine with the US per capita consumption of margarine. This is a strong correlation both in chart form and in the correlation value. This data suggests several types of predictions beyond simply extrapolating the recent declining time-based trends. An updated value for the divorce rate in Maine can predict a new value for the US per capita consumption of margarine. Alternatively, the data suggests that if we make a policy change that alters the popularity of margarine, we can expect a corresponding change in divorces in Maine. Of course, neither prediction would survive a human scientist considering the questions with only this data available. The appeal of predictive analytics is the promise to eliminate the human scientist from these considerations.
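A minimal sketch shows how easily two unrelated trends produce an impressive correlation value. The numbers below are made up for illustration; they are not the actual Maine divorce or margarine statistics — the only real ingredient is that both series happen to trend downward over the same years.

```python
from math import sin, sqrt

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

years = range(10)
# Hypothetical values: both series decline, for entirely unrelated reasons.
divorce_rate = [5.0 - 0.1 * t for t in years]
margarine_lbs = [8.0 - 0.2 * t + 0.1 * sin(t) for t in years]

r = pearson(divorce_rate, margarine_lbs)
print(f"correlation: {r:.3f}")  # strong, despite no causal link
```

Any two roughly monotonic trends over the same period will score this well, which is exactly why a high correlation value alone tells us nothing about causation.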
The above site’s correlations are easily dismissed because each involves only two variables. It is easy to recognize the ridiculousness of most of the correlations highlighted on the site. However, what happens when predictive analytics suggests patterns involving many variables, perhaps dozens?
Can we even permit ourselves to argue with the computed confidence of predictions from sophisticated algorithms using dozens of variables with access to billions of data records? Predictive analytics offers the promise of actionable recommendations of a type not possible before such large data sets and sophisticated algorithms became available. Decision makers can take action based on these predictions, and the technology encourages this action. Their confidence is unlikely to be deterred by any human objection because the relationships and data are so complex. Perhaps a few people could effectively make a counterargument to the predictions, but they are unlikely to be employed by the decision maker.
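The problem compounds with the number of candidate variables. A small sketch, using nothing but synthetic random noise, illustrates why: search enough variables and some will always correlate with the target by chance alone.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(1)
n = 100  # observations per variable

# A target and 50 candidate variables, all pure random noise by construction.
target = [random.gauss(0, 1) for _ in range(n)]
candidates = [[random.gauss(0, 1) for _ in range(n)] for _ in range(50)]

best_r = max(abs(pearson(c, target)) for c in candidates)
print(f"best of 50 noise variables: |r| = {best_r:.3f}")
```

An analysis that screens dozens of variables will always find a "winner" like this, and its correlation score looks far more meaningful than the process that produced it.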
My earlier posts about the limitations of using historical data suggest a viewpoint that can at least raise some reason to question the results of predictive analytics. In addition to questioning the quality of the algorithms or the abundance of the data, I question the nature of the available data. In particular, the available data is limited by how that data became available. Assembling a multidimensional big data store involves collecting and matching data from sensors originally designed for some other purpose. In general, the original objectives for those sensors had nothing to do with any patterns discovered after the observations are combined into a big data store.
I emphasize the concept of discovered hypotheses as the natural final product of interpreting historical data that was collected for other purposes. A discovered hypothesis is a new idea, not previously imagined, that is suggested by patterns in data. These patterns are not necessarily limited to simple statistical correlations, but the above site’s spurious correlations do illustrate this weakness of the data.
I am asserting that the logical stopping point of a big data analysis is a discovered hypothesis. Using the above examples, we may apply some human thinking to identify possible explanations that give some merit to the hypothesis. One example is the negative correlation between honey-producing bee colonies and juvenile arrests for possession of marijuana. Perhaps we can imagine a causal relationship here, perhaps through some hidden variable, such as seeking home remedies for bee stings or some illicit use of honey that competes with marijuana. The next step is to find new data to explore these possibilities.
A discovered hypothesis is a hypothesis that passes a plausibility test to justify an investment in further testing. As I discussed in earlier posts, that testing may take the form of policy changes (such as creating incentives to change the number of honey bee colonies). Such decisions are valid in cases where we have no other practical option for setting up an experiment and we accept the risk of acting on a hypothesis that has never been tested.
The established practice of science is to prepare a new experiment that collects fresh observations specifically focused on testing a new hypothesis. The discovered hypothesis comes from historical data; the testing of that hypothesis should come from fresh observations, perhaps using improved sensors to better document the variables involved. The experiment would involve controlled variations to test the validity and causal properties of those variables. Testing a discovered hypothesis means going outside the preexisting historical data that discovered it to obtain fresh and carefully controlled observations specific to that hypothesis.
However, the recent trend is toward increased acceptance of exploiting the same historical data to run the experiments. The same data set that was used to discover a hypothesis is used to test that hypothesis. The discovered hypothesis is accompanied by the statistical tests that would have been used in a fresh experiment, but those tests use the same historical data. We increasingly accept that algorithms applied to big data can simultaneously discover and test hypotheses. The algorithms available to big data projects produce a tested discovered hypothesis with no need for a human scientist in the middle of the project and no need for costly additional experimentation.
Discovering and testing a hypothesis should be two separate activities involving independent data sets. In particular, the testing of a hypothesis deserves fresh data specifically controlled to expressly test that hypothesis. The discovered hypothesis is new, and thus the data that discovered it was never meant to address it. The spurious correlations site provides examples where the sources of data had their own independent justifications and optimized their methods for their own narrow objectives. Such data cannot add much confidence when it comes to testing a new hypothesis.
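The point can be sketched with synthetic data: discover the "best" variable on one batch, then compare a "test" on that same batch against a test on fresh, independent batches. Everything below is random noise by construction (these are not real measurements), so an honest test should show the discovered relationship evaporating.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(7)
n, k = 100, 50  # observations per variable, candidate variables

# Discovery: pick the candidate most correlated with the target.
target = [random.gauss(0, 1) for _ in range(n)]
cands = [[random.gauss(0, 1) for _ in range(n)] for _ in range(k)]
scores = [abs(pearson(c, target)) for c in cands]
discovered_r = max(scores)

# "Testing" on the SAME data merely repeats the discovery score.
winner = scores.index(discovered_r)
same_data_r = abs(pearson(cands[winner], target))

# Testing on FRESH data: new independent draws of the winner and target,
# averaged over several batches to smooth out sampling noise.
fresh_r = mean(
    abs(pearson([random.gauss(0, 1) for _ in range(n)],
                [random.gauss(0, 1) for _ in range(n)]))
    for _ in range(20)
)

print(f"discovered: {discovered_r:.3f}, same-data test: {same_data_r:.3f}, "
      f"fresh-data test: {fresh_r:.3f}")
```

The same-data "test" exactly reproduces the discovery score, confirming nothing; only the fresh draws reveal that the winning variable was noise all along.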
Predictive analytics, however, goes one step further by automatically applying the hypothesis to the data to present a prediction. The hypothesis is applied to the same data that was used to discover the hypothesis, which was also the same data used to test it. The objective of predictive analytics is to automate all of this with little or no human intervention for scientific scrutiny. Even where there is an opportunity to intervene, the complex nature of the multivariable relationships and the huge volume of data make it difficult to criticize the results. The volume of data and the sophistication of the algorithms also present a compelling case.
Most people on a project will be considered unqualified to question the results of such compelling predictions. The decision maker will be left with high confidence in his actions or in his approvals of policy changes.
Even if we lack the credentials to question the sophistication of the algorithms or the volume of data that supports them, we can object to the misuse of the same historical data to perform three separate tasks: discovering a hypothesis, testing that hypothesis, and applying that hypothesis to make a prediction about the future.