Similar to the fallacy of the appeal to authority is the fallacy of the appeal to big data.
The appeal to authority is the argument that because someone is an expert, his opinion is sufficient to dismiss other evidence. While there is a place for expert testimony, other evidence is still relevant even if it contradicts the expert’s opinion. For this discussion, I’m assuming that the expert has relevant expertise. Even so, his expert opinion is not sufficient to dismiss other evidence.
I see a mirror-image fallacy occurring when using big data. We are eager to treat findings in big data as sufficient to settle an argument. This is similar to the appeal to authority. The argument goes something like the following: because big data has had success in the past, this new big data result must be right. This very attractive presumption seems to be a major rationale for investing in big data in the first place. We want to use it to make decisions quickly. To make that quick decision, we are encouraged to rely on the argument that we trust the authority of the big data.
In the description of the expert testimony, I mentioned that it still has to be weighed against all other available evidence. Evidence typically is in some form of historical data. Generally, skilled investigators, researchers, or analysts will collect and select this evidence. There is an implicit application of authority for this data selected for evidence. In contrast, I want to focus on the discovered hypothesis that is suggested by some pattern observed in data that was collected and selected for some other purpose unrelated to the investigation.
In previous posts, I described using big data to discover patterns that may suggest new hypotheses. Unlike an investigation into a known event, this type of query may suggest that a previously unknown event has occurred.
Consider for example a data store about business transactions for a company. The same data can be used in two different ways.
- The first way is when the company is already suspected of committing some kind of fraud and skilled investigators search the data for evidence to back up or to refute that suspicion.
- The second way is when analysts are performing a routine study of a company about which we have no prior suspicion of fraud, but then discover a pattern that appears fraudulent.
Both examples use the same source data and may even produce the same result sets. The first case is building a case either for or against a prior hypothesis. The second case is discovering a new hypothesis.
It is that second case that presents the risk of the fallacy of the appeal to big data. I discussed in earlier posts that discovering a hypothesis suggested by patterns seen in data is useful but should be followed by a separate effort to test that hypothesis. Sometimes the only test available is to put the hypothesis into practice immediately, and in that case we are supposed to accept the risk of making a decision based on an untested hypothesis. The fallacy is making the same decision without recognizing the risk: we have unjustifiably high confidence that hypotheses from big data are simultaneously discovered and tested.
The two scenarios actually have very different processes.
In the first scenario, the collected evidence from big data will be combined with other evidence and testimony to come to some conclusion. The prior proposal of a hypothesis effectively sets up the expectation that the investigation needs to continue beyond finding corroborating evidence in historical data. Any newly found data will need to be challenged to make sure it is reliable, or supported with other evidence, in order to strengthen confidence in the data.
In the second scenario, the discovered hypothesis comes as a surprise. That surprise sets us up with a different expectation. Specifically, the surprise motivates us to act immediately on this information.
The following example may illustrate my point. In the past decade or so, there were a number of instances where reviews of corporate network logs identified some staff accessing Internet sites explicitly forbidden by corporate policy. In some cases, this discovery was sufficient for immediate disciplinary action against a staff member who had no prior record of any bad behavior. This is an example of an appeal to the authority of big data. A more valid approach would have been to use the discovery to open an investigation to corroborate the suspicion. Later we learned that there are completely innocent ways this can occur: pop-up Internet advertisements, changes in hosting of legitimate sites, or deliberate tricks to redirect to a bad site. Such redirections were especially common with social-networking sites that were discouraged on corporate networks.
When I call the appeal to big data a fallacy, I am saying that a hypothesis discovered in big data is insufficient justification for taking immediate action. The discovered hypothesis needs testing, as in the above example, where testing means opening an investigation. The alternative form of justification is to add our own judgment: to take the chance (and the responsibility) of being wrong.
In several previous posts, I discussed many ways that data can be misleading. The data store will include both observation and non-observation data. Some types of observed data are more reliable than other types. Similarly, some non-observed data may be generated by models, where some models are more trusted than others or where a model may have unexpectedly become obsolete. Some data may exist that has no basis for trust: either it is never tested or it is irrelevant to any causal relationship with the other data. There are plenty of reasons to suspect that any data query can return results that do not accurately describe the real world.
Also, with large enough data sets covering huge populations and a large number of dimensions, there is a risk of finding patterns that are so striking to our imaginations that we find them convincing. This is the basis of discovering hypotheses, but there is a real risk that the pattern is accidental and not representative of the real world. While it can be impressive to discover remarkable new hypotheses, there remains the task of testing these hypotheses by finding additional supporting evidence.
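To make the risk of accidental patterns concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption on my part: the data is pure random noise, the column and row counts are arbitrary, and the function names (`pearson`, `random_columns`, `strongest_accidental_pattern`) are hypothetical. It searches every pair of columns for the strongest correlation (the discovery step) and then checks that same pair against fresh observations (the separate testing step).

```python
import random
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def random_columns(rng, n_cols, n_rows):
    """Generate n_cols independent columns of Gaussian noise (no real relationships)."""
    return [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_cols)]


def strongest_accidental_pattern(columns):
    """Search all column pairs and return (i, j, r) for the largest |r|."""
    best = (0, 1, pearson(columns[0], columns[1]))
    for i, j in combinations(range(len(columns)), 2):
        r = pearson(columns[i], columns[j])
        if abs(r) > abs(best[2]):
            best = (i, j, r)
    return best


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only so the run is repeatable

    # Discovery: 100 unrelated dimensions, 30 observations each.
    columns = random_columns(rng, n_cols=100, n_rows=30)
    i, j, r = strongest_accidental_pattern(columns)
    print(f"discovered: columns {i} and {j} correlate at r = {r:.2f}")

    # Testing: fresh observations of the "same" two variables.
    # Since they are noise, new draws stand in for newly collected data.
    fresh = random_columns(rng, n_cols=2, n_rows=3000)
    print(f"retested:   r = {pearson(fresh[0], fresh[1]):.2f}")
```

With 100 dimensions there are 4,950 pairs to search, so some pair will look strongly related by chance alone even though no relationship exists. The retest on new observations is what exposes the pattern as accidental; the search by itself cannot.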
The hype and investment given to big data solutions recently have increased the incentive to prove their value by producing immediately actionable information. Immediate exploitation of information from big data is made easier by assuming that hypotheses can be simultaneously discovered and tested. The fallacy is thinking that big data eliminates our need to make a judgment: either to find more evidence (open an investigation) or to accept responsibility for the chance of being wrong.
This fallacy is like the fallacy of the appeal to authority. Both are insufficient to prove an argument. Both are very popular.