In my last post, I complained about the lack of immediately available data to provide social, economic, cultural, or government-operational context to a breaking news event. Typically, during such breaking events, we are overwhelmed with first-hand reporting of the most riveting immediate events. In the case of an urban protest or riot, there are at times more journalists covering the same event (from the whole variety of news organizations) than there are protesters.
Meanwhile, we have virtually no background information to put the events in context. In the case of the Baltimore riots, now nearly a week past, we are only now getting more in-depth reporting from reporters' research efforts: uncovering crime statistics, describing the history of the neighborhoods, and obtaining the opinions of residents who may not have been actively taking part in the protests.
My point in that post was that this information would have been far more valuable if it had been available immediately as the events were unfolding. To make that possible, we need a proactive investment in documenting each community so that these results can be retrieved as quickly as street-view images from online mapping sites.
I see an analogy between the initial reporting on the Baltimore riots and big data. In this case, the big data was the abundant multi-viewpoint reporting of the same events by different reporters and photographers from many different news organizations. In addition, we had the reporting from participants, bystanders, and social-media commentators. Together, these generated the 3 Vs of big data: volume, velocity, and variety. Since then this data has resulted in criminal charges against six police officers. This is still a developing story and the specifics are not the topic of this post. My point here is to note the abundance of data that came out immediately as the protests and riots occurred or were assessed in the following days. While not characteristic of all big data projects, this reporting definitely delivered its own version of big data.
Now, a week later, long-form journalism has begun to deliver more in-depth data with more background information in the form of interviews, history, economics, and experiences with local government, policing, and crime. This is beginning to fill in the details that were lacking to outsiders (such as myself) at the time of the initial protests. I expect to learn more in the coming weeks. This information is adding to the volume and variety of the initial reporting. This is also similar to big data projects in that such projects continually add new dimensions and more volume as time proceeds.
I want to focus on the interval of time between the initial reporting and the eventual arrival of more background data (providing new dimensions to the data). During this interval, we were operating on abundant data captured from eyewitness accounts and readily disseminated videos, and we felt encouraged to draw conclusions from this data alone. Among those conclusions were the filings of criminal charges against the six police officers involved. However, also during this period we were operating without the background data that is just now becoming available, with the expectation of more such background data to come later.
During this interval of the initial rush of information, before the arrival of more in-depth background information, the 3 Vs of big data about the events encouraged confidence in understanding what was happening and what should be done. The general population took that encouragement and came to conclusions based on the abundant information. Although there is some criticism that the state's prosecutor may also have followed that encouragement, my impression is that the greater justice system is moving more slowly, recognizing that all of the necessary information is not yet available, and also constrained by a long-lasting tradition of deliberation. I respect the justice system's appreciation of missing data, which results in a slow, tedious process of discovering that missing data. That appreciation of missing data was largely absent during the reporting and the popular discussion of the events in social media.
The title of this post suggests a metaphor: no matter how impressed we are by the size of big data in general, the big data project will always be a finite entity overwhelmed by the vastness of missing data. Big data is the ship; missing data is the sea. No matter how impressive the ship, the sea will always be far larger and always capable of sinking the ship. This is the problem I described earlier about the innovative criminal and the inability to predict human innovations. The problem is that there will always be vastly more missing data than obtained data. Missing data is the weakness of big data.
The problem we increasingly face is that being impressed by big numbers (petabytes, exabytes, etc.) encourages us to think we are converging on complete data. After all, according to that linked reference, all of the words ever spoken by humans amount to just 5 exabytes.
I’m convinced this overconfidence in the power of big data will be what gets us into trouble. In my thinking, even if big data grows as large as yottabytes, it will still be an insignificant ship on a sea of missing data. More practically, for the foreseeable future, our data will clearly be missing more data than we have available.
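To make the scale comparison concrete, here is a back-of-envelope sketch (the 5-exabyte figure is the estimate cited above; the yottabyte is used only as a stand-in for "very big" data stores):

```python
# Back-of-envelope arithmetic for the ship-and-sea comparison (illustrative only).
# SI prefixes: exa = 10^18 bytes, yotta = 10^24 bytes.
EXABYTE = 10**18
YOTTABYTE = 10**24

spoken_words = 5 * EXABYTE  # the oft-cited "all words ever spoken" estimate

fraction = spoken_words / YOTTABYTE
print(f"5 EB is {fraction:.0e} of a yottabyte")  # prints "5 EB is 5e-06 of a yottabyte"
```

Even a yottabyte-scale store would dwarf that 5 EB by a factor of 200,000, and the point of the metaphor is that the sea of missing data is larger still.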
A practical analogy is the criminal charges against the six police officers in Baltimore. We have much more to learn about the circumstances that led to Mr. Gray’s death and the practices and culture of the police and government systems that produced those circumstances. I argue we can be better about proactively collecting the relevant information, but even then there will be details we must wait for future investigations to uncover.
One such mystery is the cause of Mr. Gray’s fatal injury:
What it would take to break a person’s spine is heavy trauma. The spine is so guarded, so an injury like his would take a lot of force like jumping from a second floor building or getting hit by a motor vehicle. It doesn’t just happen out of nowhere.
The problem is that we (the public at least) have not been given a credible account of how he would have encountered such a force, especially for a hyper-extension neck injury. The van was not in an auto accident, and the severity of the injury was not obvious prior to his being placed in the van. If it involved some deliberate trauma by another policeman, it probably would have had to be done outside of the van with an obvious display that would at least have risked witnesses during busy daylight hours. It is also possible that the doctors are not fully aware of all the ways such an injury can happen across all the varieties of individual skeletons. Perhaps Mr. Gray was more than usually prone to this type of injury, which would have required far more violence in anyone else. That may be an extremely remote possibility, but I bring it up as another missing dimension: our limited knowledge of individual variety in human anatomy.
The risk of big data is that malicious people can exploit the over-confidence of big-data consumers (society, business leaders, political leaders, etc.), as I’ve already mentioned. The malice can also come in the form of manipulation to distract decision makers with what I called spark data. Spark data works because the population has already been conditioned to accept the volume of data (the number of reports) as indicative of urgency and validity. Spark data catches attention on a topic that has no solution capable of providing any possible benefit. Spark data ultimately gets people to overlook the other problems we could be discussing, problems that could result in beneficial solutions that happen to involve making painful choices. Spark data (deliberate distractions) can get us to overlook the missing data: the problems we have not even identified.
The bigger problem with the captivating power of big data is that it can mislead us into making hasty or wrong decisions because of our ignorance of the missing data. The Baltimore outcome of criminally charging six police officers for Mr. Gray’s death seems to be an example of such a decision, where the missing data includes not yet knowing the full range of defenses that each of the defendants may use during trial.
More worrisome is that a malicious actor may deliberately exploit the missing data to raise an accusation that he knows the accused will lack sufficient data to defend against. There may be a scenario where the prosecutor in Baltimore recognized the vulnerability of the police officers’ defense rather than proof of their guilt. This scenario pins the injury to the period of police custody, a period that also offers no exonerating data.
I think a better example is the deflate-gate controversy before the previous Super Bowl, where it was alleged that the Patriots deliberately deflated the footballs after officially registering the balls for proper inflation but before the balls were used in the game. The two observations were the official approval of proper inflation and the post-use observation of under-inflated balls. The case for deliberate deflation was that there was solid evidence the balls started off properly inflated and ended up improperly under-inflated. The accusation was effective because there was no data (there was missing data) for the balls’ pressure during the interim. In the lead-up to the Super Bowl, this controversy became intense enough to raise concerns that the matchup of the teams was illegitimate due to cheating. The accusation of cheating could not be readily rebutted with actual pressure measurements.
It is not hard to imagine more mischievous scenarios where a malefactor can recognize data that will necessarily be missing from a big-data store used for some analytic result. The strategy of attack is to make an inflammatory accusation of deliberate wrongdoing by recognizing there is no record for a change in some data essential to an analytic result. The lack of actual observational data (as opposed to model-generated data) to explain the data change can suggest cheating by the owner of the big data. The parties hurt by the recommendation from the analytics of that data could be inflamed by this accusation of cheating. In the context of businesses, this could result in lawsuits. In the context of government, this could result in protests or rebellion.
In many of the success stories of big data, part of the enthusiasm comes from recognizing how little data is needed. While we observe that there is a large amount of data, we also recognize that the data has gaps; for example, the data may be sampled. The result was some success in improved profits or accurate predictions. However, these early big data successes also benefited from their obscurity: people were not aware of the data being used until after the project’s success was announced. More recently, big data projects have become better known. Some projects involve continuous processes (such as program trading). In general, the population is becoming aware that everything leaves a digital trail, and they are assuming that trail will be used in big data analytics somewhere. The more knowledgeable of this group may be aware of exactly what data is being used (such as Twitter feeds or Facebook posts). This more knowledgeable group may anticipate the kind of analysis being done by a competitor (or by a mere target of their wrath). They can easily identify missing data that could, at least conceivably, change the analytics if specific values of this missing data had instead been observed.
If someone wants to cause trouble for the big-data owner, they can leverage the known missing data to raise accusations that the owner will not have any data to use in defense. The accusations can suggest cheating, fraud, criminal activities, etc. that can harm reputations or invoke costly and lengthy investigations, denying the owner the potential benefits of the big data analytics.