In earlier posts I suggested a better term for Big Data would be Crowd Data. The characteristics that distinguish big data from other data are similar to what distinguishes crowds from other aggregations of people. The challenges and risks are similar as well. Crowds can get unruly.
In a recent post, I suggested that the employment market for data scientists to deliver the promises of big data may be exaggerated, setting up a generation of grads burdened with college debt and with too little data science work to employ them. That argument hinged on the typical underappreciation of the macroeconomic capacity to find efficient use of computer science labor. In particular, the demand for big data solutions will be met by large solution vendors employing a relatively small number of people with skills in the data science specialty of computer science. Not every big-data-consuming company is going to staff up with software development teams to customize open source software in the same way merely to avoid licensing fees, especially since licensing models are increasingly attractive.
I have another argument for the crash of the data science skill market. The entire market for big data is poised for a crash. That will happen with the first well-publicized catastrophe of some bet based on big data going wrong through no fault of the application of data science. In short, our current optimism about big data exaggerates its ability to deliver only positive results.
I admit that I have not read the book Extraordinary Popular Delusions and the Madness of Crowds by Charles Mackay. I only glanced at the Wikipedia article because the book is the primary source of the discussion of Holland’s Tulip Mania in the 1630s. The actual facts of the event are controversial, but I’m thinking of its legendary market mania that led to extraordinary speculative prices for tulip bulbs. It is often presented as the first case of a capitalist market bubble that ultimately devastated many investors.
The current interest in big data, predictive analytics, and data science certainly has the appearance of a mania. This is especially true in the job markets. New jobs are demanding precise breeds of computer scientists, such as those with nearly a decade of experience in specific technologies (languages and libraries) used against large data sets. This seems similar to the example of someone in the 1630s offering 12 acres of land for a single bulb of a certain breed of tulip.
In addition, many widely followed accounts on Twitter or LinkedIn have career titles with big data or its various descendant technologies (or all at the same time). The content they are sharing is often a deluge of links to published articles or press releases about the game-changing effect that big data will have on every aspect of our lives. There isn’t a single thing in life that big data won’t change. I do encounter some articles that point to reasons to suspect this may not end well, but I don’t recall ever seeing these articles shared by those who promote the field in their career titles.
The potential for delusion comes with such a large community of people following each other’s announcements, which always cover the same ground: the latest advances that promise even grander opportunities than imagined earlier. There is a larger group of people following these thought leaders, hoping for a clue about how to get in on the action.
It is too early to tell if this will lead to some crash, but we can observe signs of mania. Big data is pervading all of our discussions, from market speculation on disruptive companies (who exploit big data), to revolutionizing government (as in this article), to changing the way companies are operated. Top-rated magazines, news sources, and web sites prominently promote the latest pronouncements about big data, often accompanied by an eye-catching graphic visualization implying hugeness of data. The concept of big data has penetrated deep into popular culture, and the message is almost entirely positive in the sense that big data cannot fail.
There are negative articles about big data in terms of disrupting more cherished businesses or traditions, or in terms of invading privacy or being too intrusive. However, even these articles are positive for big data in the sense that they assume that the intended big data project succeeds in achieving its objectives. Even Orwell’s dreaded vision in 1984 of Big Brother watching every keystroke is actually a success story in terms of the data technologies involved. We are encouraged to be confident of the inevitable success of big data even if we dread the consequences. Those hoping for a promising lifelong career may look forward to a successful career that enables the creation of an absolute authoritarian state. The point is not whether it will happen, but that if it were tried we are confident it would succeed. Such is our confidence in the promise of big data.
My bet is that our optimism will crash in the not-too-distant future, probably in time for the 2016 presidential elections. In the USA, these elections tend to provide the needles that pop bubbles.
Right now, all of the news is about case studies of early adopters who enjoyed big benefits from early adoption of some big data concept. Often these stories are presented as company X using big data to gain a huge market share at the expense of its competitors. Successor stories recommend that all companies use big data in order to remain competitive. We are at the stage of big data promotion where the benefit is the prevention of lost market share instead of the achievement of greater success. Companies need to invest in order to keep what they currently enjoy in terms of competitiveness. It appears companies are heeding this message.
The bubble-bursting event will come when there is a spectacular failure that is traced to a reliance on well-accepted prediction algorithms applied to well-accepted data. The bubble will burst when we realize we cannot rely on algorithms to make decisions.
The big data mania is driven by a vision of machines making decisions based on data that encompasses everything of relevance to the decision maker. Certainly there will be some human executive making the decision official, but he will have little choice but to obey the one justifiable decision illumined richly by some machine-generated visualization of all of the relevant data.
Sometime soon, someone is going to bet everything on what turns out to be nothing. It will catch everyone’s attention like Enron’s fall from being the world’s most innovative company to a failure that we now attribute to fraud. The collapse of big data mania will be precipitated by a similar collapse. The collapse will get everyone’s attention. The pronouncement of underlying fraud will poison the entire big data industry.
Fraud is a difficult term because it seems always to be applied in retrospect. We observe a fraud after it is found to have occurred. Occasionally, we may observe fraud as it is occurring, but usually we are tipped off by a history of something that we consider fraudulent. We rarely observe a fraud before it is executed. We recognize fraud by its consequences.
I do not doubt that big data is being employed today with fraudulent intentions. So far the fraud has not been apparent on a large scale. It seems most current frauds are by small players exploiting side effects of algorithms. Actually, it is unfair to call the natural consequence of using an exploitable algorithm a fraud. The term fraud itself may apply to such practices in retrospect when someone prosecutes a case out of it. Until it is labeled as a fraud, we should accept that someone will exploit for his own gain his ability to predict what a predictable algorithm will do.
In an earlier post, I commented on a presentation of a big data demonstration of managing a commuter train system. This system relies on cameras to observe and quantify crowds. This information can trigger algorithms to adjust train capacity based on the crowds. Using this example, a big data hacker could organize flash mobs to show up at the same time at different parts of the system and trick the algorithms into allocating more resources to carry passengers who depart before the trains arrive. The data hacker can then walk to a station and travel in comfort, with plenty of seats to choose from.
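To make the exploit concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the allocation rule, the `SEATS_PER_CAR` value, and the crowd numbers are hypothetical and are not taken from the demonstration I commented on. It only shows how a rule that trusts the camera count at decision time can be gamed when the crowd leaves before the train arrives.

```python
from dataclasses import dataclass

SEATS_PER_CAR = 50  # hypothetical capacity, chosen only for illustration


@dataclass
class Station:
    name: str
    observed_crowd: int   # people the cameras count when capacity is decided
    actual_boarders: int  # people still on the platform when the train arrives


def cars_to_dispatch(observed_crowd: int, min_cars: int = 2) -> int:
    """Hypothetical naive rule: send enough cars to seat everyone observed."""
    needed = -(-observed_crowd // SEATS_PER_CAR)  # ceiling division
    return max(min_cars, needed)


def empty_seats(station: Station) -> int:
    """Seats left unused once the people actually present have boarded."""
    cars = cars_to_dispatch(station.observed_crowd)
    return max(0, cars * SEATS_PER_CAR - station.actual_boarders)


# An ordinary evening: everyone the cameras saw is still waiting.
normal = Station("Central", observed_crowd=120, actual_boarders=120)

# A flash mob of 500 inflates the count and leaves before the train arrives;
# the one extra boarder is the data hacker who organized it.
gamed = Station("Central", observed_crowd=620, actual_boarders=121)

for label, station in (("normal", normal), ("gamed", gamed)):
    cars = cars_to_dispatch(station.observed_crowd)
    print(f"{label}: {cars} cars, {empty_seats(station)} empty seats")
```

The algorithm behaves exactly as designed in both cases. The difference lies entirely in whether the observed data still describes the real world by the time the decision takes effect.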
In many earlier posts, I discussed the study of data (I later named it dedomenology to distinguish it from the computer science use of data science) in terms of the ways that data can deceive us about what is really happening in the real world. The motivation of those posts was to defend the need for labor to scrutinize and criticize data to be sure it is relevant and that it remains relevant in the context of inevitable changes that occur everywhere. In those posts, I suggested this scrutiny labor presents a major constraint on the size of data because the scrutiny involves humans in the time-consuming task of studying and arguing. In many cases, the goals of big data can only be achieved by bypassing this scrutiny requirement, especially during the production phase of the project.
Big data projects rely on the principle that preproduction testing will eliminate any faults in the algorithms’ ability to properly handle the data. Production is automated with minimal demands for operator oversight. We can’t afford to have in production the data scrutiny activities analogous to the lengthy deliberative debates seen in courtrooms or academic disciplines. The underlying assumption is that such scrutiny is not needed in production. I argued in earlier posts that this scrutiny is needed during production.
If something objectionable happens, we may interpret the avoidance of this scrutiny during production as a form of negligence. It is a matter of future public opinion whether the omission of production-data scrutiny will be condemned as a fraud.
Eventually a big player will stumble due to a practice we may later disapprove of and describe as fraudulent. Certainly, a big failure is inevitable as a consequence of normal business cycles. Also certainly, we’ll investigate to figure out what went wrong. It is increasingly likely that the failed company relied on algorithms and data. Ultimately, we’ll blame the algorithms and the data. It seems inevitable that popular opinion will condemn as a fraud the specific practice that led to the failure. Because the practice is not unusual, the condemnation will poison the entire industry.
Big data practices could follow a trajectory similar to the mania followed by a crash experienced by the sub-prime lending practices that rattled the entire banking industry in 2008 (coincidentally a US presidential election year).
The effect of the failure will not be fatal to the industry, but it will definitely diminish the public’s confidence that big data cannot fail. We’ll be imposing more regulations on the practices of big data and on its practitioners. Alternatively, investors will be more cautious when it is apparent that big failures are possible. Certainly public opinion will not be as friendly.
There will be fewer jobs.
Edited 4/6/2015: Corrected year of banking crisis from 2004 to 2008.