My last project involved a lot of data, especially in the context of the volume of data being processed at the time (over 10 years ago). Only recently did I notice the popularity of the term Big Data as an emerging concept. My first reaction was to think that I had been developing my own implementations of Big Data, in the sense that I custom built all of the pieces that would now be obtained from technologies like Hadoop or MapReduce. Because I paid no attention to these technologies at the time, I came up with my own approach, and it doesn’t translate well to the terminology. As a result, I am backing away from describing as Big Data my experience with large amounts of data arriving in high volumes, needing to be processed continuously, and providing up-to-date analysis. What I did appears very different from what I understand to be the modern interpretation of Big Data.
My experience was simply that I worked with data. Even though it was measured in terabytes, I approached the project in an older tradition: focusing my energies on the content of the data and trying to understand whether it was telling me something my client should know, or whether it was saying that something was wrong with the data or with how it was handled.
The goal of my project was to support decisions with long-term consequences. I supported a group tasked with simulation and modeling for long-term planning. With this objective, there is a much higher premium on a deep understanding and appreciation of the data and of each data set’s life cycle, from its sensing to its final resting place in tables available for analysis.
Data used over long periods of time to plan for distant objectives requires a diligent approach of continuously evaluating the data for its continued validity and relevance.
The popular depiction of Big Data concepts appears to be focused on much shorter-term objectives. Many of the promoted success stories involve marketing or financial wins achieved by exploiting discovered patterns before competitors do. Eventually, the competitors discover the same pattern and eliminate that advantage. As a result, the value of the data serves much shorter-term objectives. Because each decision has a very short life span, decisions occur much more frequently. With more frequent decisions there is more tolerance for any one decision producing disappointing results, as long as those results are not catastrophic. The performance of such efforts is measured as the average over many consecutive decisions. Success comes from making good decisions more frequently than poor ones.
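The averaging argument above can be illustrated with a minimal simulation. The numbers here (a 55% chance of a good decision, symmetric unit payoffs) are arbitrary illustrations, not figures from the original project:

```python
import random

random.seed(42)

def campaign(n_decisions, p_right, gain=1.0, loss=1.0):
    """Average payoff over many quick decisions, each correct with
    probability p_right. No single decision matters much; only the
    average across the whole campaign does."""
    total = 0.0
    for _ in range(n_decisions):
        total += gain if random.random() < p_right else -loss
    return total / n_decisions

# Being right just 55% of the time yields a positive average over many
# decisions, even though roughly 45% of individual decisions disappoint.
avg = campaign(n_decisions=100_000, p_right=0.55)
print(round(avg, 3))  # close to the expected value 0.55 - 0.45 = 0.10
```

The same 55% accuracy applied to a single long-term decision would fail outright almost half the time, which is the contrast the surrounding paragraphs draw.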
Contrast that with long-term planning, where there really is just one decision to make and everything depends on that one decision being right. I’m exaggerating a bit, because in fact such decisions evolve over time: there is one decision, but it is refined with the assistance of better or more information. The point is that there will be only one outcome, and it will either fail or succeed.
My experience working with large volumes of data from a wide variety of sources does not translate easily to the more popular notions of big data. I began to see a difference when I read that one of the defining qualities of big data is that it involves three Vs: high volume, high velocity, and high variety.
I just now checked a Wikipedia entry on Big Data and see that definition modified to say “velocity and/or variety”. Also, curiously, someone added a sentence suggesting a recent proposal to add a fourth V for veracity. That someone was not me, but I very much approve. I note that, at least at the time of this writing, the sentence suggests that only some organizations feel veracity is important.
There is a reason veracity, or conformity to facts, is not a central part of Big Data. Veracity gets in the way of making quick and frequent decisions. If each decision needs to be checked to verify that the data behind it is still a reliable description of the facts in the world, then the decisions necessarily come much more slowly. This due diligence for data is what I call data science: the science of fully understanding the available data.
I described this problem in an earlier post when I compared treating data to how a courtroom treats evidence for a trial. One thing about trials is that they take a long time, in large part because of the scrutiny placed on the evidence. Every piece of evidence is subject to challenge. Even seemingly infallible evidence is checked for possible flaws or misinterpretations. Veracity slows things down. But veracity is essential for long-term planning and decision making, where there is no opportunity to average across multiple decisions.
A large part of big data promotion is about volume, velocity, and variety. It is about making quick decisions to take advantage of the best guess of what is happening at the current moment. Such quick decision making, informed by recent data, promises benefits for sales, marketing, and finance. Fast decision making is essential to realizing these benefits: the winning competitor is the first one to discover and act on an observed opportunity. Again, a consequence of this speed is that these decisions occur at high frequency. What counts is the average result of being first with the right answer more often than being late or wrong. These projects have the luxury of tolerating the occasional mistake.
In earlier posts, I made a distinction between operational and historical data. Operational data is relevant to current events, so there is still an opportunity to influence those events. Historical data is no longer relevant to current events; its purpose is to reconstruct what happened in the past, with the possibility of learning lessons for the future. In those earlier posts, I described operational data as present-tense science and historical data as past-tense science. These are very different intellectual disciplines.
The present-tense scientist is very interested in the accuracy of data, but only data that is still relevant to the present. As soon as the data becomes irrelevant, the present-tense scientist has no further need for it.
The past-tense scientist retrieves this dismissed, irrelevant data and tries to reconstruct what happened. To the past-tense scientist, historical data remains relevant indefinitely. Because this data is often obsolete operational data, there was a brief time when it was considered the best representation of reality. However, past-tense science will reevaluate this data and challenge its veracity with other information or newer ideas. The project of evaluating the correctness of data is never complete: we still argue over 2,500-year-old firsthand historical accounts of battles.
Decision making for long-term planning shares a lot with past-tense science in that we need to continually reevaluate the data to be reassured the information matches reality. Old data is never immune to future efforts to discredit it.
The above model of operational data producing historical data as a by-product is what I imagine when I think about big data. Operational data is small because its relevance is strictly limited by the current moment. Historical data is big because it is the indefinite accumulation of all of the once-operational data. This definition is from the perspective of decision making where there is only one decision to make and no opportunity to average the outcomes of multiple decisions.
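The small-operational, big-historical split can be sketched as a toy model. The fixed window size here is an arbitrary stand-in for "relevance to the current moment":

```python
from collections import deque

class DataStore:
    """Toy model: operational data is a bounded sliding window of recent
    observations; historical data is the unbounded accumulation of
    everything that has ever passed through that window."""

    def __init__(self, window_size):
        self.operational = deque(maxlen=window_size)  # small: bounded by "now"
        self.historical = []                          # big: grows indefinitely

    def observe(self, record):
        self.operational.append(record)  # relevant to the current moment
        self.historical.append(record)   # kept forever for past-tense science

store = DataStore(window_size=3)
for t in range(10):
    store.observe({"time": t, "value": t * t})

print(len(store.operational))  # stays at 3, the relevance window
print(len(store.historical))   # 10, the full accumulation
```

The deque silently evicts the oldest record as each new one arrives, which is how the operational view stays small while the historical list grows without limit.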
The source of my confusion is that, in my mind, the high-velocity aspects of big data essentially make a big data project an operational project. Like operational systems, it uses recent data to make frequent decisions that impact current events. What is new about big data is that we previously designed sensors specifically to meet the requirements of the operational system. Big data instead reuses secondhand, hand-me-down data, very much like the data provided to historical science. The difference is that big data reuses this data in an operational setting.
The concept of velocity is to make frequent, quick decisions to take advantage of very brief opportunities. To me, this defines an operational system. In an earlier post, I described how operational systems use feedback loops to tolerate errors from non-ideal sensors. Big data is effectively another type of sensor for a different kind of operational system. Used in this context, big data can tolerate errors. I mentioned earlier the tolerance that comes from averaging the results of a large number of decisions, so that success depends merely on being right more often than being wrong. In practice, there are probably additional explicit feedback mechanisms, at least to limit the losses from bad decisions.
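The point about feedback loops tolerating non-ideal sensors can be shown with a minimal sketch. The proportional controller, gain, and noise level below are illustrative assumptions, not details from the earlier post:

```python
import random

random.seed(0)

def run_feedback_loop(target, steps, gain=0.5, noise=0.3):
    """A simple proportional feedback loop: each step corrects the state
    using a noisy measurement. Individual readings are wrong, but the
    repeated small corrections keep the state near the target anyway."""
    state = 0.0
    for _ in range(steps):
        measurement = state + random.gauss(0.0, noise)  # non-ideal sensor
        state += gain * (target - measurement)          # correct toward target
    return state

final = run_feedback_loop(target=10.0, steps=200)
print(round(final, 2))  # settles near 10 despite the noisy sensor
```

No single measurement needs to be trusted; the loop's tolerance for per-reading error is what lets an imperfect sensor, big data included, drive an operational system.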
Big data in the sense of the 3 Vs tolerates occasional failures of the data in order to participate in an operational mode. For operational systems, big data provides the innovation of introducing new types of data sensors based on what can be obtained from large volumes of a wide variety of data sources.
The motivation for this post is to address the tolerance operational systems have for incorrect or irrelevant data in order to achieve a higher velocity of frequent decision making. This is different from long-term decision making, where there would be much more investment in obtaining and verifying very relevant data.
High-velocity data is meant to be used in a feedback loop with current events. The high volume and variety of the data are meant to provide the broadest picture possible. Together, these suggest a strategy of finding crowds in order to take advantage of the dense population of opportunities.
The stories I hear are similar to machine trading on Wall Street, where there is an attempt to find the very beginning of a trend and make very short-term transactions to take advantage of the time it takes for the trend to propagate to the entire population. The other is to latch on to a hot new trend as early as possible so that advertising can have its biggest impact as newcomers join the crowd.
Crowds are a dynamic phenomenon. They grow rapidly, and their presence initially encourages more growth. But they also drift as they grow, redirecting their attention or motivation as the crowd becomes more recognized for what it is and who it contains. Taking advantage of the circumstances presented by a crowd requires some way to keep up with it, to know where it is at the moment in size and attention.
A lot of what I see about big data appears to be focused on this collective behavior: trying to find where the next action will occur and trying to get there before the competitors. If they guess right, they’ll win big. If they guess wrong, there is always the next opportunity.
Volume, velocity, and variety are about providing a data feedback loop to locate and follow crowds. It occurs to me that a better term for big data may be crowd data. The data itself represents a crowd. The concepts of volume, velocity, and variety describe the data in big data, but they also describe the individuals in a crowd.
It may be useful to model the information in big data using the analogy of a crowd. Even if the big data is used for other purposes, the data objects are like their own crowd.
I’m thinking about large crowds of people as an analogy to large crowds of data. The people may gather for a specific purpose, such as a very popular concert. The crowd as a whole changes its behavior as the concert proceeds, from the initial gathering, through the interaction with the performance and the neighboring audience, to the final dispersal. In terms of crowd behavior, no two concerts are alike; it all depends on the overall mood and the conditions of the venue.
A more vivid, and perhaps more relevant, example is the image of spontaneous crowds observed in recent political rallies involving popular uprisings or support for a very popular politician. The crowds assemble very quickly with little advance notice. Perhaps there is a core group that planned some gathering, but the large scale of the crowd comes as a surprise as more people join in, either to follow their friends or to find out what the action is all about.
Although the crowd gives the unifying appearance of many people in the same place at the same time, the participants may have little in common. Some may be there simply out of curiosity or out of obligation to accompany their friends. Some may be accidentally caught in the crowd by coincidentally being in the same spot. In very large crowds, different subgroups may have very different impressions of what the crowd is all about. In political gatherings, some may have a specific political objective, others a more general disagreement with the current regime, and still others may join the crowd simply so as not to be left out.
In either case, it is interesting to observe the progression of a crowd from its initiation to its usually inevitable dissipation. At some point the crowd disappears. For scheduled entertainment events, this is expected. But for political movements, this can be very frustrating when a change results from the movement. The change was motivated by the impression that the crowd would sustain it through continued widespread support. Instead, the entire movement usually evaporates, leaving only a change with no one around to do the work.
I make this analogy in terms of big data’s use of volume, velocity, and variety. The data that suggests a future of popularity may be behaving like a crowd. The next moment, the pattern disappears. It is not because we guessed wrong about the direction the crowd took; it is because the crowd itself completely disappeared. It is like hearing the big noise of a crowd but finding only empty streets when you reach the source of the sound.
The three-V emphasis suggests to me that Crowd Data is a more appropriate term than Big Data. Slowing down the process by imposing a fourth V (veracity) changes Crowd Data into Big Data.