Recently, big data has been promoted as offering the potential to solve nearly every ill mankind faces. With abundant data and fast algorithms, we can discover patterns quickly enough to intervene while there is still a chance to prevent a tragic outcome.
While there is real potential benefit in having lots of data and the technical capacity to process it, the big data project may hinder progress by restraining debate to whatever data is considered permissible to support a single version of truth.
In an earlier post, I discussed a conflict between the goals of a data warehouse and the forensic data needs of operations.
The goal of a data warehouse is typically described as presenting a single version of truth. The data in a data warehouse are the consensus-accepted best available representation of the facts. Also, the relationships between data in the warehouse must have established, fixed referential integrity. In particular, the concept of a single version of truth means there will never be conflicting data. An alternative description of the data warehouse goal is to present a single authoritative data source that resolves all questions. The goal of data warehouses is to end arguments.
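To make the idea concrete, here is a minimal sketch, assuming hypothetical customer records and a made-up trust score, of how a warehouse-style load might resolve conflicting facts down to a single surviving record:

```python
# A minimal sketch (hypothetical records and rules) of the "single version of
# truth" idea: conflicting facts about the same entity are resolved to one
# consensus record, and the losing versions simply disappear.
customer_records = [
    {"customer_id": 42, "email": "a@example.com", "source": "crm",      "trust": 3},
    {"customer_id": 42, "email": "b@example.com", "source": "web_form", "trust": 1},
]

single_version = {}
for rec in customer_records:
    key = rec["customer_id"]
    # Keep only the most trusted record; the conflicting version is discarded.
    if key not in single_version or rec["trust"] > single_version[key]["trust"]:
        single_version[key] = rec

print(single_version[42]["email"])   # only one answer survives: "a@example.com"
```

The point of the sketch is not the particular rule but that a rule is applied at all: once the load finishes, the conflicting record is gone and the argument it could have supported is gone with it.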
It is similar to a growing trend within our culture of wishing arguments would simply go away. There should be a single truth that we can all agree on. I think much of the hype about the promise of big data efforts has as its foundation this faith that a single version of truth is available to us. All we need is the technology to establish this universal truth once and for all, making it authoritative enough to end disagreements about alternative, conflicting data.
In contrast, the goals of business operations are very different. Operations must confront conflicting information and resolve those conflicts. Operations must master the problem of multiple versions of the truth. An alarm may go off because some measure went beyond some threshold, but the alarm itself may be faulty, or the underlying cause of the fault may be far removed from the source of the alarm.
In IT data systems, there is a recent boom in SIEM (security information and event management) tools that specialize in operational data, collecting data that can contain overlapping and conflicting information. Although SIEM uses big data technologies, it uses them very differently than data warehouses do. In particular, the goal is not to resolve an argument with a single agreed version of truth. Instead, the goal is to facilitate investigations into the various competing versions of truth in order to uncover an understanding (a truth) that may not even be in the data.
The difference between data warehouses and SIEM is similar to my earlier discussions of the difference between historic data (what actually happened) and missed opportunities (what could have happened). SIEM focuses on determining what actually happened based on all the possibilities of what could have happened. The data warehouse focuses on capturing only what actually happened. As I mentioned in earlier posts such as here, determining what actually happened is a process of hypothesis discovery that investigates all of the possibilities of what might have happened.
The possibilities of what might have happened include the raw data. As I mentioned above, the possibilities also include missing data. The available data may imply that something occurred that never got recorded as data. For example, we may have data informing us that several servers failed within a short period of time, and later determine that there was likely a problem with the cooling of the room housing the servers, at a time when room temperature was not automatically monitored.
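As a rough illustration of that kind of inference, the following sketch (with hypothetical server IDs, timestamps, and thresholds) flags a possible unrecorded environmental cause when several failures cluster in time:

```python
# A minimal sketch of hypothesis discovery from available data: several server
# failures close together in time suggest a shared, unrecorded cause (such as
# cooling), even though no record of that cause exists.
from datetime import datetime, timedelta

# Assumed raw failure records: (server_id, failure_time)
failures = [
    ("srv-01", datetime(2014, 7, 1, 14, 2)),
    ("srv-07", datetime(2014, 7, 1, 14, 5)),
    ("srv-12", datetime(2014, 7, 1, 14, 9)),
    ("srv-03", datetime(2014, 7, 2, 9, 30)),
]

WINDOW = timedelta(minutes=15)   # window suggesting a shared cause
THRESHOLD = 3                    # failures within the window to raise a hypothesis

events = sorted(failures, key=lambda f: f[1])
for i in range(len(events)):
    cluster = [e for e in events[i:] if e[1] - events[i][1] <= WINDOW]
    if len(cluster) >= THRESHOLD:
        servers = ", ".join(s for s, _ in cluster)
        print(f"Hypothesis: shared environmental cause (e.g. cooling) "
              f"around {events[i][1]} affecting {servers}")
        break
```

The output is not a fact in the data; it is a candidate explanation that an investigator would then argue for or against.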
The fundamental difference between data warehouses and SIEM (and its peers elsewhere) is the treatment of clean data. Data warehouses demand clean data. SIEM and its kin accept all data as is. For forensic purposes, we need access to all of the data, and we need to be free to rearrange and re-prioritize that data, fitting it together as needed to derive what most likely was the actual event, an event that might not have a record of its own.
The ETL (extraction, transformation, load) operations for the two projects are very different. For data warehouses, there are strict business rules that enforce the integrity of the data. For SIEM-like tools, all data must be preserved as close to its original form as possible. For SIEM-like tools, we need not only the data but also the context of the data.
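A minimal sketch of that contrast, assuming a made-up schema, reference table, and business rules, might look like this: the warehouse load rejects or conforms records, while the SIEM-style ingest keeps the raw line along with its context.

```python
# A minimal sketch (hypothetical schema and rules) contrasting the two ETL styles.
from datetime import datetime, timezone

KNOWN_HOSTS = {"srv-01", "srv-07"}   # assumed reference table for integrity checks

def warehouse_load(record: dict) -> dict:
    """Warehouse-style ETL: enforce business rules and reject anything that fails."""
    if record.get("host") not in KNOWN_HOSTS:
        raise ValueError("referential integrity violation: unknown host")
    if record.get("severity") not in {"info", "warn", "error"}:
        raise ValueError("business rule violation: invalid severity")
    # Conform to the warehouse schema; anything not in the schema is dropped.
    return {"host": record["host"], "severity": record["severity"],
            "event_time": record["time"]}

def siem_ingest(raw_line: str, source: str) -> dict:
    """SIEM-style ingest: preserve the record exactly as received, plus its context."""
    return {
        "raw": raw_line,                                        # untouched original
        "source": source,                                       # where it came from
        "received_at": datetime.now(timezone.utc).isoformat(),  # when we saw it
    }

# The warehouse load may refuse a record outright; the SIEM ingest keeps it
# regardless, leaving interpretation to a later investigation.
print(siem_ingest("Jul  1 14:05 srv-99 kernel: thermal event", source="syslog"))
```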
In recent posts, I have frequently equated historical data with the legal term evidence. I want to treat data in a way that is analogous to how evidence is treated. I am sure there will soon be increasingly serious legal scrutiny of how data is handled as evidence to support decision making, especially when the decisions cause an adverse impact on some party.
The data warehouse's project of presenting a single version of truth opens it to challenges about its veracity. Unfortunately, the data warehouse will be poorly equipped to address such challenges because it immediately eliminates the dirt in the data that presented less credible versions of the truth. In contrast, the goal of the SIEM approach is to preserve all data. The SIEM approach indefinitely postpones the decision about what actually happened. It offers the perpetual opportunity to change opinions about what happened because it preserves equal access to multiple versions of the truth.
Changing opinions involves arguments. As long as there are multiple versions of the truth, there will be disagreements about which interpretation is closer to what actually happened. Arguing is a core part of the forensic project, and it is very comparable to the kinds of academic arguments about interpreting historical evidence and to the kinds of arguments encountered in courts. For this reason I described data science (which I later named dedomenology) as a member of the historical sciences, where the focus is on interpreting ambiguous evidence.
This view of data as evidence has a consequence for what is admitted into long-term storage. If all data is evidence, then all data needs to be preserved. Also, as evidence, the data should be stored as close as possible to its original form. This is analogous to storing crime scene evidence, where cleaning is minimized to whatever is necessary to assure manageable handling and long-term preservation. Much of the evidence may be dismissed as irrelevant, but even irrelevant evidence may be retained in case of a possible appeal. The evidence may be saved indefinitely even if it is not consistent with the accepted notion of truth.
Also, this view of data as evidence introduces the notion that the arguments are never settled. We may exhaust our resources for evaluating evidence so that the last decision stands, but we always reserve the option to revisit old cases to reinterpret the evidence when new information becomes available. The argument is perpetual.
To support these arguments, the data needs to be stored in a fashion that does not eliminate its availability. We must permit indefinite storage of data that is not consistent with other data. We must postpone the imposition of integrity constraints on the data so that we retain the opportunity to rearrange the data for new understanding. This demands permitting loose relationships between data, or postponing the imposition of relationships until the evidence is studied directly.
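To illustrate, here is a minimal sketch, with hypothetical alarm and inventory records, of relating data loosely at analysis time rather than enforcing foreign keys at storage time; records that fail to match are kept rather than rejected:

```python
# A minimal sketch of postponed relationships: nothing is enforced at storage,
# and the join is imposed only when an investigator asks a question.
alarms = [
    {"alarm_id": "a1", "host": "srv-07", "time": "2014-07-01T14:05Z"},
    {"alarm_id": "a2", "host": "srv-99", "time": "2014-07-01T14:07Z"},  # no matching inventory record
]
inventory = [
    {"host": "srv-01", "room": "B2"},
    {"host": "srv-07", "room": "B2"},
]

# Relate the data loosely at query time; records that do not match are kept,
# not rejected, because the mismatch itself may be the interesting evidence.
rooms = {rec["host"]: rec["room"] for rec in inventory}
for alarm in alarms:
    print(alarm["alarm_id"], alarm["host"], "room:", rooms.get(alarm["host"], "unknown"))
```

A warehouse would have refused the orphan alarm at load time; here it survives, available for whatever argument eventually explains it.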
In contrast, a fundamental goal of data warehouses is to end arguments. The goal of a single version of truth is to eliminate conflicting facts that could fuel competing sides of an argument. The design of data warehouses includes the goal of meeting ACID and data governance tests to assure the integrity of the data. Big data projects increasingly demand ACID and data governance criteria for the data in their stores. The goal is to select the best data at the start of the process so that only the best data is available for retrieval.
Increasingly, we envision big data as clean data and as a single version of truth. With a clean single version of truth, we can confidently run algorithms that will unambiguously present a solution with a compelling visualization that everyone can immediately appreciate. In order for big data to enable faster decision making, it must present predictive analytics with an unimpeachable version of the truth.
As I observe the crescendo of optimism about how big data will solve all human problems, I imagine that this is actually optimism that big data will end all arguments. In modern times, there is a definite culture of aversion to arguments of any form. For various controversial topics with some degree of conflicting evidence, we hear increasingly frequent assertions that the argument is decided. These assertions rely on an appeal to authority (usually in the form of scientific consensus). With big data, we have the opportunity for a new appeal: the appeal to big data. The comprehensive collection of singular versions of truth confers respect on some well-visualized result of predictive analytics. I described this earlier as the rhetorical fallacy of the appeal to big data.
The prospect of big data (big, clean, consistent, single-version-of-truth data) solving any conceivable problem is in part an expectation that it will end all arguments. We can get down to the business of acting on decisions more quickly because the data is universally accepted as representing the truth.
Modern society's highest desire is to end arguments. We may never solve what makes us miserable, or we may even accept becoming more miserable, but we’d be happier if only we would stop arguing over different versions of the truth. Big data may lead to a secular form of theocracy where we are conditioned not to argue the facts but to accept whatever comes from the interpreters (analytics) of the big book of data.
To return to the title of this post, I earlier proposed crowd data as an alternative name for big data in order to set a different expectation. That earlier description focused on the analogy between large amounts of diverse and rapidly evolving data and crowds of people. The analogy also suggests that crowds can be difficult to understand. For any particular gathering of a crowd, there is likely to be a wide variety of motivations or explanations for why the crowd gathered in the first place, or why it persists. In an earlier post, I described the recent phenomenon of flash mobs, gathered in response to text messages on cell phones, where I believe there will be a major lack of consensus about what is actually going on.
In this post, I wonder which is really more likely to solve problems. Big data, with its emphasis on a comprehensive, clean, single version of truth, offers the prospect of ending arguments. In contrast, the dirty data or crowd data accepted by projects exemplified in my experience by SIEM tools offers the prospect of fostering deeper and more protracted arguments. Ultimately, the question is what will more likely cure the ills we face: the elimination of arguments or the promotion of more arguments.
I recall a lecture on informal logic that characterized the practice of rhetoric. The lecture presented the view that we are usually well served by the practice of argument, free of rhetorical or logical fallacies, to persuade a community toward an action. The entire project is informal because it rests on this final element of persuasion. The lecture asserted that over time the mechanisms of informal logic and rhetorical persuasion more often than not result in beneficial progress despite occasional disasters. The discipline of rhetoric sets out to identify the fallacies that explain many bad persuasions. Many of the rhetorical fallacies involve closing an argument by means other than presenting a compelling case.
The appeal to big data’s single version of truth may turn out to be recognized as a future rhetorical fallacy.
If the project of big data is to shut down debate, then it is not going to cure anything.