In earlier posts, I divided data into two categories: operational and historical. Operational, or present-tense, data is about obtaining the best record of what is occurring in the immediate moment. Ideally, operational data consists of well-documented, controlled observations of the real world, so that the information may be used to influence events in a positive manner. To remain relevant, the operational system must refresh these observations periodically and eventually disregard the obsolete observations. Historical data is archived obsolete operational data, the copy of the data after it becomes irrelevant to influencing current events. I also described historical data as data that can no longer be recreated with a fresh observation from a more optimal perspective or sensor. Historical data are inherently limited by the existing record.
I focused many of my early thoughts of data science on the interpretation of the historical data. I equated historical data to evidence encountered by other disciplines that have very well established methods for scrutinizing the evidence. These disciplines have developed over many generations to specialize in addressing the challenges of available evidence. The nature of scrutiny for recent data encountered by disciplines such as auditing or accident investigations may differ from the scrutiny of ancient data encountered by historians and archaeologists. With older data there are more questions about the interpretation of the data (did words mean the same then as they do now) and there are more predecessor interpretations that need to be reconciled (what did they get wrong). The project of scrutinizing limited evidence is similar for both recent and old evidence.
My thinking of data science has been to focus on data from a historical perspective. I placed data science as a member of the historical sciences. I suggested that all data instantly becomes historical once it gets a time-stamp. While it is possible to repeat a measure of the temperature of boiling water, it is impossible to recreate the measure that has the same time-stamp of the measure taken a minute ago. The time-stamp makes the data historical. I used this definition to defend the notion that human practices (that I call data science) of interpreting historical data should apply even in real time scenarios.
While I continue to hold this view, it is not a winning argument because human scrutiny of data necessarily slows down the project of using data in real time. Real time data needs to get the human data scientist out of the loop.
The recent concepts of big data focus on the technical challenges of handling large volumes of data arriving at high rates (the volume and velocity problem). I attempted to argue that big data is a technical issue of handling the size of the data and that the project of interpreting the data remains a discipline involving trained human labor. The analyst needs tools to handle the data, but he still has the job of discovering and defending interpretations using the time-proven community-based practices of interpreting historical data. These practices involve formal rhetorical skills of building, presenting, and defending arguments based on the available evidence.
This way of thinking about big data from a historical science perspective characterizes my own experience of preparing reports today of what happened yesterday. Yesterday will never be repeated, and all I have to work with is the data that was captured and not lost before being delivered to me. My task was to make the best interpretation of the events of the previous day based on the available data. The focus of this effort was to make long term planning decisions. Long term may take the form of asking: now that we understand what happened yesterday, what should we do today?
Repeated every day, my project was able to keep up with real time, but it was not a real time project. My project made no attempt to influence the current events. Instead it was to influence future events based on past events. The objective of the current data collection was to observe the natural system under the current conditions.
Many modern big data projects are more properly described as real-time projects because they apply interpretations from big data immediately to influence the current events. Early examples of real-time big data include stock-market program trading algorithms and internet advertisement placement algorithms. These algorithms attempt to take advantage of short term trends to obtain beneficial results. To do this, they must process data quickly and execute decisions automatically, without a human approving each individual decision.
Big data for real time purposes is different from big data for historical analysis purposes because the former becomes part of the system being measured. Real-time big data becomes part of the system and fundamentally changes the system dynamics by introducing a feedback loop. A natural system without an artificial feedback behaves differently than that same system with an artificial feedback. In well controlled private systems, we can engineer feedback to be stable and beneficial. This engineering designs a unified feedback approach with the recognition that feedback can become unstable and damaging.
Many big data projects target the common uncontrolled public systems of the general population. There are a lot of competing and independent efforts to exploit feedback to seek advantages from this common system. Each one of these real-time big data feedback mechanisms is fundamentally changing the behavior of the system as a whole. Eventually, the system’s behavior becomes defined by the unknown combinations of independently applied feedback systems. In other words, the collected data entering the big data becomes more about other big data than about how the world would be without these feedback loops. The behavior of the world is fundamentally changed by the behavior of the multitude of applied data algorithms. This is especially true in the context of social systems.
The early success stories of big data solutions used statistical models of uninformed social systems and were successfully applied in secret, without other competing algorithms influencing the results. Now everyone is jumping in, attempting to reuse similar statistical models built on that prior understanding of social systems. The introduction of these feedback loops fundamentally changes the behavior of the system. The feedback itself is a form of informing the social system that was previously uninformed. In addition, the members of the social systems are increasingly becoming aware and informed of the big data projects and are changing their behaviors accordingly. Because the big data projects are rapidly changing the behavior of social systems, they no longer have the advantage of leveraging older social theories that took a long time to develop. The newly introduced feedback mechanisms from real-time big data projects effectively invalidate the prior theories of social behavior. The social world is no longer as well understood.
Using big data for a feedback on social systems will change the behavior of the social system that all of the big data systems will be attempting to measure. The feedback itself has a multiplier effect on information. Depending on implementation, the feedback may attenuate some information and exaggerate other information.
In the examples where a private system has a single, centrally planned form of feedback, the engineer designs the feedback with the right multiplier effect to achieve the desired result of stable and beneficial behavior. In contrast, the modern big data solutions introduce a haphazard and unknowable number of feedback loops with arbitrary and even careless selection of multiplying factors, where each system has a singular objective of private exploitation without regard to stability. From a control systems perspective, this seems to be an invitation for disaster.
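To make the control-systems worry concrete, here is a minimal sketch in Python. It is a toy of my own construction, not a model of any real market or social system: I assume the system state is a single number, that each feedback simply multiplies the current state by its own gain, and that independent feedbacks add together. The function name `simulate` and all of the gain values are hypothetical.

```python
def simulate(gains, steps=50, x0=1.0):
    """Iterate x[t+1] = (sum of gains) * x[t] and return the trajectory.

    Assumption: a linear toy system in which every independent feedback
    contributes its own multiplier and the contributions simply add up.
    """
    total_gain = sum(gains)
    trajectory = [x0]
    for _ in range(steps):
        trajectory.append(total_gain * trajectory[-1])
    return trajectory


# One centrally engineered feedback with a multiplier below 1:
# an initial disturbance dies away.
single = simulate([0.5])

# Five uncoordinated actors, each choosing a modest multiplier of 0.3.
# No single actor looks dangerous, but the combined multiplier exceeds 1.
crowd = simulate([0.3] * 5)

print("single feedback settles:", abs(single[-1]) < 1e-6)
print("combined feedback diverges:", abs(crowd[-1]) > 1e6)
```

In this toy, stability depends only on the combined multiplier, which no individual participant chooses or even sees. Real systems are nonlinear and delayed, so this is an illustration of the concern rather than a prediction.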
I mentioned earlier that big data measurements of a common social system will increasingly measure the effects of feedback from other big data systems. There may be some comfort in the fact that the other systems strive to obtain accurate observations. Accurate measurements of a system modified by feedback based on accurate measurements suggest that a successive feedback is not a new or competing feedback but instead a modification of a single feedback. While it might be difficult to model mathematically, we can aggregate all of these implementations as a single feedback mechanism for the social system that uses accurate observations of the fundamental behavior defined by human social and psychological truths. An additional big data project will only modify this existing feedback.
This is occurring in the now old techniques for program trading of stock markets where the programs now are anticipating what the other programs are doing in order to find a way to exploit an optimization opportunity made possible by the competitor’s optimization algorithm. In the end, there is one feedback for one system (the overall market) but no one is quite sure exactly what the feedback looks like.
The same thing can happen in influencing social behaviors in general. The different marketing activities are increasingly taking into account the fact that the measured behaviors are being influenced by competing marketing activities. Again, it is conceivable that we can trace back the various feedback systems to satisfy ourselves that ultimately the measurements capture the reality of true human social behaviors.
I distinguish a historical big data solution from a real-time big data solution. They really are very different projects. For historical big data solutions, we have the opportunity to invest in scrutiny to attack and defend the evidence in order to come up with the most accepted version of what really happened. In contrast, real-time big data solutions require application of immediately obtained data to immediate events. Real-time big data cannot afford real-time data scrutiny. Real time data projects must act immediately on the data, trusting in the wisdom and foresight of the developers of the system.
One key distinction of the two projects is the tolerance for missing data. The disciplines of historical data analysis include the practices for identifying and handling missing data. We have methods for accounting for missing observations due to sensor failure or mishandling of the data before it arrives at the data store. We employ methods to recognize that entire perspectives are missing where a better placed sensor or witness would have provided key information. We use the missing data in our analysis by the explicit recognition that the data is missing. We scrutinize the available data with explicit acknowledgement that we are missing data.
Historical data analysis makes its recommendations from long-duration data collections that are free from our interference. In my previous assignment, as I described, I passively observed yesterday’s reality so I could analyze it in total in order to recommend what to do tomorrow. Time frames may vary from days to hours or months, but the concept is the same. Allow the system to behave without interference in order to collect data about the natural system, and then make decisions to influence the system’s natural behavior. An ideal recommendation from a historical data analysis is one that permanently changes the behavior so we no longer have to interfere. An example may be a flood-control system that only needs to be designed once.
Real time data systems use analytic algorithms to prepare predictive or prescriptive results to apply to the current data. These algorithms will automatically handle missing data in order to complete the algorithm. Often the real time data systems will introduce manufactured data to fill in the gaps in the data in order to complete the analysis. Manufactured data is data that is not observed but instead computed from pre-existing theories about how the world ought to work given the data that has been observed. Model generated data may include interpolation or extrapolation of curve-fitted data, or it may include dynamic simulations to fill in data by dead reckoning. The models may introduce data explicitly by adding data points to the data provided to the algorithms, or implicitly within the algorithm itself. Model generated data has the potential of producing a different result than what would have happened if we had a real observation.
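As a rough illustration of what I mean by manufactured data, the following Python sketch fills gaps in a short series two ways: by linear interpolation between observed neighbors, and by dead reckoning from the last observation. The function names, the sample readings, and the assumed `velocity` are all hypothetical choices for this example. Missing observations are represented by `None`.

```python
def fill_by_interpolation(values):
    """Fill None gaps that lie between two observations with a straight line.

    Leading or trailing gaps are left as None because there is no second
    observation to interpolate toward.
    """
    filled = list(values)
    observed = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(observed, observed[1:]):
        span = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / span
            filled[i] = values[left] + fraction * (values[right] - values[left])
    return filled


def fill_by_dead_reckoning(values, velocity):
    """Fill None gaps by advancing the last known value at an assumed rate.

    The velocity is a model assumption, not an observation.
    """
    filled = []
    last = None
    for v in values:
        if v is None and last is not None:
            last = last + velocity  # manufactured value
            filled.append(last)
        else:
            filled.append(v)
            last = v
    return filled


readings = [10.0, None, None, None, 18.0, None]
manufactured = [v is None for v in readings]  # remember which points were never observed

print(fill_by_interpolation(readings))
print(fill_by_dead_reckoning(readings, velocity=2.0))
print(manufactured)
```

Once the gaps are filled, nothing in the numbers themselves distinguishes an observation from a guess. Keeping a separate record such as `manufactured` is one way to retain the explicit acknowledgement of missing data that historical analysis relies on, though a real-time algorithm may simply discard it.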
I noted earlier that the real-time big data solutions introduce feedback loops into the current events so that future events are a combination of the natural system and the artificially introduced feedback. The real-time big data changes the reality that we are attempting to measure in a fundamentally more efficient way than is possible with historical big data analysis. Initially, this seems reasonable because the big data system is using actual observed data that theoretically is accurate and trusted. Although we are introducing feedback, we base that feedback on information observed earlier from reality. The problem is that some of the information fed back is manufactured data rather than observed data. The feedback will include model-generated data to substitute for missing observations from gaps in an observing sensor or from entirely missing observing sensors. The predictive algorithms will need manufactured data to fill in for missing data in order to complete their computations. The result is a feedback signal that includes artificial information. Subsequent measurements will be a mixture of the natural system plus the artificial data. These subsequent measurements will be treated as reality but that reality now is of an artificial feedback-controlled system rather than a previously understood natural system.
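The following Python sketch is one hypothetical way to picture this contamination. I assume a toy system whose state decays toward zero when left alone, a feedback that pushes the state in proportion to whatever signal it is given, and a model that substitutes a fixed guess whenever the sensor drops out. The function `run` and the values of `decay`, `gain`, and `model_guess` are invented for illustration.

```python
def run(sensor_up, model_guess=1.0, decay=0.5, gain=0.4, x0=0.0):
    """Simulate x[t+1] = decay * x[t] + gain * signal[t].

    signal[t] is the true state when the sensor is up, and the model's
    manufactured guess when the sensor is down (an assumed imputation rule).
    """
    x = x0
    history = []
    for up in sensor_up:
        signal = x if up else model_guess
        x = decay * x + gain * signal
        history.append(x)
    return history


# Sensor always available: the quiet system stays quiet.
always_up = run([True] * 10)

# Two dropped readings early on, then eight genuine observations.
with_gaps = run([True, True, False, False] + [True] * 6)

print(always_up)
print([round(v, 3) for v in with_gaps])
```

In the second run, every reading after the gap is a real observation, yet each one reflects activity that the manufactured guess set in motion. An analyst looking only at those later readings would see genuine measurements of a system that is no longer the natural one.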
The contrast between real-time big data analysis and historical data analysis may be illustrated allegorically by comparing the modern phenomenon of 24-hour news programming with the older tradition of deliberative investigations. An example may be a criminal act. In an earlier post I described a minor crime of multiple instances of people discovering their cars tipped over in their parking spots. In that post, I noted the indisputable evidence that the cars were tipped over while in their parking spots. But there was missing data in that we don’t know how it happened. Given the recent mild weather and seismic conditions, we can infer that some human actors were involved. This presumption alone introduces a data point (humans did the tipping) that was not actually observed. In the post I pointed out a more controversial observation of seeing a group of individuals in the vicinity of a recently tipped car. The manufactured data point is that this was being done by groups of people who have a certain motivation.
The difference between real-time and historical data is illustrated by the difference in how the above observations are handled by the 24-hour news reporting cycle and by the more deliberate process of traditional criminal or civil investigations. The 24-hour news reporting will (and did) report the suspicious behaviors of the coincidentally observed group of individuals to at least speculate that this is motivated by an animosity toward a particular make of vehicle. This is one of many possible explanations, including a rivalry between individuals where one group happens to own a particular type of car. Including the speculation in a potentially popular news story introduces the possibility of a feedback by suggesting a trend that in fact had not yet started. The feedback of the news would transform a random event of a feud between individuals into a method of expressing an opinion about a particular vehicle. If the trend had grown there would be more news stories to confirm it. The reporting of the manufactured data as part of the news would have provided the root cause of the trend that did not exist before that reporting occurred. The subsequent news cycles would have even more reports of car tipping to report. They would be reporting real observations that were derived from an artificial piece of information.
I contrast the news reporting (as an analogy to real-time big data) with the traditional investigation approach that would gather evidence and witness testimony and then scrutinize that evidence along with the known missing evidence to conclude what might have happened. Even if the conclusion is similar, the conclusion occurs much later and probably would be contradicted by the fact that the intervening time showed no evidence of an ongoing or spreading trend. The conclusion would also be presented with an acknowledgement of the uncertainty involved, further blunting the possible explanations. While the traditional approach analyzes the available evidence, the system continues to operate naturally without influence from the investigation. Subsequent measurements will not be tainted by feedback information as they would be with the news reporting.
While a creation of a trend for tipping a relatively rare car and causing relatively minor damage is hardly a disaster, it is easy to see more drastic and nearly disastrous results being generated by news cycles introducing manufactured information to fill in missing data that will require weeks or months of investigations to conclusively resolve. The manufactured data based on models and presumptions are generating new newsworthy events that reinforce the models and presumptions that may in fact be completely unrelated to what actually transpired in the first place. By injecting manufactured information, the news reporting will produce news that confirms that manufactured information.
Feedback can magnify or attenuate a particular piece of information. An example of a magnifying feedback follows. A manufactured piece of information about distrust of legal and government institutions can provoke riots and riot responses that in turn reinforce the suspicion of distrust in the legal and government institutions. While there may be imperfections in the systems of law and government, they are probably not so bad as to require complete dismantling. Yet the introduction and magnification of manufactured data can ultimately cause the entire system to fall apart with disastrous results.
Again, I’m using the news cycle versus legal processes as an analogy to big data but I think the same effects are possible for the same reasons. As news reports are subject to immediate unverified speculation by on-camera reporters, so too are predictive algorithms of real-time big data subject to immediate and unverified speculation by algorithm developers given the excessively lofty title of being data scientists. The algorithm will include biases about how the world should work and these biases will introduce manufactured data to fill in the missing data. This manufactured data will enter as a feedback signal into the system that could respond in a way that conforms to the manufactured information.
The risk of real-time big data is the possibility of encouraging disastrous outcomes based on conditions that it manufactured, outcomes that would not have happened had the systems not been in place.