Many of my earlier discussions about data science focused on dark data as deserving continuous scrutiny. Today I want to talk more about the idea of forbidden data. Like dark data, it has a valuable role in data science, but it also deserves continuous scrutiny.
I described dark data as something we invent to fill a gap in observations. I use the term dark to suggest it deserves scrutiny. Most commonly we don’t make a distinction at all. We may fill a missing value with a value generated from a model that we trust.
Often, all of the recorded values are smoothed or curve-fit values of underlying observations so every data point is model-generated. We make this choice because we decided we trust the model more than we trust the observations.
I use the term bright data to refer to very well documented measurements in well controlled settings. We can have confidence that the measurement is of the specific phenomenon and nothing but that phenomenon. This is a high bar to meet, so most data has some degree of uncertainty. I call this less trusted observation dim data: it offers light but not at the intensity we desire.
A lot of data in historical data systems is dark data. The values admitted to the data system are smoothed or transformed versions of the actual observations. Sometimes all of the values are generated values that deviate to various degrees from the observed data point. In such cases, we may attach no significance at all to an interpolation of a missing observation, or we may not even recognize there is a missing observation.
Earlier I described a weather station scenario where temperature measurements are made periodically throughout the day. If measurements are frequent enough, we expect that a temperature observation should be within a narrow range of the values of its neighboring observations. A missing observation may be filled in with the average of the neighboring observations. Alternatively, the recorded observations are always averaged, such as when we use a moving average of several successive measurements. In either case we are choosing to use a model to provide data points instead of using actual observations.
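The two weather-station choices above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names, the use of None for a missing reading, and the trailing window are hypothetical, not a description of any particular station's software.

```python
def fill_missing(readings):
    """Fill each isolated gap (None) with the mean of its two neighbors.

    Returns the filled series plus a parallel list of flags marking which
    values are model-generated (dark) rather than observed.
    """
    filled = list(readings)
    is_dark = [False] * len(readings)
    for i in range(1, len(readings) - 1):
        left, here, right = readings[i - 1], readings[i], readings[i + 1]
        if here is None and left is not None and right is not None:
            filled[i] = (left + right) / 2
            is_dark[i] = True
    return filled, is_dark


def moving_average(readings, window=3):
    """Trailing moving average: every output value is model-generated."""
    if window < 1 or window > len(readings):
        raise ValueError("window must be between 1 and len(readings)")
    return [
        sum(readings[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(readings))
    ]
```

Keeping the is_dark flags next to the filled values is one way to preserve the distinction between observed and invented data points that we otherwise tend to lose.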
In practice, observations frequently deviate significantly from what we expect from our models. We may attribute the deviation to measurement errors or noise. We use smoothing in an attempt to remove these errors so they do not interfere with future analysis or interpretation of the data.
Plots of smoothed values against actual observations rarely show perfect alignment of the two. Because we use the smoothed values instead of the observed values, the amount of the deviation from the observation is rejected. With enough experience with a model, we usually don’t even pay attention to these deviations. The analysis products of the smoothed data are reliable and useful.
I distinguish this as dark data in order to emphasize that it deserves continuous scrutiny. A common scenario is where initial testing shows the deviations from the smoothing to be acceptable, and over time we come to expect deviations. Later we may dismiss any deviations as expected and may miss something important. The original assessment of the acceptability of deviations may no longer be valid, but we are so accustomed to deviations that we don’t notice.
Another way to look at the deviations of observation from the admitted smoothed data is to call the deviation part forbidden data. We forbid a portion of the measurement because it doesn’t meet our expectations. Sometimes we reject the entire observation as an outlier, often with no further explanation, especially when we are rejecting a single observation or a trivially small minority of observations.
I want to attach value to those deviations and outliers by giving them the name forbidden data.
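As a minimal sketch of this split, assuming we still have both the raw observation and the smoothed value that was admitted, the forbidden part is simply the difference between the two. The class and function names here are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitObservation:
    observed: float  # what the instrument reported
    admitted: float  # smoothed value that enters the data set

    @property
    def forbidden(self) -> float:
        """The part of the observation that the smoothed record leaves out."""
        return self.observed - self.admitted


def split_series(observed, admitted):
    """Pair each observation with its admitted value so the deviation is kept."""
    if len(observed) != len(admitted):
        raise ValueError("observed and admitted must have the same length")
    return [SplitObservation(o, a) for o, a in zip(observed, admitted)]
```

For example, split_series([20.0, 21.5, 30.0], [20.0, 21.0, 22.0]) yields forbidden parts of 0.0, 0.5 and 8.0. The last one is the kind of deviation that is easy to discard without a record.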
Forbidden data often occurs in multi-dimensional data where the combination of properties presents a contradiction. For example, a fresh accumulation of a foot of snow is not consistent with a summer day with a minimum temperature of 90 degrees. If we encounter such an observation we don’t want to admit it into the data set for analysis.
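A consistency rule of this kind might look like the sketch below. The field names and the 40 degree cutoff are assumptions I made up for illustration. The point is that the rule itself is a model, and that the rejected record is set aside with its reason rather than silently dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DailyWeather:
    min_temp_f: float     # lowest temperature of the day, degrees Fahrenheit
    fresh_snow_in: float  # new snow accumulation, inches


def contradictions(day, max_snow_temp_f=40.0):
    """Return the reasons this record contradicts our expectations.

    max_snow_temp_f is a hypothetical cutoff: fresh snow on a day whose
    minimum temperature stayed above it is treated as a contradiction.
    """
    reasons = []
    if day.fresh_snow_in > 0 and day.min_temp_f > max_snow_temp_f:
        reasons.append("fresh snow reported on a day that never got cold")
    return reasons


def partition(days, max_snow_temp_f=40.0):
    """Split records into admitted data and forbidden data (with reasons)."""
    admitted, forbidden = [], []
    for day in days:
        reasons = contradictions(day, max_snow_temp_f)
        if reasons:
            forbidden.append((day, reasons))
        else:
            admitted.append(day)
    return admitted, forbidden
```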
In contrast to dark data, forbidden data is an actual observation. It arrived from established methods of collecting observations. While dark data is invented data that we use to fill in for observation data, forbidden data is observed data that we invent reasons to keep out of our data set. In both cases, we use models.
Forbidden data is useful for data quality purposes. If summary statistics of these deviations are sufficiently small, this information can help build confidence in the data as a whole. We may observe these summaries over time to convince ourselves that the deviations are stable and reasonable.
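One way to watch these summaries over time is sketched below, assuming the deviations are kept as plain lists of numbers. The choice of statistics and the tolerance factor of 2.0 are my own illustrative assumptions, not a recommended standard.

```python
import statistics


def summarize(deviations):
    """Summary statistics of the forbidden part of a batch of observations."""
    return {
        "count": len(deviations),
        "mean": statistics.mean(deviations),
        "spread": statistics.pstdev(deviations),
        "max_abs": max(abs(d) for d in deviations),
    }


def is_stable(baseline, recent, tolerance=2.0):
    """True if the recent spread stays within `tolerance` times the baseline spread.

    `tolerance` is a hypothetical setting; what counts as stable is a
    judgment that itself deserves periodic review.
    """
    return summarize(recent)["spread"] <= tolerance * summarize(baseline)["spread"]
```

A False result from is_stable is the kind of signal that should prompt a person to look at the observations again, not just the model.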
Unusual or unacceptable deviations should trigger additional investigation efforts to find out root causes and find solutions to restore the expected trend. We may continue to feed the generated data to our data set while this investigation is ongoing or we may shut down the data feed until the problem can be addressed. If we continue to accept data despite unacceptable deviations, then we risk the possibility of needing to issue embarrassing corrections for any analysis based on the data. Alternatively, if we shut down the data feed until the investigation is complete, then we risk missing reporting deadlines.
This normal handling of forbidden data is an integral part of data science. Like dark data, it usually requires significant human labor to perform the investigation. Like dark data, we often under-budget for this part of the effort. Our preference for automated solutions may lead us to ignore the rejected forbidden data entirely. This preference is motivated by the desire to scale to even larger data sets or combinations of data sets.
In recent years, there have been tremendous improvements in technologies to handle ever larger data sets. These include both physical capacity and robust software algorithms. Big data solutions have grown to fill the capacity of the new technology.
My concern is that the limiting factor for the size of data is the human labor for continued scrutiny and investigation of dark and forbidden data. In both cases, we need to continually ask whether the models are still appropriate for the data.
A popular anecdote is the story of the black swan. At one time, it was known that all swans were white. Thus any observation of a black swan would be a forbidden observation. Then we discovered there is a population of black swans. Technology didn’t discover the reality of black swans. People did.
Many big data projects put too much trust in the technology. Historical data is data that at one time had some operational purpose. The general sense of the success of that process provides confidence in the data.
There is an inherent difference in goals of historical data and operational data. We transform recorded operational data into a form compatible for historical data. That transformation may involve introducing model-generated data or rejecting forbidden data.
I described earlier a good reason to identify and reject forbidden data: it can be used to trigger investigations into the cause of the errors. That value is only realized if those investigations actually occur.
Ideally in my mind, there would be a continuous process to investigate every deviation or rejected data point. A lot of this can be automated by matching the deviation or forbidden value to an accepted model that accurately accounts for that error. Triggering a human-labor investigation may involve setting some threshold on what justifies that expense. The pressure to scale to ever larger data sets biases us to set that threshold too high so that it is rarely exceeded.
We may set that threshold so high that we miss the black swan. We reject from our data set a legitimate observation of a swan that happens to have a formerly forbidden color.
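The triage I have in mind could be sketched as follows. Everything here is hypothetical: the explanation names, the predicates and the threshold values are placeholders for whatever accepted error models and budget a real project has.

```python
def triage(deviation, explanations, threshold):
    """Decide what happens to one deviation or forbidden value.

    explanations: list of (name, predicate) pairs, where each predicate is a
    hypothetical accepted error model that returns True if it accounts for
    the deviation.
    threshold: size of an unexplained deviation that justifies the expense
    of a human investigation.
    """
    for name, accounts_for in explanations:
        if accounts_for(deviation):
            return "explained:" + name
    if abs(deviation) > threshold:
        return "investigate"
    return "logged"


# Example: one accepted error model that covers small rounding differences.
known = [("sensor_rounding", lambda d: abs(d) <= 0.5)]

print(triage(0.25, known, threshold=5.0))  # handled automatically
print(triage(8.0, known, threshold=5.0))   # goes to a person
print(triage(8.0, known, threshold=10.0))  # same deviation, never examined
```

The last call shows the bias described above: raising the threshold does not make the deviation go away, it only makes sure nobody looks at it.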
There is also a benefit to frequent investigations even when they confirm the results are acceptable. Frequent investigations allow investigators to practice their skills and become even more familiar with the details of various explanations for deviations. In contrast, a rarely activated team of investigators, with less practiced skills, may waste time on unproductive paths of inquiry that a more experienced investigator would dismiss immediately. Ideally, an investigation team should be kept busy investigating differences between observations and models.
In the ways big data systems are used today, the risks are far higher than the simple example of missing a black swan. In our rush to push the envelope as to how large we can make the data set, we are likely to underestimate or under-budget the human-labor cost to make sure that data is trustworthy. The cost of these solutions needs to be justified by showing some contribution to some large scale mission. The rush to exploit the data for decision making may risk damages that might not otherwise have occurred without that data.
There is a reason to distinguish observed data from model-generated or model-rejected data. Models may earn respect over time by being very robust and reliable. However, models can become obsolete.
Raw observed data deserves its own kind of respect. It may be subject to errors or uncertainties of what exactly is being observed. But there is an inherent value to its being observed. There should be an effort to reconcile the differences of observations with models and compare the different error possibilities of observations and models.
There is another motivation to pay attention to the differences between observations and model-generated or model-rejected data. Even if we are satisfied with the appropriateness of the data for the mission it supports, we could benefit by discovering new hypotheses that can promote future innovations.
My previous post on evolution may provide an illustrative example. At the end of that post, I referenced a proposed hybridization hypothesis of the origin of the human species. Underlying that hypothesis is an older debate about the nature of evolution. On one side is the theory that suggests gradual incremental changes until new species arise. On the other side is the fossil record that suggests the sudden introduction of new forms that persist largely unchanged until they become extinct. The hybrid hypothesis is a possible mechanism for introducing dramatically different forms. Whether or not the hybrid hypothesis works for humans is open for debate. My observation is that this opportunity to debate the model made possible the introduction of a new hypothesis that leads to a new inquiry to thoroughly itemize similarities with more than one other species.
The way we learn new things is that we allow ourselves to challenge model-generated or model-rejected data with observational data. This is a human activity in the area of a rich tradition of historical science. Big data is historical data. Data science is a part of the tradition of the science of history. That tradition is to demand attention to be paid to observations.