This post compares and contrasts my definition of dark data with earlier uses of the same term.
In many of my earlier posts I discussed dark data as non-observed data that is instead generated by models or assumptions. I used the term dark in analogy to astronomy’s dark matter and dark energy: theory predicts they exist, but they have not been observed. We use theories to extrapolate from observations to fill in for missing observations. This happens very frequently, and it deserves separate consideration from actually observed data.
My claim is that dark (model-generated) data does not get the scrutiny it deserves because we trust the theories so much. If the project of data mining is to discover new hypotheses from observed data, then the project is biased by the data being contaminated by previous theories.
All of this is my own free thinking about the problems I encountered working with data. In particular, I was trying to address the need for routine specialized labor to continuously scrutinize data and its transformation path into the data store, in order to catch real-world changes that no longer conform to earlier theories. This is not easy, and yet it rarely gets an adequate budget to be sure the data is good. The result of under-funded scrutiny efforts should be a gradual decline in confidence in the data, if anyone were paying attention.
There is an alternative definition of dark data that predates mine. It refers to actual observed data that is found in data stores but has never been used, tested, validated, or verified. Unlike my definition, where there is at least some confidence in the model generating the missing data, this data just sits in the data store with little if any support for its authenticity.
When I first learned of this definition of dark data, I was reminded of an analogy in telecommunications: dark fiber, an individual strand of fiber that is not yet used, or lit, by a laser source. Another term for dark fiber is unlit fiber. That appears to be very much the sense of this definition of dark data: data that has never been examined. To distinguish the two terms for this post, I will use unlit data to refer to this sense of observed but never examined data. When I refer to unlit data, I am referring to this alternative definition of dark data, which, again, predates mine.
The term unlit data also fits with my definition of bright data as the opposite of dark data. Bright data is well-documented and well-controlled observations that we have high confidence are a measurement of something specific and nothing else. I suggested that real observational data is never perfectly bright, in that it always carries some degree of uncertainty. I proposed the term dim data to refer to real observations that have some problems in documentation or control. But even dim data can be bright compared with missing observations filled in by models: dark data. The term unlit data fits into that continuum. Unlit data is observed data that lacks documentation or control. It is like encountering a new species deep inside a previously unexplored cave: until we found it, it lived without light. It is real, but we know nothing about it.
At this point, I would restate the title of this post as dark data vs unlit data.
I summarized above, and described in more depth in earlier posts, my interest in scrutinizing dark data. For the project of hypothesis discovery, the existence of non-observed but model-generated data can be misleading. More importantly, the model-generated data may use outdated models that can hide changes in reality and thus hide evidence that could be used to create new hypotheses. In particular, I described the use of models to determine exceptional or forbidden data that should be logged as errors instead of introduced into the data store. Perhaps the data is right and the models should be forbidden.
Unlit data has its own problems. Unlit data is very common and it does propagate into historical data stores, sometimes deliberately, sometimes unintentionally. Once in the data store, it is available to the analyst who may assume he can trust it just like any other type of data. This is dangerous because unlit data has never really been documented or validated.
Unintentionally introduced unlit data is usually data that is embedded in a larger record. Consider a data record that is a scanned image of some receipt. The color of the pen ink may be unlit data: it entered the record, but it was never scrutinized because all earlier uses of the information were effectively color-blind. Because this data is available in the historical data store, the analyst may consider querying all receipts using a particular shade of blue pen ink. This can be dangerous. Perhaps the receipts are scanned by different scanners with different quality of color reproduction, or perhaps the scanner has some algorithm that changes colors to distinguish typeface information from hand-written information. We don’t know, because we have no documentation or control of this information. And yet it is available to the analyst. Data available to the analyst will eventually be used by some analyst.
I don’t mean to be too alarmist about this. Typically this kind of data consists of machine-generated codes that have little or no use in operations. Even if we lack documentation of these codes, we can infer that they are used consistently or predictably, so that they may serve some use later on. But the same can be said about data generated from well-trusted models. Most of the time it is what we think it is. The problem is that sometimes it is not, and those times can embarrass us.
The initial data-science engineering of introducing a new source of data involves a detailed inspection of the new data. During this investigation, we seek unique or near-unique identifiers, attributes that can be used to link to other data sources, attributes that could be stable dimensions, or attributes that can be aggregated into some form of measurement. Ideally, we would work from documentation of this data, making it at least dim data. Often, we will instead reverse-engineer the documentation of unlit data based on a sufficient sample of the data.
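The inspection step above can be sketched as a simple column profile. This is a minimal illustration, not the actual tooling from my projects, and the sample records and field names (`rec_id`, `site`, `code`) are hypothetical: a distinct-value ratio near 1.0 hints at a candidate identifier, while a low ratio hints at a stable dimension.

```python
from collections import Counter

def profile_columns(records):
    """For each column in a sample of records, report the ratio of distinct
    values to rows and the most common value. Ratios near 1.0 suggest
    candidate identifiers; low ratios suggest candidate dimensions."""
    n = len(records)
    profile = {}
    for col in records[0]:
        values = [r[col] for r in records]
        top_value, top_count = Counter(values).most_common(1)[0]
        profile[col] = {
            "distinct_ratio": len(set(values)) / n,
            "most_common": top_value,
            "most_common_share": top_count / n,
        }
    return profile

# Hypothetical sample from a new, undocumented source.
sample = [
    {"rec_id": "A1", "site": "north", "code": "07"},
    {"rec_id": "A2", "site": "north", "code": "07"},
    {"rec_id": "A3", "site": "south", "code": "09"},
]
p = profile_columns(sample)
assert p["rec_id"]["distinct_ratio"] == 1.0  # candidate key
assert p["site"]["distinct_ratio"] < 1.0     # candidate dimension
```

A real inspection would of course run over a much larger sample and add type, range, and linkage checks, but the principle is the same: we are building the documentation the source never provided.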
This reverse-engineered documentation permits deliberate introduction of unlit data into the data store. This is part of the art of the practice of data science. When dealing with a wide variety of sources for the same type of data, as well as a large number of types of data, there will be a substantial portion of data that requires reverse-engineering its documentation based on the available sample data.
A large part of my last project involved working with raw data that included a lot of unlit data.
Often this data was ignored. The goal of extracting the verified observational data from a larger set implies filtering out of unlit data. This unlit data never makes it into the historical data store. The analyst will never encounter this data. This can be a good thing because he will not confront unverified data. It could be a bad thing because the data might have been useful at some point in the future.
I note that my project involved what I call a data life cycle of successive stages of data cleansing, aggregation, and validation. My project worked with structured databases that have less room for unlit data than unstructured or free-text stores. I contrast my project with those big data projects involving bulk storage of raw unstructured data. Such data stores do not filter out any unlit data and thus permit access to this data at any time in the future. My focus was on building projects several steps beyond this unstructured data, although I used this unstructured data as a data source. The above-mentioned filtering process extracts verifiable or well-documented information from this type of unstructured bulk data store.
All that said, I still had some types of data that inevitably embedded unlit data. A good example is a human operator’s manual input field, which may include typographic errors, stylistic differences, or grammatical errors that could be explored as unlit data. In fact, I often exploited this embedded unlit information for useful purposes. For example, I used the comparison of styles over multiple entries to show a particular inconsistency, providing feedback to the operational teams to improve their practices. Another example was to compare a machine-generated value with the operator’s intention recorded in the free-form field. Such a comparison often provides a clue that there may be an error, and often we would find the error in the data path of the machine-generated value: for example, the machine-generated value may be an old, out-of-date observation.
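The second example can be sketched in a few lines. This is a hypothetical illustration, not my project’s actual code: it naively pulls the first number the operator typed into the free-form field (`operator_note`) and compares it with the machine-generated reading (`machine_value`); the field names and tolerance are assumptions, and a mismatch is only a clue to investigate, not proof of error.

```python
import re

def flag_stale_readings(records, tolerance=0.5):
    """Compare the machine-generated reading against the first number the
    operator typed in the free-form note. A large gap hints the machine
    value may be stale (e.g., an old out-of-date observation)."""
    flagged = []
    for r in records:
        m = re.search(r"[-+]?\d+(?:\.\d+)?", r["operator_note"])
        if m is None:
            continue  # operator recorded no numeric intention
        intended = float(m.group())
        if abs(intended - float(r["machine_value"])) > tolerance:
            flagged.append((r["rec_id"], r["machine_value"], intended))
    return flagged

records = [
    {"rec_id": 1, "machine_value": "12.0", "operator_note": "reading 12.0 ok"},
    {"rec_id": 2, "machine_value": "7.5", "operator_note": "meter showed 19.5"},
]
assert flag_stale_readings(records) == [(2, "7.5", 19.5)]
```

In practice, each flagged record would go to a human for interpretation; the point of the unlit free-text field is precisely that we cannot trust an automated reading of it.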
These examples suggest that unintentionally added unlit data can be exploited with useful consequences. I agree, but only in the sense that the analyst must be highly skilled and experienced enough to recognize that unlit data is not the same as bright (well documented, controlled, and validated) data. The reason to be cautious about unintentionally included unlit data is the risk of misinterpretation. Such data is not as trustworthy as documented and controlled observations, whether dim or bright.
In the above example about the human operator’s free-text field, there is a need to understand something about the operator’s environment and instructions. Sometimes errors in the free-form field may be the result of instructions the operator must follow. We should not immediately assume that a free-text manual entry field will contain information about the operator’s knowledge. Sometimes the manual entry is for transferring information where automation is not possible. Compared to well-documented data, unlit data requires more diligent research into, and understanding of, its limitations.
The third destiny of unlit data is its intentional introduction in order to exploit it as a key, a dimension, or a measure. I dealt with this problem a lot. I worked with unstructured or partially structured data that was not well documented. Instead I had to infer the documentation from the data itself, often by comparing recent data with historical data to figure out what might have changed due to an unannounced change (replacement, upgrade, or reconfiguration).
This is undocumented data. Often it is useful information that the data source does not support or document, and the source reserves the right to change it at any time without notice. Without any commitment from the source, this is clearly a cause for caution. This is “use at your own risk” data. We gain some confidence in this data through our own documentation and verification through empirical testing. We figure out what the data means by looking at how it appears in the context of the larger data set.
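One minimal form of that empirical testing is to compare the value distribution of an undocumented field between a historical sample and a recent one. The sketch below is an assumption-laden illustration, not my project’s tooling: it uses total-variation distance with an arbitrary threshold, and flags newly appearing values, either of which may indicate an unannounced change at the source.

```python
from collections import Counter

def drift_alert(historical, recent, threshold=0.2):
    """Compare the value distribution of an undocumented field between a
    historical sample and a recent one. New values or a large shift in
    shares suggest an unannounced change at the source. The threshold is
    a tunable assumption, not a universal constant."""
    h, r = Counter(historical), Counter(recent)
    hn, rn = sum(h.values()), sum(r.values())
    values = set(h) | set(r)
    tv = 0.5 * sum(abs(h[v] / hn - r[v] / rn) for v in values)
    new_values = sorted(set(r) - set(h))
    return {"tv_distance": tv, "new_values": new_values,
            "drifted": tv > threshold or bool(new_values)}

historical = ["A"] * 9 + ["B"]      # hypothetical code values
recent = ["A"] * 5 + ["C"] * 5      # "C" appeared without announcement
report = drift_alert(historical, recent)
assert report["new_values"] == ["C"]
assert report["drifted"]
```

A drift alert does not tell us what changed, only that our reverse-engineered documentation may no longer describe reality; a human still has to re-derive the meaning.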
This is often a very fruitful exercise. Using this data can provide information that is not available by any other means. However, unlike well-documented and controlled data, it is much more subject to change without any advance notice. The problem is compounded because such unannounced changes are rare, occurring suddenly after long periods of stability. The stability makes the data useful, but the sudden change makes it risky.
In terms of requirements for routine data science labor, unlit data is much like my definition of dark data. In order to make use of this undocumented data, we need to derive a theory of what it means. The difference is that model-generated data is an invented observation from a highly trusted model. Unlit data is a real observation that is recorded exactly as it is delivered, but its interpretation is based on a poorly trusted empirical model.
Both unlit and dark data are subject to surprises that can occur at any time. Dark data can fail when the real world is doing something that was not predicted by the now-obsolete model. Unlit data can fail when the source changes (without warning) how that data is used or whether it even continues to use it.
Both cases demand a routine labor cost: regular review of the data for anomalies and periodic review of the validity of the models. Much of this review can be automated, but there is inevitably a need for human interpretation to study the data and the models to make sure they are still valid for their intended purposes.
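The automatable half of the model review can be as simple as the sketch below. It is a hypothetical illustration for the dark-data case: where a quantity was once filled in by a model and later actually observed, compare the two; a growing error says the theory behind the fill-ins may be obsolete and should go to a human for re-examination. The error threshold is an assumption.

```python
def model_still_valid(predicted, observed, max_mean_abs_err=1.0):
    """Routine check for dark (model-generated) data: compare the model's
    fill-in values against fresh actual observations of the same quantity.
    Returns (ok, mean_abs_error); ok=False should trigger human review."""
    errs = [abs(p - o) for p, o in zip(predicted, observed)]
    mean_err = sum(errs) / len(errs)
    return mean_err <= max_mean_abs_err, mean_err

# Hypothetical pairing of model fill-ins with later real observations.
ok, err = model_still_valid([10.0, 12.0], [10.2, 11.9])
assert ok  # small error: the model still tracks reality
```

The automation only raises the flag; deciding whether reality changed or the model was always wrong remains the human part of the labor.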