When I worked with data, my job involved quantifying various known categories. Given the nature of the data, we recognized up front that it was incomplete and that the available data could be categorized incorrectly. We adjusted our expectations so that the results were useful within a relatively narrow context of certain planning activities.
One of the ideas to emerge was to come up with multiple categories of “I don’t know”. We may be ignorant because we know data is missing but don’t know what category it would belong to if it were present, or because we can recognize the data as relevant even though it can’t be assigned to one of the defined categories, or because the data is known to be irrelevant and thus belongs to none of the defined categories. It was useful to have different buckets for ignorance. For instance, these quantities can populate trend lines that show whether this kind of information is improving or degrading. But one thing we tried to make clear was that we were ignorant about this data.
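A minimal sketch of how such buckets might look, in Python, with names invented here for illustration (any real system would use its own taxonomy):

```python
from collections import Counter
from enum import Enum

class Ignorance(Enum):
    """Buckets of 'I don't know' (names invented for illustration)."""
    MISSING_UNCLASSIFIABLE = "known missing; category unknowable"
    RELEVANT_BUT_UNCATEGORIZED = "present and relevant; fits no defined category"
    KNOWN_IRRELEVANT = "present but irrelevant to the defined categories"

def tally_ignorance(records):
    """Count each bucket; per-period counts can feed the trend lines above."""
    return Counter(r["ignorance"] for r in records if r.get("ignorance"))

# Example: two records we are ignorant about, one we are not.
records = [
    {"id": 1, "ignorance": Ignorance.MISSING_UNCLASSIFIABLE},
    {"id": 2, "ignorance": Ignorance.KNOWN_IRRELEVANT},
    {"id": 3, "ignorance": None},
]
print(tally_ignorance(records))
```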
Ignorance about certain data is not dark data. Dark data is concluding something about the data without direct evidence, often through a process of elimination. For instance, suppose you have to measure some objects using a yardstick, and some of the objects end up with no recorded values. Dark data would be to conclude that these objects are larger than the yardstick. It would have been better to assign them to the bin of ignorance and to research explanations for how they could have gone unmeasured. Perhaps the measurement practice was sloppy so that some items got through without being measured, or perhaps an object was too flexible to have a definitive measurement, or perhaps an object was dynamic, its size changing over time. We can find these causes and adjust our measurement processes to account for these possibilities in the future.
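To make the contrast concrete, here is a hedged sketch (the field names and yardstick limit are invented) of the two responses to a missing value:

```python
YARDSTICK_MAX = 36  # inches; upper limit of the instrument (invented value)

def record_measurement(item_id, size=None, suspected_cause=None):
    """Preserve ignorance about a missing measurement instead of imputing it."""
    if size is not None:
        return {"item": item_id, "size": size, "status": "measured"}
    # Dark-data shortcut (avoid): pretend the item exceeded the yardstick.
    #   return {"item": item_id, "size": YARDSTICK_MAX + 1, "status": "measured"}
    # Ignorance bin: keep the gap visible and note a cause worth investigating,
    # e.g. "sloppy practice", "object too flexible", "size changes over time".
    return {"item": item_id, "size": None, "status": "unmeasured",
            "suspected_cause": suspected_cause}
```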
It is surprising to see how quickly we want to replace ignorance with something we strongly suspect. Often the suspicion just seems so obvious that we don’t even question it, as in the example above with the limited yardstick. There may even be a strong incentive to jump to these conclusions when immediate decisions have to be made.
In the example, we may want to use historical measurements to figure out whether all of the objects will fit in a particular container. In this case it is reasonable to conclude that the objects with missing measurements may not fit. This is a different kind of conclusion than a positive measurement showing that something exceeds the dimensions of the container. The assumed measurement is dark data.
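One way to keep the assumed conclusion separate from a measured one is to make the answer three-valued; a minimal sketch, again with invented names:

```python
from typing import Optional

def fits_container(size: Optional[float], container_size: float) -> str:
    """Keep measured conclusions distinct from dark-data assumptions."""
    if size is None:
        # Treating this as "does not fit" would be dark data, not evidence.
        return "unknown"
    return "fits" if size <= container_size else "does not fit"

sizes = [30.0, None, 40.0]  # one object was never measured
print([fits_container(s, 36.0) for s in sizes])
# -> ['fits', 'unknown', 'does not fit']
```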
Dark data can be useful. Continuing this example, perhaps it is impractical to remeasure the items in time to make a decision, and there are other options that do not depend on this measurement. We may eliminate the container option because of the risk that it may not work. Hopefully, we make this decision with the understanding that we are not certain it would not have fit.
Dark data is often treated with more certainty than it deserves. Polar-opposite policies can be justified with different interpretations of the same dark data. Even as we are left guessing which interpretation is more likely, we may overlook the possibility that neither offered interpretation is correct.
I use the term dark data as an analogy to astronomy’s dark energy and dark matter. What we know is that there is an unaccounted-for remainder. We don’t know what constitutes this remainder. It is likely to be something completely unexpected.
Dark data alerts us to the extent of our uncertainty. It can only weaken the case for a particular policy decision, contrary to how it is often used.