In this series of blog posts, I have repeatedly invoked the term Dark Data to refer to data that is made up to fill in for observations we wish we had. I used the term dark in the way astronomers use it when talking about dark matter and dark energy. We know something is missing and we fill it in with properties predicted by our models.
I like the term dark because it evokes a sense of caution. It is like the “dark side of the force” in the original Star Wars movie. We don’t know what it is, but we know we should respect its relevance. I suppose later movies attempted to explain it more explicitly, but as much as I liked the original in my late teens, I never felt inclined to watch the sequels. Perhaps that is for the best, because what lingers in my mind is the dark side’s mysterious and inexplicable nature.
This dark data thinking also comes from my background working with big data, where I often found myself defending the time I invested in chasing down missing data and cataloging the endless ways analysis algorithms can gloss over it. Over the span of those writings, I converged on a definition of dark data as data that is generated by models instead of by direct observations.
At times, I suggest that dark data is a deliberate choice. An example is wanting to know yesterday’s daytime high temperature by looking at hourly temperature measurements but discovering a missing measurement at 2 pm. It is possible the default algorithm is to return the largest value among the available observations. In our haste, we may just run with that. A slightly more cautious approach would be to recognize that the missing data falls between the neighboring measurements. We could still report the original result but add an asterisk to note there is some missing data. It is more likely, though, that we will make some kind of assumption to fill in what the temperature must have been based on the available data. One approach would be to fit the data to a curve using a best-fit algorithm that tells us with considerable confidence that the missing temperature is higher than either of its neighbors. We will report that number and remove the asterisk because we trust the model. The dark data is that curve-fit predicted value being used as a replacement for an observation, and its dark side is that it changed the answer we were seeking. Our confidence in our models can end in embarrassment when someone reports that around 1:55 pm a brief thunderstorm produced a micro-burst of cold wind that knocked out power for a while. Well, at least I would find this to be embarrassing.
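To make that concrete, here is a minimal sketch of how such a default imputation might behave, written in Python. The hourly values and the quadratic fit are hypothetical, chosen only to illustrate how a model-predicted value can displace the observed maximum.

```python
import numpy as np

# Hypothetical hourly readings (deg F) from 11 am to 5 pm; 2 pm (hour 14) is missing.
hours = np.array([11, 12, 13, 15, 16, 17])
temps = np.array([78.0, 81.5, 83.2, 83.0, 80.9, 78.4])

# Naive default: report the largest available observation as the daytime high.
naive_high = temps.max()  # 83.2, observed at 1 pm

# Model-based fill-in: fit a quadratic to the curve and predict the 2 pm value.
coeffs = np.polyfit(hours, temps, deg=2)
predicted_2pm = np.polyval(coeffs, 14)

# The predicted value is dark data; using it can change the answer we report.
model_high = max(naive_high, predicted_2pm)
print(f"observed high: {naive_high:.1f}  high with imputed 2 pm: {model_high:.1f}")
```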
It becomes a secondary task of data science to diligently seek out dark data and then isolate it, watching it carefully so it doesn’t embarrass us.
Continuing the metaphor, the opposite of dark data is bright data. I talked about this indirectly in my discussions of how I see the sciences divided into two families: the present-tense sciences, interested in collecting and documenting carefully controlled observations, and the past-tense sciences (historical sciences), interested in carefully evaluating and re-evaluating historical data to arrive at the best reconstruction (or model) of what happened. Bright data is the goal of the present-tense sciences and the treasure of the past-tense sciences. The brightest data is accurately documented for what it measures and carefully controlled to measure exactly what is documented, nothing more and nothing less. Most observed data is at least a little dim, but it is still bright data, however dim.
Dark data is instead the absence of any brightness at all. There is no data at all, or there is some random observation lying around with no documentation. Dark data is a tool used by historical scientists. Sometimes we use it explicitly to make a more easily comprehended narrative: for example, to posit a common ancestor not found in the fossil record but that explains why two species share an adaptive trait. Often, we use it implicitly, as in the case above where a model (a curve-fit algorithm) predicted what the result would be and our protocol is always to use the best-fit result predicted by the model.
Personally, I find dark data to be very scary to the point where it keeps me up at night.
I’m more comfortable working with aggregates of observations. Instead of focusing on questions about individual observations (which may be contaminated by dark data), I like to work with summaries of large groups of observations based on some shared attribute. The attribute is designed to be a bucket that accepts a broad range of individual values, in part so that it captures more individual observations.
In the above example, one attribute may be a time period such as the afternoon hours of 1 pm to 6 pm, and another attribute may be average temperatures in predetermined ranges such as 75–79.9, 80–84.9, and so on.
The goal shifts from studying individual records to studying their aggregates: for example, comparing the aggregates across multiple days or seasons.
To make the problem a little more realistic, consider that the temperatures come hourly from hundreds of weather stations distributed across a broad metropolitan area. One consequence of the broad categories and the large volume of data is that any isolated dark data measurement is overwhelmed by lots of bright data measurements. We still have an incentive to seek out dark data and prevent it from entering the data set, but the occasional introduction of a dark data observation has diminished consequences because we are aggregating results from lots of independent weather stations.
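A minimal sketch of what this bucketing and aggregation might look like, again in Python; the bin edges follow the ranges above, while the synthetic station readings (and the single planted dark value) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical afternoon readings (deg F) from 300 independent stations.
station_temps = rng.normal(loc=82.0, scale=2.0, size=300)

# Suppose one value is dark data: model-imputed rather than observed.
station_temps[0] = 91.0

# Bucket into the predetermined 5-degree ranges: 75-79.9, 80-84.9, and so on.
edges = np.arange(75, 100, 5)
counts, _ = np.histogram(station_temps, bins=edges)
for lo, count in zip(edges[:-1], counts):
    print(f"{lo}-{lo + 4.9}: {count} stations")

# The aggregate barely moves: one dark value is diluted by hundreds of
# bright observations.
print("metro afternoon mean:", round(station_temps.mean(), 2))
```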
It is tempting to say that working with aggregations solves the dark data problem, or at least reduces our concerns about dark data.
But aggregates are really just another type of observation. When dark data exists in the aggregate, it dims the brightness of the derived observation. In some ways the aggregate complicates the problem, because it is easier to overlook the consequences of dark data. In the example above, the storm affected a large part of the metropolitan area on a Friday, but with varying temperature changes and varying times of power restoration. If the aggregate is a single value for the entire afternoon for the entire metropolitan area on a particular day, we are likely to use that value in our broader studies of trends (especially since such reports tend to be automated).
The results may suggest something is special about Friday temperatures. I’ll be a little silly and suggest that we may find this result valuable because of our prejudices about what happens on Fridays, the last day of the workweek. In real-world examples, this kind of jumping to conclusions that reaffirm our suspicions is very likely to occur, often without our even questioning that we may be jumping to a conclusion. In this case, it seems obvious that the day of the workweek would have an influence on urban temperatures.
It takes a lot of investigation to track down that one Friday that corrupted the results, then track down the contributors, and finally show how such a low-level missing observation can account for the broader, misleading conclusion about Fridays.
I mentioned in an earlier post the high-intensity labor demands of dealing with dark data. Oftentimes, we simply don’t budget for that kind of human labor. We let things coast.
There are other ways for aggregates to bring their own versions of dark data.
One way is in the design and definition of the attributes and the bounding values of their categories. Ideally, the category boundaries should capture observations that are strongly related to each other. Picture a scatter plot of individual measurements with well-defined clusters of dense measurements separated by large gaps containing very few measurements. The category boundaries should follow the outlines suggested by the gaps instead of slicing through the middle of a cluster.
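One way to let the data suggest its own boundaries is to cut at the widest gap between sorted measurements. A rough sketch, with made-up readings and the assumption of exactly two clusters:

```python
import numpy as np

# Made-up daily highs (deg F): cloudy days cluster low, sunny days high.
temps = np.sort(np.array([71.2, 72.0, 72.8, 73.1, 74.0,    # cloudy cluster
                          84.5, 85.1, 85.9, 86.4, 87.2]))  # sunny cluster

# Cut the category boundary at the midpoint of the widest gap between
# consecutive sorted values, instead of slicing through a cluster.
gaps = np.diff(temps)
i = int(np.argmax(gaps))
boundary = (temps[i] + temps[i + 1]) / 2
print(f"data-driven boundary: {boundary:.2f}")  # falls in the gap between clusters
```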
The earlier example, with its arbitrary temperature bins and arbitrary definition of the afternoon hours, reflects common practice. Those boundaries encode no knowledge of how the underlying data is scattered relative to them. The information may be available, but it is very difficult to research. In my example of temperature ranges, there may be clustering depending on how cloudy the day is: sunny days cluster around one set of temperatures while cloudy days cluster around another.
If we choose to set boundaries based on observed clusters that change over time, then we have two options. One is to use different category names for the different boundaries used to track the migrating clusters, so that a particular category such as 82.5–91.8 is only used for a few days. The other option is to replace an easily interpreted category name (based on explicit temperature ranges) with a more vague or relative term such as warmest, coolest, and mid-range clusters. Whether we ignore the clustering of the data or try to adapt the categories to the migrating clusters, we cannot escape the difficulties of interpreting the data. The attribute boundary definition introduces some level of dimness or darkness. The boundaries may be so inappropriate that the attribute is analogous to dark data: telling us more about the model than about the data.
Another way for attributes to become dark data in their own right is when the attribute is not directly associated with the observations. In the above example, we may really want to study the patterns of cloudy and sunny days over a period, but we only have temperature measurements. We may choose to use temperatures as a proxy for cloudiness. This is a silly example, but this kind of substitution often occurs in very subtle ways. Often what we are really interested in is not directly observed, or may not even be observable. We nudge the interpretation of the observation to approximate what we are really interested in. This is dark data in the aggregate. Like all dark data, it deserves constant, labor-consuming diligence to be sure the assumptions were reasonable and remain reasonable. But as with most aggregate data projects, this type of labor is often not funded.
My last example of a way for aggregations to have a dark side is the problem of the missing dimension. By dimension, I mean one attribute and its collection of categories. In multi-dimensional databases a dimension may be a collection of related attributes. For this example, either definition works, because I’m referring to a whole concept left out of consideration. Even if it can be reduced to a single attribute, it likely belongs to a completely different set of attributes. For my simplistic temperature measurement example, we don’t have measurements of cloud cover or humidity. Our search for patterns in temperatures may find patterns that are actually better explained by cloud cover or humidity. Because we didn’t include these dimensions in the analysis, we can be misled into explaining the patterns entirely in terms of the temperatures.
This final example is part of what is driving the expansion of big data efforts to use ever more dimensions. We recognize there can never be enough dimensions to associate with measurements. What is exciting is that the technology is allowing us to add more dimensions than we have observations to support. The pressure is on finding more observations or enriching data to populate more dimensions.
As with other dark data, the problem is less about technology and more about labor. High-intensity labor is involved in identifying a missing dimension, finding a reasonable observation to populate that dimension, finding a reasonable way to associate that dimension with the base observations, and maintaining the continued diligence to assure that these conclusions remain valid.
Often, the available budget for big data projects underfunds the labor-intensive activities needed to properly manage dark data, not only in the raw observations but in the aggregations of those observations. We tend to treat big data projects as technology problems where most of the investment goes into getting the right technology in place. We overlook the costs of the past-tense science: the diligent investigation of historical data needed to have confidence in the results.