In earlier posts, I chose the term dark data to refer to data we invent to fill a gap, where corrupted or missing data fails to account for something we assume must exist.
In my experience, I worked with data that was often missing some data points. From the very start of the project, it was recognized that the data would always be somewhat incomplete. My approach was to work with what was available and report the data as is. I did measure how complete the data was, but I used that measure only as a grade of quality: the data was graded as green, yellow, or red depending on how complete it was. This worked for my purposes because my focus was on large aggregates based on mapping individual records to a smaller set of categories. My products were analyses of totals for these broad categories representing large populations of individual records. At least for data graded better than red (very poor data collection), the results would be useful for interpreting larger trends.
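To make the grading idea concrete, here is a minimal sketch in Python. The function name and the two cutoffs (95% and 80%) are assumptions chosen only for illustration; they are not the thresholds used on the project.

```python
def grade_completeness(received: int, expected: int,
                       green_at: float = 0.95,
                       yellow_at: float = 0.80) -> str:
    """Grade a data set by the share of expected records that arrived.

    The cutoffs are hypothetical defaults, not values from the project.
    """
    if expected <= 0:
        raise ValueError("expected must be a positive record count")
    ratio = received / expected
    if ratio >= green_at:
        return "green"
    if ratio >= yellow_at:
        return "yellow"
    return "red"


# Example: 850 of an expected 1000 records were collected.
print(grade_completeness(850, 1000))
```

Note what a grade like this does and does not say: it tells you how much is missing, not which records are missing.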
A future topic will cover the issues of dark data at the mapped-reduced aggregate level where the darkness comes from errors or omissions of categories rather than of underlying data.
For this topic, I want to consider another use of the same data at the individual record level. As I noted, the data was prone to omissions. There was a measure of how many records were missing but no direct indication of where the gaps were. Thus direct interpretation of individual records left it ambiguous whether the absence of a data point represented a missing data point or a case of no event to record. In this scenario, the latter interpretation was very useful for understanding a particular problem, so there was a bias toward interpreting missing data as proof that an event did not occur rather than as a sign that the event occurred without being recorded.
This is the scenario that concerns me when I hear about bulk data collection and searching by government agencies. They are definitely looking at the individual record level. Even if they know the data is incomplete, they may have an incentive to assume that the lack of a record means no event occurred to generate one, rather than that the event occurred without being recorded.
The possibility of missing data does not invalidate their efforts, but it does add a level of burden on analysts to carefully study the data to rule out data errors as the problem.
There are many sciences, especially those related to natural or human history, where it is well understood that the evidence is very incomplete and that interpretation of the available evidence needs to be done very carefully. Even when the evidence is highly scrutinized, the results are usually presented tentatively, with an acknowledgement that they may be wrong. Although there are statistical methods and automated tools to simplify the process, the analysis typically requires a lot of effort by trained and experienced scientists to verify the data.
Big data encountered in modern businesses is generally historical data and has problems similar to those of other historical data: missing data, ambiguous data, or erroneous data. The trouble with big data is in its name: big means it is too much for a human to scrutinize each individual record. Useful results from big data involve algorithms that create filters, usually based on some particular case of interest, with the goal of catching that case when it recurs. Given the motivation to catch a recurrence, there is an incentive to make the filter more permissive so as not to miss the recurrence, even though it will also flag instances that are unrelated to what is sought.
A filtered data set is a smaller data set and perhaps at this point it can become a manageable task for a human to interpret. But in practice, this filtered set is given a designation of suspicious requiring further analysis. Everything caught by the filter is essentially suspect until its relevance can be ruled out. Also in practice, the initial filtered (but not analyzed by a human) data set is summarized immediately into dashboards for managers to have situational awareness of at least the potential of something happening.
It is the automation of this level of data that bothers me. Historical data, and especially second-hand data, is prone to a lot of problems that can result in misleading interpretations. Even with the most careful and skillful handling, the results are likely to be as tentative as those in the earlier examples from the natural sciences, which usually involve months or years of analysis.
The goals of bulk data analysis in government have much more immediate applications. The value of these efforts is very time sensitive. This biases the algorithms toward jumping to conclusions before eliminating all possible confounding issues with the data itself.
I believe big data can be useful for policy making where time permits careful analysis working with aggregates of large categories over large time scales. I have more concern about using big data for tactical purposes for immediate actions based on almost entirely automated algorithms. I acknowledge the temptation, but I doubt the value.
The following example is an attempt to describe how dark data (invented data to fill a gap) can mislead.
Imagine a neighborhood where a neighbor has compiled a list of neighbors who agree to share their contact information. The list is helpfully sorted by street addresses. The houses are numbered even on one side and odd on the other side. The even numbered houses on the list are:
The list shows a pattern that suggests two houses are missing: 222 and 226. This suggestion is what I call dark data.
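The same inference is easy to automate, which is part of the problem. The sketch below is only an illustration: the house numbers are hypothetical stand-ins for the list (chosen so that 218 and 230 are present and 222 and 226 are not), and the spacing of 4 between neighboring numbers is an assumption implied by the two suggested gaps.

```python
def missing_numbers(listed: list[int], step: int) -> list[int]:
    """Return the numbers a regular spacing implies but the list lacks.

    The result is an inference from a pattern, not an observation:
    nothing here confirms that such houses exist.
    """
    ordered = sorted(set(listed))
    inferred = []
    for low, high in zip(ordered, ordered[1:]):
        # Every multiple of `step` strictly between two listed neighbors
        # is a number the pattern says "should" be there.
        inferred.extend(range(low + step, high, step))
    return inferred


# Hypothetical even-side house numbers, assumed spacing of 4.
houses = [210, 214, 218, 230, 234]
print(missing_numbers(houses, step=4))  # [222, 226]
```

The function confidently returns 222 and 226, yet the data set contains no evidence that those houses exist. That output is dark data in the sense I use here.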
Perhaps there are two houses but they are unoccupied, or the owners do not wish to share their contact data. This reasonable assumption can raise some interest among the other neighbors.
In this case, it is easy to test by walking the neighborhood and looking at the house numbers. All of the houses are on similar plots and houses 218 and 230 are immediately adjacent properties. There are no unoccupied houses or secretive neighbors.
But there is still a question as to why the numbers were skipped. The mere existence of the gap demands an explanation. Perhaps we suppose this is a recent neighborhood that replaced an older one that included duplexes on the lots currently numbered 218 and 230. A possible explanation is that some flood condemned all the properties and they were all rebuilt, but the duplexes were replaced by single-family houses, and that is how they got numbered. This is another example of dark data. It also raises suspicions: perhaps the neighborhood is still at risk of a flood.
Again, this is easy to verify by checking with the local government’s real estate division to find the original surveys of the properties. Such a check shows that there was no prior neighborhood and that the current numbers appeared in the original submission of the paperwork, with no explanation of the missing numbers.
Another dark data explanation is that the original developer was just superstitious about having a house numbered 222 and thus avoided the even 20s. The neighborhood is very old and anyone involved in that paperwork has long since retired or died. We end up concluding that there is no reason to be concerned about the missing numbers.
The point of my silly example is that a characteristic of dark data is that it requires some manual research to go beyond the provided data set to find more information to test suspicions raised by the dark data.
Manual research is labor intensive and takes specialized knowledge of where to look for answers and how to decide whether the answers are adequate. Dealing with dark data takes time and effort that is consistent with long term policy planning and decision making, but it can be contrary to the expectations (or even justifications) of some programs that desire to exploit big data.