Labor intensity dealing with dark data

I count myself among those who are tracking the evolving story of the missing Malaysia Airlines flight 370 and for the same reasons others are.

I have mixed views of the news coverage. It is annoying to see the news report the latest hypotheses as facts, sometimes voiced by uninformed experts. On the other hand, about the only way I am learning about the investigation is that the news has figured out how to spin a story of the day to satisfy its market.

I appreciate following the evolving investigation. I watch with great interest the entire cycle: new information slowly emerges, the theories change, and the changed theories point to new places to look for more data. The progress allays my fears that it may end up being completely unsolved, although I still fear finding out what exactly happened. As of today, a lithium-battery fire came up as a possible culprit, and that is scary because lithium batteries are everywhere.

But in addition to the specific case of the missing flight, I also follow this as a student of data science. This is an extreme example of the resources required to fill in missing data. And by resources, I mean in particular the manpower involved. The number of individuals and the extraordinary breadth of expertise employed in this investigation are mind-boggling.

Even some automated data, such as the satellite data, required human analysis to work out a solution to a problem the data was not originally designed to solve. I don’t know the details, but I would guess someone did some quick ad hoc analysis rather than clicking on some pre-existing report. This single fact alone impressed me. My congratulations either way: either to the person who worked out the ad hoc solution or to the designer who had the foresight to prepare a readily available tool for this kind of event.

This is an example of what I’ve been trying to describe as the dark data problem. Missing data becomes dark data when we run out of alternative data options for defining it. In analogy to the above scenario, it will become dark when all leads are exhausted and the entire event falls into murky explanations of competing theories, conspiracies, or fantasies. I still have hope that this can be avoided.

On a personal note, one of the more damaging criticisms of my prior work is that it required a lot (and to some an unbelievable amount) of manual effort to fill in the data that I went to great lengths to identify as missing. I guess the criticism could mean either that I could have been a better scientist and come up with more automated approaches, or that I was unnecessarily diligent at exposing the problem. In any event I took the criticism personally but constructively, as motivation to find better ways. I take for granted that this particular project would inevitably encounter missing data, and that missing data will always require a lot of manual work to address.

Eventually I began to realize that this is really not unique to me or my little job. I was working with historical data. Data from history. There is a profession out there called historians. They’ve been around for centuries. They deal with historical data all the time and the vast majority of the data they want is missing. Entire careers are spent filling in even a small gap, either through finding direct evidence or through piecing together enough corroborating evidence to provide reasonable confidence in the result.

Working with historical data is probably always the same. There may be some processes that are so clean and well designed that they can report historical data without any risk of missing data. I suspect such processes are rare, and they would probably be even rarer if the designers spent more time thinking about what can go wrong with data.

It is great to think of IT systems as automation, as labor-saving devices. But when it comes to analysis of historical data, it seems there is an inherent limit to automation until we can figure out time travel. What cannot be automated has to be done manually. For historical data, the manual work is tedious and slow, and made all the more so by the need to coordinate with multiple external experts.

At the same time, this work cannot be ignored. In the case of the missing flight, we all very much want to know the exact fate of this particular flight. In my experience, the motivation was mostly to find a root cause so that the problem could be avoided in the future.

One of my earlier posts was about taking history seriously.   I think it could be rephrased as taking historical data seriously.   We must work hard to identify missing data and even harder to exhaust all possibilities to support a conclusion of what was missing.

I think this issue is vastly under-appreciated by the big data community. In most cases, the bigness of big data is because it is historical data. Historical data almost certainly has missing data, especially as it spans long time periods. It is great to automate map-reduce algorithms for snappy queries and reports, but the data science job doesn’t stop there.

A far bigger task is to track down and resolve the missing data.  It is a tough job, but a necessary one if you really take history seriously.
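
To make the distinction concrete, here is a minimal sketch of the part that can be automated: locating the gaps in a record that was supposed to arrive at regular intervals. The function name `find_gaps`, the hourly interval, and the sample timestamps are all hypothetical and chosen only for illustration. Nothing here resolves a gap; it only tells the analyst where the manual work begins.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected_interval):
    """Return (last_seen, next_seen, missing_count) for every gap.

    A gap is any pair of consecutive readings that are further apart
    than expected_interval. missing_count is the number of whole
    readings that should have arrived in between.
    """
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        delta = later - earlier
        if delta > expected_interval:
            # timedelta // timedelta yields an int
            missing = delta // expected_interval - 1
            gaps.append((earlier, later, missing))
    return gaps

if __name__ == "__main__":
    # Hypothetical hourly readings with hours 2 and 3 absent.
    readings = [datetime(2014, 1, 1, hour) for hour in (0, 1, 4, 5)]
    for last_seen, next_seen, missing in find_gaps(readings, timedelta(hours=1)):
        print(f"{missing} reading(s) missing between {last_seen} and {next_seen}")
```

Finding the gap is the cheap step. Deciding what actually happened during those missing hours is where the people and the expertise come in.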

Update on March 25: This presentation of the latest information on the missing flight impressed me. The report emphasizes that it was a mathematical innovation. I agree math is involved, but that focus overshadows the greater effort of exploiting available data and applying it to a specific problem. It appears to be a strong conclusion, but even if it is later contradicted by other evidence, I applaud the individuals who identified a way to apply existing data to an unanticipated event, and then put in the effort to work it out.

