Dark Data is worse than No Data

In my last post, I suggested we may be making an error by assuming a norm of steady cash flow from stable employment. In particular, lower-income people may be making a rational decision when they avoid an annual commitment to a set health insurance premium. Some people live in constant uncertainty about their opportunities for future income.

I intended to tie this to my continuing exploration of what I call dark data. I use dark data to refer to a data point that lacks a direct observation, so we replace it with a value predicted by some model. Dark data is invented data. In the above scenario, we invented the idea that low incomes are steadily low.

In earlier posts, I discussed how dark data and hypotheses belong to the realm of the historical sciences. Scientists who study the historical record are perpetually challenging their hypotheses. They are perpetually suspicious of model-generated data that stands in for missing observations. I’m talking about the science as a whole. An individual scientist may be satisfied with the model and the model-generated data. But eventually his peers or a later generation will challenge those models and the model-generated data (dark data). I’m not saying the challengers will necessarily win, only that they will challenge it with new ideas or new data.

In an earlier post about education, I suggested that what we promote as science in the classroom are the widely accepted hypotheses (theories or laws) and the data they generate. We promote the idea that these are indisputable. We promote the idea that to be good scientists, we must not dispute the indisputable.

I argue the opposite. Hypotheses and hypothesis-generated data are distinct from observed data. The practice of science with hypotheses and their derived data is to always question and challenge them. Even hypotheses that are elevated to the status of scientific laws are constantly being probed and questioned.

What may be indisputable are observations. For example, today is a day in spring, so I can assume there is no snow on the ground. That assumption is open to debate. But if I open the door and see there is indeed snow on the ground, then the debate ends. The assumption of no snow because it is spring is model-generated data. It is fair game for debate. The observation of snow on the ground is indisputable (at least until it melts).

My point in the last post was to suggest that we may have committed the nation to a policy for health insurance markets based on hypotheses about the predictability of people’s income. We assume that everyone is employed, or employable, in jobs with steady, predictable incomes that will allow them to honestly commit to a long-term contract with a specific premium.

If that hypothesis is wrong, then we have made a big mistake.

Over the past century or so, we have grown steadily more confident in our ability to possess highly trustworthy hypotheses. I’d also argue that we have grown steadily less capable of constructively challenging hypotheses, and thus less tolerant of arguing about them.

We are more inclined to make major policy decisions based on our confidence in our models.  

I don’t share that trust in hypotheses.

I trust observations more than I trust hypotheses.  

It would not have been hard to go out and gather observations about the community we want to help and what kind of commitments they are willing to make. Well, it is a little hard because it takes extra effort and expense to collect these observations. But those observations might have better informed us in making a more successful policy.

Gathering new, specific observations is expensive and inconvenient, but they would give us better data to work from.

Because of our confidence in models, we are irresistibly drawn instead to take advantage of their convenience and affordability.

My opinion is that it is often better to accept the ignorance implied by missing data than to presume that model-generated data can replace the missing observations.
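To make the distinction concrete, here is a small sketch in Python. Everything in it is my own illustration: the income figures are made up, and the "model" is nothing more than the average of the months we actually observed. It compares a record that admits its gaps with one where the gaps have been filled by the model.

```python
from statistics import mean, pstdev

# Hypothetical monthly incomes for one household (illustrative numbers only).
# None marks a month for which we have no direct observation.
observed = [1200, None, 300, None, None, 1500]


def fill_with_model(values):
    """Replace each missing month with the mean of the observed months.

    This stand-in model is an assumption made for the example; any
    predictive model would play the same role.
    """
    known = [v for v in values if v is not None]
    estimate = mean(known)
    return [estimate if v is None else v for v in values]


def summarize(values):
    """Report how many values are present, how many are missing, and their spread."""
    known = [v for v in values if v is not None]
    return {
        "present": len(known),
        "missing": len(values) - len(known),
        "mean": mean(known),
        "spread": pstdev(known),
    }


honest = summarize(observed)                 # keeps the gaps visible
dark = summarize(fill_with_model(observed))  # gaps replaced by invented values

print("with gaps:  ", honest)
print("filled in:  ", dark)
```

The filled-in record reports no missing months and a smaller spread than the observations alone. Nothing new was learned about the household, yet the income now looks steadier and better documented than it really is. That is the sense in which I find dark data worse than no data.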


