Distinguishing dark data and predictive modeling roles in decision making

In my last post, I argued that for evidence-based decision-making, model-generated data (dark data) is as harmful as fears and doubts because both are based on a lack of observational data. We accept dark data because of our trust in the models that produce it, typically scientifically tested theories. When we use dark data to take the place of missing data, it becomes the equivalent of fears and doubts: a substitute for ignorance. In either case, we lack actual observations.

As an example of dark data, I suggested the invented fact that Ebola will not be as troublesome in USA because conditions in USA are very different from conditions in West Africa. Until the first case of a patient developing symptoms of Ebola in USA, we had zero evidence that Ebola’s consequences would be any different here than in West Africa. We filled in our ignorance with models that showed that the conditions that make Ebola difficult in West Africa do not exist here, and thus Ebola would not be as difficult here. When we finally observed an Ebola case, we observed the same fatal consequences for the patient and the same 2:1 rate of contagion. These consequences were not predicted by the model that USA is not like West Africa.

Although on this site I have raised many concerns about dark data substituting for observations, in my last post I made a new point of equating dark data that we readily accept in practice with fears and doubts that we readily ridicule and reject.   My personal opinion is that decision making exists because of inescapable ignorance.   When we have absolute confidence in evidence, we don’t need decision-makers.  With modern technologies we automate everything that has access to complete evidence.   We employ decision makers because we recognize some key evidence is missing.

I argue we need human decision makers because I want human judgement to weigh doubts and fears against the available evidence to come up with a decision.  I want that decision maker to be human so I can relate to his thinking and decide whether the judgement is competent and in good faith.   We have decision makers because we expect bad consequences may happen.  When those bad consequences do happen, we want reassurance that the judgement was understandable.  We want to be convinced we would make a similar decision given the information available at the time.   That persuasion is possible because the decision came from a human peer.

Real world policy decisions are not purely evidence-based.   We do allow models to provide data to substitute for ignorance.   We should also allow a role for fears and doubts especially as a substitute for the unknown unknowns.   That is my opinion, at least.  Take it for what it is worth, which is not much.

As I mentioned in an earlier post, I entered data science from an earlier career in simulation and modeling. Even though I have been outside of that field for nearly two decades, I still admire the concept of implementing algorithms that capture current mathematical models of reality to simulate what might happen. I would welcome the opportunity to return to the simulation field to support policy-level decision-making.

Despite this history and personal interest in simulation, many of my posts criticize the concept of model-generated data in decision-making.  I chose to call this stuff dark data in part due to the negative connotation of dark.  There is something sinister about model-generated data.   I came to this conclusion after my more recent experience working with data.

That experience involved a larger project that we divided into two parts where one part was a simulation and the other part was supplying data for the simulation.   Inherent in this project were two distinct worlds: the simulated and the non-simulated.   There was a line drawn between the two.   My role was to work with the non-simulated data, and in particular to collect and verify observations from the real world.   I didn’t think much of it at the time, but part of my scrutiny of data was to view simulated data as a contaminant of the data.   Ironically, I learned to be suspicious of simulated data in observational data because I valued the potential of simulation to produce world-relevant predictions.  A necessary condition for simulations to be relevant to the real world is that input data is purely about the real world.

Feeding simulated data into a simulation does not help us predict what will happen in the real world. In the Ebola example, we were not informed of what would happen when Ebola arrived in USA by substituting data derived from the model that “USA is not West Africa”. We would have been better off without that model at all, accepting the reality that we were completely ignorant of what would happen when Ebola was introduced in USA.

Thinking back on my experience with simulation and modeling, one of the reasons I adored working in the field was that it offered an opportunity to influence decision makers. Historically, simulation and modeling has had a mixed reputation, with successes that encouraged replication of the practice and failures that encouraged its abandonment. Nevertheless, simulation continues to appeal to my scientific and mathematical sensibilities. I have always considered simulations to be my opportunity to influence decision-makers.

The market appeal of simulation and modeling to decision makers is that it will provide relevant evidence.   At the same time, I describe simulation and modeling data as dark data that should be avoided as evidence.  Observed objective data is superior to modeled data.

This is a contradiction in my thinking.  I celebrate the value of simulation and modeling.  I also describe their results as something to reject.  I think I can resolve this contradiction by introducing tense.

In earlier posts, I proposed dividing the sciences into past tense (historic data), present tense (collection of observations), and future tense (persuasive arts). The first two are sciences and the last is an art. I suggested that this would be a more useful way to understand science than the current concepts of Hard Sciences (sometimes equated to STEM) and the other sciences such as the social sciences or humanities. My tense-based proposal shifts the focus away from the explanatory powers of theories and directs it instead to the available data. I believe this way of unifying the view of the sciences is especially relevant in this age of big data technologies.

I argue that simulation and modeling has a valid role only in the future tense of the persuasive arts.  The future-tense is the art presented to decision-makers to persuade them to make a certain decision.  It is also the art that the decision-maker practices to persuade the population that his decisions are proper.

In this division of sciences, simulation and modeling is completely out of place for the present-tense sciences of collecting observations.  The present-tense science seeks the ideal of well-documented, well-controlled observations of the real world.  The ideal observations are completely free of any ambiguity of what happened.

I call this ideal data bright data.   Ideally, our data stores would include only bright data of unambiguous observations of the real world at a particular time.

The bulk of efforts in what we historically called hard or soft science occur in the past-tense science of working with the observations that were previously collected by present-tense science. These sciences have the goal of interpreting the data in order to come up with theories about how the world operates. Although the explanatory power of the theories has varying degrees of success depending on the discipline (physics is very successful, social sciences less so), all of the disciplines value bright data over dim data (ambiguous observations), and dim data over dark data (invented data).

It is in this world view that I condemn dark data. Model-generated data has no place as input data for any attempt to interpret the real world.

Model-generated data properly belongs exclusively to the future tense: what will happen, not what did happen. In my division of sciences into tenses, I labeled only the present tense and past tense as sciences. I labeled the future-tense activity as an art. The future-tense activity is where decision-makers operate. The future tense also describes the goals of simulation and modeling. We use simulation and modeling to answer the questions that start with “what if”. Even when they do not provide answers, they do provide persuasive clues.

This is why I labeled the concept that USA is not West Africa as a simulation. Using models built from experience in West Africa, we conclude that Ebola is less of a risk in USA because several critical conditions in West Africa do not exist in USA. This answers a what-if question for the future tense. Only direct experience answers what will actually happen. Our first experience is that Ebola in fact impacts USA pretty much the same as it does West Africa: it can be fatal to the previously unsuspecting patient, and that patient will infect two others.

I have still not resolved the contradiction that simulation and modeling results are both good and bad for decision making.   In my redefining of human activities into present- and past-tense sciences or future-tense persuasive arts, I placed decision making in the future-tense.   Simulation and modeling is about the future.  Thus decision-making should accept simulation and modeling as a valid source of evidence.  I have not presented a reason to object to simulation and modeling in decision making.   I have not presented a reason to suggest that simulation and modeling is equivalent to fears and doubts (of the unknown unknowns) because both are substitutes for ignorance.   Decision-making is always about what is the best thing to do in the future.  It is an art with the goal of persuasion.

To clarify the contradiction, I propose that decision making concerning the future can be divided into two categories defined by two different approaches to managing risks.

The first type of decision is whether to accept a risk in the first place. A good example is the current debate about whether we should ban travel from countries currently identified as having Ebola (that list now includes USA, by the way, at least until 42 days elapse with no new cases). There is a risk that allowing such travel will allow the virus to spread into the country. This decision is whether we allow the risk at all. The risk is obviously present, even if it is very unlikely. One choice is to ban travel for people with passports from the affected countries, or to not issue travel visas to them. Such a decision recognizes a risk and takes steps to avoid it. Let me call this type of decision a risk-avoidance decision.

The second type of decision is how to prepare for a risk. Again using the current Ebola debate, this decision involves considerations of the best policies for isolating patients and at-risk populations, and the best practices for health workers serving Ebola patients. An example of this type of decision would be to answer how many beds in Ebola-qualified treatment centers we need and where we need them. Let me call this type of decision a planning decision.

In my more recent posts, my arguments best apply to risk-avoidance decisions. Such decisions should separate ignorance from observations. Actual observations (not models derived from observations) are superior to ignorance. Ignorance includes model-generated substitute data as well as fears and doubts about the unknown unknowns. The question about whether to allow travel from countries with current epidemics of Ebola is a risk-avoidance question. It is best decided purely on the observations: an Ebola epidemic exists in some countries, Ebola is contagious, and Ebola does not exist in our country. A risk-avoidance decision considering only these observed facts can conclude that it is reasonable to ban travel by people who were recently in these countries. Introducing modeled data or fears and doubts into this type of decision implicitly makes the decision that we can accept the risk. That is fair, but the question then becomes the second type of decision concerning how to prepare for the risk.

When making a risk-avoidance decision, we should consider only the valid evidence. In an earlier post, I compared interpretation of data with how courtrooms handle evidence. A good example of a risk-avoidance decision is the legal task of determining guilt for a crime. The standard of proof beyond a reasonable doubt has the goal of avoiding the risk of punishing an innocent. I would describe this as a risk-avoidance decision. In law, this type of decision requires very careful selection of evidence to be admissible: the evidence must be relevant, material, and competent. Of the universe of possible data, including model-generated data or hearsay, there is a small subset that is admissible in arguments for a criminal case. This standard has the goal of avoiding the risk of falsely punishing an innocent individual. Certainly, fears and doubts are inadmissible for court cases: we cannot find someone guilty merely because we fear them or doubt what they may do in the future. Similarly, we do not admit simulated data as a substitute for hard evidence, or when we do, we demand very high standards for the credibility and relevance of that simulation.

As in criminal court proceedings, risk-avoidance questions should exclude model-generated data as well as fears and doubts. Questions like the one above of deciding to prevent travel from countries with Ebola epidemics should be based on a restricted set of evidence of what is actually known: a certain country has the epidemic, and the epidemic can spread.

There is, however, the second type of question about planning for what to do when Ebola arrives in this country. Even with a travel ban in place, we can expect that at some point the disease will appear in this country. We need to plan for this possibility. Whether we transport new patients to a few highly specialized centers, or we require every community to designate a hospital to be prepared for this kind of treatment, we need to estimate the number of beds needed.

Here, we need models for how many cases there will be, how rapidly the disease will spread, and where it will occur. These are all future “what if” data. There are no current observations that can supply this information. Even in West African countries with active Ebola epidemics, it has proven difficult to estimate how many beds are needed and where they should be. There are many stories of locations where sudden outbreaks result in the number of patients exceeding the local hospital’s capacity for beds. With the far higher treatment standards in USA, the capacity problem is likely to be equally acute.

Modeling and simulation (model generated data) is especially helpful for planning questions.   For these questions we are not concerned about the risk of something happening, but instead we assume that something will happen in order to be prepared with the right resources in the right places.

We require decision makers to make both types of decisions: risk-avoidance decisions that should limit themselves to the best possible evidence (similar to the standards used in criminal courts), and planning decisions that should employ more evidence, including simulation and modeling, to answer what-if questions. Fears and doubts of the unknown unknowns are also relevant to what-if questions for planning. Planning questions involve our attempts to be prepared for what might happen. Risk-avoidance questions involve our attempts to be reasonable about preventing something from happening in the first place. I think it is best to distinguish what is admissible evidence for these two types of decisions. Risk-avoidance decisions should reject all forms of ignorance-data such as fears, doubts, and model-generated dark data. In contrast, planning decisions will incorporate ignorance-data because our goal is to be prepared to the best of our ability in case the risk does materialize.
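For readers who think in code, the admissibility rule can be sketched roughly as a filter over labeled evidence. This is only an illustration of the distinction, not a tool I have used: the names `Source`, `Decision`, `Evidence`, and `admissible` are hypothetical, the three-way labeling of sources is a simplification, and the "fear" item in the example list is made up for the sake of having one.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    """Where a piece of evidence came from (hypothetical labels)."""
    OBSERVATION = "observation"      # documented observation of the real world
    MODEL = "model"                  # model-generated (dark) data
    FEAR_OR_DOUBT = "fear_or_doubt"  # concern about the unknown unknowns


class Decision(Enum):
    """The two categories of decision distinguished in this post."""
    RISK_AVOIDANCE = "risk_avoidance"
    PLANNING = "planning"


@dataclass(frozen=True)
class Evidence:
    claim: str
    source: Source


def admissible(evidence, decision):
    """Return the evidence that a given type of decision should consider.

    Risk-avoidance decisions admit only observations.
    Planning decisions admit everything, including ignorance-data.
    """
    if decision is Decision.RISK_AVOIDANCE:
        return [e for e in evidence if e.source is Source.OBSERVATION]
    return list(evidence)


# Example items: the first three paraphrase the Ebola discussion above,
# the last is an invented placeholder for a fear or doubt.
items = [
    Evidence("An Ebola epidemic exists in some countries", Source.OBSERVATION),
    Evidence("Ebola is contagious", Source.OBSERVATION),
    Evidence("USA is not West Africa, so Ebola will be milder here", Source.MODEL),
    Evidence("There may be risks nobody has thought of yet", Source.FEAR_OR_DOUBT),
]

for decision in Decision:
    print(decision.value)
    for e in admissible(items, decision):
        print("  ", e.source.value, "-", e.claim)
```

The travel-ban question would see only the two observations, while the question of how many beds to prepare would see all four items. The hard part in practice is the labeling itself, which this sketch simply assumes has already been done honestly.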

