Paying attention to data and predictions teaches the lesson to suspect models

In my recent posts (such as the last one), I wrote about the need for human decision makers to use judgment when considering the evidence in making a new decision. Certainly, the decider must consider the evidence, but we expect him to challenge that evidence with his experienced judgment to recognize the ignorance that lies outside of the evidence or hides within it.

I value experience in decision-making. Those who are in a position to make decisions for groups should have experience to back up their decisions. Experience comes from paying attention to events as they occur. As new information arrives, experience comes from comparing that information with the earlier evidence, assumptions, and ignorance that led to the current circumstances. Paying attention to history as it progresses takes time. Coincidentally, the experienced observer becomes older. This experience is more likely to be found in older leaders simply because they have had more opportunity to observe both the justifications for a decision and its eventual consequences. Regrettably, there are only a few people who pay close attention, but those few are the ones we most welcome as decision-makers. Or at least I welcome them.

“I am not young enough to know everything”  Oscar Wilde.  The recognition of ignorance is one of the lessons of experience.

In my last post, I attempted to distinguish two categories of decisions. One category includes the criminal courtroom experience (in the USA) where there is an overriding need to avoid the risk of punishing someone who is innocent. For these risk-avoidance decisions, we require proof beyond a reasonable doubt and we are very selective about what evidence can be considered. In particular, we reject fears, ignorance, and assumptions (model-generated data) as evidence. A few public policy decisions can also demand such a high standard of proof. As I mentioned here, apparently that includes deciding whether bans of same-sex marriage are constitutional based on how well such bans are supported by hard evidence and evidence alone. On the other hand, it does not include decisions such as whether to impose travel bans from countries with uncontrolled Ebola epidemics. Some decisions demand such high standards and others do not.

The second, more common category of decisions concerns planning. These decisions permit a lower standard of proof, comparable to the legal preponderance of evidence, and a broader consideration of evidence, even evidence of ignorance.

The less controversial form of ignorance evidence comes from model-generated data that substitutes for missing data. The ignorance comes from the fact that the needed evidence is missing. However, we readily accept the authority of our sciences to compute what the data should be if an observation were possible. This occurs so readily that we frequently equate model-generated data with an observation. Certainly, the model-generated data is submitted as evidence and shares the same data stores that contain actual observations. I have written many posts on how I distinguish data by its ability to capture what happened in the real world. I arrange data along a line where the best data is what I call bright data, consisting of well-documented, well-controlled (thus unambiguous) observations of the real world, and the opposite form of data is what I call dark data, which substitutes model-generated assumptions for otherwise missing observations. Although we frequently accept model-generated data as a valid substitute for missing observations, I continue to label this as a form of ignorance.

The more controversial form of ignorance concerns fears and doubts about our ignorance and in particular the problem of the unknown unknowns.

In past posts, I repeatedly claimed that we can automate the decision-maker if we do not require decisions to consider fears and doubts of the unknown unknowns. We can assign some statistical measure to the evidence (including dark data and known uncertainties) and then select the best performing decision based on computed probabilities and confidences. The reason we retain human decision-makers is that we value the role of human judgment in decision-making. That judgment goes beyond the evidence and considers fears and doubts. We don’t automate decision making because we understand that we are always ignorant of some information and that the missing information may be critical for the decision. Judgment considers fears and doubts about the missing data.
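To make the automated case concrete, here is a minimal sketch of a decision-maker that does nothing more than pick the option with the highest expected value. The option names, probabilities, and payoffs are invented purely for illustration, and the expected-value rule is just one possible statistical measure. Note that only outcomes someone thought to list can enter the calculation, which is exactly where the unknown unknowns are left out.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    # Each outcome is a (probability, value) pair. These are hypothetical
    # numbers, including any dark data and known uncertainties.
    outcomes: list


def expected_value(option):
    """Probability-weighted sum of the listed outcomes."""
    return sum(p * v for p, v in option.outcomes)


def choose(options):
    """Select the option with the highest expected value."""
    return max(options, key=expected_value)


# Hypothetical proposals presented to the automated decision-maker.
options = [
    Option("act now", [(0.7, 10.0), (0.3, -20.0)]),
    Option("wait", [(0.9, 2.0), (0.1, -5.0)]),
]

best = choose(options)
print(best.name, round(expected_value(best), 2))
```

Nothing in this procedure can express a doubt about an outcome that is missing from the list, which is the gap I expect human judgment to fill.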

The complaint about fears and doubts is that they are immature. The youngest children have abundant fears and doubts that they later learn are unreasonable. We dismiss fears and doubts as a sign of immaturity. But we embrace the concept of judgment or wisdom. One interpretation of wisdom or judgment is extensive knowledge covering every aspect of some domain. An analogy may be the wisdom of a grandmaster in chess who so thoroughly understands the game that he can recognize opportunities and risks given any particular arrangement of pieces on a chessboard. This is pure knowledge about what moves are possible and the effectiveness of various tactics or strategies. We could expect this depth of knowledge from our decision-makers and call it wisdom and judgment. However, computer scientists have demonstrated that purely computational algorithms can play chess well enough to compete with grandmasters. Chess does not require a human to play at grandmaster level. If all we expected from decision makers was extensive knowledge, then in this age of big data and statistical analytic algorithms we could automate decision-makers and obtain more consistent decisions without the risk of human foibles. Or at least, we can expect such automated decision-making to be available soon.

I believe we expect more of the wisdom and judgment of decision-makers than just extensive knowledge. In my opinion, wisdom and judgment are the ability to recognize and challenge the ignorance behind the different proposals. That ability to recognize the ignorance underlying arguments comes from the experience of seeing good arguments fail in practice.

Ignorance can come in the form of hidden variables, the things we have not yet recognized as contributors to an outcome. We know nothing about these possibilities because we don’t even know the possibility exists. These are the unknown unknowns. Experience teaches us to expect surprises. I described this type of ignorance as being expressed as fears and doubts. The decision-maker should be able to recognize the fears and doubts in the arguments of proposals presented to him. Also, the decision-maker should have his own fears and doubts about missing variables in otherwise strong arguments in favor of some recommendation.

Ignorance can also hide in the preconceived assumptions that provide data where observations are lacking.  I described this as dark data that appeals to science or math to supply answers when we are unable to see these directly from nature.   While the science may be solid, ignorance remains in the fact we are unable to measure it directly.   At a minimum, the decision-maker should recognize the superior relevance of direct observations over substituted model-generated data.

We start life with ignorance, we then replace ignorance with knowledge through youthful education and training, but then we learn a mature form of ignorance as we observe what happens when we attempt to put that education and training to use.  Appreciation of ignorance is an important component of wisdom and judgement we need in good decision-makers.

Although the above discussion focuses on the decision-maker, I believe the same learned ignorance is essential for good data science in general.   I am using my own concept of data science as part of the human experience with historical sciences (interpreting historical evidence).  Note that my definition contrasts with the currently popular definition of data science as a subset of computer science involving the implementation of databases and fast algorithms to analyze data.  My concept of data science focuses on the data and takes the implementation details as a given.   Before the age of computers, we were reliant on our senses of sight, touch, smell, etc, in order to interpret evidence, but we interpreted that evidence without spending the bulk of our time studying what made these senses possible.  We simply took the senses for granted.   Similarly, we should take for granted that databases and analytics will exist as new senses for the human mind.  My definition of data science is how we intellectually engage these new senses.

The focus of data science should be the data. In an earlier post, I described a taxonomy of different types of data and used the metaphor of brightness to describe the relevance of the data to the real world. The inverse of brightness in this metaphor is ignorance. Bright data lacks ignorance, while dark data substitutes for ignorance. That discussion did not include the additional ignorance underlying fears and doubts because those do not typically occupy space in databases. Although data may include special designations for missing data such as null values, fears and doubts involve missing dimensions: entire variables or categories are missing for us to consider.

My motivation for describing different levels of brightness (or conversely ignorance) of data was to justify the need for intense labor investment in the data science skills of scrutinizing and challenging the data. The more obvious scrutiny is to check for the things we know can go wrong. For example, some sensors may fail, either by ceasing to produce observations or by starting to produce erroneous data. The less obvious scrutiny is to look for contradictions that emerge over time. An example of such a contradiction is finding that direct observations that were previously unavailable are not consistent with the model-generated data we previously used in their place. Another example is some updated observation that begins to contradict earlier interpretations. This second form of scrutiny requires more intense labor because it requires careful and continual observation. We need to look at the data every day and challenge ourselves to reconcile our current interpretations with our earlier interpretations.
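Both kinds of scrutiny can be sketched in a few lines. The following is only an illustration under assumptions of my own choosing: the function names, thresholds, sensor readings, and site labels are all hypothetical. The first two checks cover the known failure modes (a sensor that stops reporting, or one that reports implausible values). The third compares model-generated values with direct observations that arrived later.

```python
def find_gaps(timestamps, max_gap):
    """Return (previous, next) pairs where consecutive readings are
    farther apart than max_gap, suggesting the sensor stopped reporting."""
    return [(a, b) for a, b in zip(timestamps, timestamps[1:]) if b - a > max_gap]


def find_out_of_range(readings, low, high):
    """Return (timestamp, value) readings outside the plausible range."""
    return [(t, v) for t, v in readings if not (low <= v <= high)]


def find_contradictions(modeled, observed, tolerance):
    """Return (key, modeled, observed) where a later direct observation
    differs from the model-generated value by more than tolerance."""
    shared = sorted(modeled.keys() & observed.keys())
    return [
        (k, modeled[k], observed[k])
        for k in shared
        if abs(modeled[k] - observed[k]) > tolerance
    ]


# Hypothetical sensor readings as (timestamp, value) pairs.
readings = [(0, 20.1), (1, 20.3), (2, 95.0), (6, 20.2)]
gaps = find_gaps([t for t, _ in readings], max_gap=2)
bad_values = find_out_of_range(readings, low=0.0, high=50.0)

# Hypothetical dark data (modeled) versus later bright data (observed).
modeled = {"site_a": 100.0, "site_b": 40.0}
observed = {"site_a": 60.0, "site_b": 42.0}
contradictions = find_contradictions(modeled, observed, tolerance=10.0)

print(gaps, bad_values, contradictions)
```

The first two checks can run unattended. The third only produces a list of discrepancies; deciding what each one says about our earlier interpretation is the labor-intensive part.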

We must challenge historic interpretations to explain current data in order to discover the ignorance of our earlier interpretation. We gain wisdom or judgment from this combination of two observations: our earlier confidence in an interpretation, and our later recognition of ignorance in that interpretation. This learning of wisdom (learning of ignorance) takes time and effort.

I often noted that the way to learn data science is to do data science over a long period of time because the real lessons occur rarely when something goes wrong that contradicts our confidence gained by a long stretch of success.    The training of both data-scientists and decision-makers involves a lengthy process to observe the occasional surprises that contradict previous preconceptions.   Both occupations benefit from regular practice of careful observation and of comparing past interpretations with current ones.   Over time, such practices accumulate a deep appreciation of ignorance that we should consider when making new decisions or interpretations.

In the following, I will illustrate these ideas using the example of the current Ebola crisis.

In several recent posts, I shared my thoughts after reading news reports about the Ebola virus epidemic in West Africa. My primary goal in these posts was to illustrate my data science perspective as it applies to data pulled from popular news reporting. I have no expertise in diseases, epidemics, or medicine. I am only making observations about the public reports about the data.

I have been paying attention to Ebola news for a period of a couple months but already I’m observing contradictions where previous interpretations do not seem to explain current information.

For example, I just encountered this article that suggests the Ebola epidemic in Liberia may be declining. This confirms earlier reports, such as this one, which states “the rate of new infections in some areas has slowed down”. I have seen this reported in passing in many articles. In recent weeks the documented new cases of the disease have not been increasing as rapidly as initially expected.

In addition, we learn that Nigeria and Senegal have recently been declared to be disease free after previously having cases in their countries. The fact that two West African nations have succeeded in stopping this virulent form of Ebola contradicts the frequent claim that these nations have practices that are unable to handle this kind of disease. If the rumored trend of declining cases turns out to be true in Liberia, this will be an even more startling result because the earlier reporting was that Liberia’s public health infrastructure faced even greater challenges.

These trends contradict the alarming warnings that this epidemic will soon be infecting 10,000 new patients every week, with the total approaching 1,000,000 cases by the end of the year. Such a catastrophe could be averted if Liberia is in fact getting the disease under control.

For now it is prudent to continue planning for the worst case scenarios. There is some reason to suspect that new Ebola cases in Liberia are being under-reported in key areas where the health systems are overwhelmed. It is premature to discredit the models of the virus’ spread and of our inability to control it.

However, it is fair to observe that there is at least some reason to doubt these worst case assumptions. The disease may not be spreading as fast as thought, and the West African nations may be more competent in controlling the contagion than we imagined. Earlier we were assured this was potentially a global catastrophe in the making. At least the news that reached my humble attention suggested near certainty of millions of deaths within a few months. The epidemic is going to have to spread with geometric growth to reach the projected levels. Recent observations are not confirming that geometric growth.
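One simple way to check for geometric growth is to look at the ratio of each week's new cases to the previous week's. Under geometric growth that ratio stays roughly constant and above 1. The sketch below is my own illustration: both weekly series and the starting level of 1,000 cases per week are invented numbers, not reported counts. Only the 10,000 per week target comes from the warnings mentioned above.

```python
import math


def weekly_ratios(new_cases):
    """Week-over-week ratio of new cases; roughly constant and above 1
    when growth is geometric."""
    return [b / a for a, b in zip(new_cases, new_cases[1:])]


def weeks_to_reach(current, target, ratio):
    """Whole weeks for weekly cases to grow from current to target at a
    constant ratio. Returns None if the ratio implies no growth."""
    if ratio <= 1:
        return None
    return math.ceil(math.log(target / current) / math.log(ratio))


# Hypothetical weekly new-case counts, for illustration only.
geometric = [100, 150, 225, 338]  # ratio stays near 1.5
slowing = [100, 150, 180, 171]    # ratio falls and then drops below 1

print(weekly_ratios(geometric))
print(weekly_ratios(slowing))

# Weeks needed to go from an assumed 1,000 to 10,000 new cases per week.
print(weeks_to_reach(1000, 10000, 1.5))
print(weeks_to_reach(1000, 10000, weekly_ratios(slowing)[-1]))
```

If the observed ratios drift downward the way the second series does, a projection built on a constant ratio never reaches its target, no matter how confident the original forecast sounded.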

I use the above example as an illustration of observing the data instead of simply consuming it. A competent observer compares the current interpretations of new data with the previous interpretations of older data. Certainly the models will improve over time. However, our observations teach us that our earlier models were deficient despite our earlier confidence in those models. If by the end of the year we do not see 10,000 new cases per week and more countries are declared to be free of Ebola, then we must remember that we had not predicted this outcome as being very likely. Remembering our earlier confidence teaches us that the confidence was misplaced. This is what I mean by learning about our ignorance.

Another example concerns the deadliness of this particular disease.   I recall an earlier article that explains Ebola’s deadliness as a result of the virus initially being able to proliferate without detection by the immune response.  The immune response occurs only after massive cell death in organs and by that time the alarm signals to the immune system are so intense that the immune response over-reacts.  In many cases death is due to the consequences of this over-reaction by the immune response.  Here is another similar explanation.

This explanation suggested to me that this evasion of the immune system would mean that there will be no symptoms of illness until the virus proliferates to the point where it is almost too late for the immune system to do anything about it. Consequently, I imagined there being no mild illness from an early immune response attempting to stop the virus before it causes extensive tissue damage. The virus is able to evade the initial immune response, and so any mild symptoms would be experienced only briefly as part of the building over-reaction to an already widely proliferated virus. By the time symptoms are noticeable, it is already too late to prevent extensive tissue damage and it will take a long time to eradicate the virus from the body. I admit this interpretation is based only on news reporting that may be overly simplified, but that is the model I am using as I observe new developments.

Recent news reports state that the NBC cameraman, the nurse in Spain, and one of the Dallas nurses (and now the other nurse as well) have been found to be virus free after a relatively short treatment period. In each of these cases, the patients were diagnosed with Ebola only after self-recognizing the immune-response symptoms of illness. Using the above model, I expected these symptoms would not have been noticeable until the virus had started doing extensive damage in the body. While it is possible that our treatments are very effective, it seems more likely that these patients benefited from treatment before the virus proliferated to the point of causing massive cell death and producing a cytokine storm. Each of these patients self-reported illness symptoms before being diagnosed with Ebola. That feeling of illness indicated that the Ebola virus had already triggered their immune responses at an early stage, before it had a chance to do much damage.

The model may still be relevant by suggesting that any initial immune response would be ineffective against the virus, but I still observe a contradiction, because I was initially led to believe that the immune response would have been completely evaded until it was too late. These numerous examples suggest that the immune response is active well before triggering an immune-system over-reaction. Comparing recent reports with my earlier interpretation informed me of some ignorance about how Ebola affects the body. That ignorance may be a consequence of my limited access to information from the necessarily over-simplified narratives required for publication, but it seems to me that the logic of an immune system’s potentially lethal over-reaction strongly suggests that the immune system somehow missed its earlier opportunity to stop the virus. If that were true, it seems unlikely that all of the above patients would have been so readily cured.

The two examples are related because we are told that the disease is contagious only when the patient starts showing symptoms. If it is true that the disease evades the immune system long enough to widely proliferate through the body, then any sign of a symptom would indicate that the virus is so widely proliferated that all of the patient’s body fluids would be highly infectious. But the examples of the quick cure (complete elimination of the virus) of symptomatic patients suggest that the virus was not very plentiful when the symptoms first appeared. People walking around with flu-like symptoms may not be as infectious as imagined. If this is true, then perhaps we are over-estimating the rate of spread of the disease and the geographic regions at risk for new cases. The disease may be infectious only during the later stages, when the symptoms would immobilize the patient. Walking around with early flu-like symptoms may carry a very low risk of infecting others (unlike common flu, which can be infectious a day before symptoms appear). There appears to be an early immune response that occurs before the disease is infectious. This observation suggests that the epidemic may be easier to manage than predicted by the current models.

The above examples are not representative of actual decision making or data science because I am working only with public information available from popular news media. Undoubtedly the professionals have access to far more extensive data and elaborate models about both aspects of the disease. Although the above discussion raises some suspicions about several models of Ebola (how well West African nations can cope with the epidemic, how quickly the epidemic will spread, or how ill symptomatic patients actually are), my primary motivation for this post is to use popular news reporting to illustrate the concept of learned ignorance.

By paying close attention to new observations and comparing them to past interpretations, we are able to gain a better appreciation of our ignorance. Our models may be imperfect. There may be previously unknown variables that can have major consequences for the future. We need to pay attention in order to notice when new information fails to confirm old expectations. Observing new data that contradicts previous interpretations teaches us respect for our ignorance. In particular, such experiences develop a mature sense for the fears and doubts to apply to future decisions.

