In earlier posts, I described how large data sets can present patterns that suggest new hypotheses, which in turn could lead to new understanding about the world. I call this aspect of data science hypothesis discovery. Hypothesis discovery is when an observed pattern in the data suggests a question to ask. In my view of data science, hypothesis discovery must be followed by independent hypothesis testing (with controlled and well-documented experiments) before it can support decision making. The key value of hypothesis discovery is the identification of a new possibility about a fact in the real world.
In the context of discovering new hypotheses that can lead to learning new things about the world, I argued that model-generated or preconceived assumptions can get in the way. Often our data collections are incomplete. Sometimes we fill in missing data with model-generated data that represents our best guess of what the data would have been had it been observed. When model-generated data becomes part of the data set, the apparent patterns in the data become informed by the models and assumptions. The resulting hypotheses can end up focusing on our assumptions instead of exposing new knowledge about the external world.
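As a minimal sketch of the distinction drawn above, the following Python keeps a flag on every filled-in value so that pattern discovery can be restricted to actual observations. The readings, the mean-based fill, and the names `fill_with_flag` and `dark_fraction` are all assumptions made for illustration; they do not describe any particular data set or tool.

```python
from statistics import mean

# Hypothetical readings, invented for illustration; None marks a missing observation.
readings = [4.0, None, 6.0, None, 5.0]

def fill_with_flag(values):
    """Fill gaps with the mean of the observed values and tag each entry.

    The mean is only a stand-in for "the model's best guess"; any model
    could be substituted here.
    """
    observed = [v for v in values if v is not None]
    guess = mean(observed)
    return [
        {"value": v if v is not None else guess, "observed": v is not None}
        for v in values
    ]

records = fill_with_flag(readings)

# Patterns found in this subset come from the world, not from the model.
observed_only = [r["value"] for r in records if r["observed"]]

# Share of the data set that is model-generated rather than observed.
dark_fraction = sum(not r["observed"] for r in records) / len(records)
```

Carrying the `observed` flag through the analysis makes it possible to ask how much of an apparent pattern rests on values the model supplied.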
In my earlier posts, I used the term dark data to describe model-generated or assumed data that stands in where observations are missing. I noted that this usage is contrary to normal practice, but I still prefer it because it relates to the popular notions of cosmological dark matter and dark energy, which astronomers are confident must exist based on theory despite the lack of any direct observations to confirm them. Most data science projects lack the funding and the broad scientific community needed to rigorously challenge assumptions. As a result, using assumed data in everyday data projects is hazardous because it can mislead policymakers.
The current Ebola crisis provides a good example of the danger of accepting assumed data instead of demanding actual observations. In recent days, there have been a number of Ebola infections in unlikely circumstances.
In news reports of how Eric Duncan (the Ebola patient in Dallas, Texas who recently died) acquired Ebola, the explanation given for his infection was that he helped carry an Ebola patient, although his role was to support the legs while others handled the upper body. There was direct contact with the body and so we assume that is how he acquired the disease. This is an example of dark data: we know he acquired the disease so he must have got it from somewhere, we know he carried an Ebola patient, and we conclude that this must be how he acquired it. This conclusion does not include any evidence that the part of the body he carried in fact had any fluids that could carry viable Ebola virus. Certainly it is possible that the part of the body he carried was in fact contaminated, but we don’t have that as evidence. All we know (at least all I know) is that he touched an infected person, not that he touched infected body fluids. The presentation of the fact that he carried an Ebola patient as the explanation is an example of dark data; it is data that conforms to our theories because the theory provided the data in place of the missing direct observation.
It gets more interesting when we learn that at the time Eric Duncan carried this patient, he did not suspect Ebola. My understanding is that this victim was a pregnant woman who had recently been housebound, making it unlikely she could have come in contact with Ebola. We do not know how she got the disease, but we seek to find some link with our theories. The theory of transmission is that there must have been some direct contact despite the unlikelihood of that contact. Instead of describing her as acquiring the disease from “some unknown cause”, we describe it as “some unknown contact”. Direct contact is how transmission occurs, so if there is a new case we can be sure there was direct contact even though we have no observations and even despite testimony that such contact was unlikely.
The theory of Ebola transmission requires direct contact with still warm and wet fluid from another Ebola patient who has recognizable symptoms of illness. Although we lack direct observation of the initial transmission for most cases, we assume we have found the answer as soon as there is a way to connect the patient to direct contact with an earlier patient. This assumed substitute for an observation reinforces our notion that the disease can only be spread by direct contact. We are told this is the best we can do in generally impoverished and poorly educated areas, despite evidence that these people are very well educated about the disease and modern health care as illustrated by this example.
This explanation of ignorance and carelessness has less credibility when we move on to cases where Western-trained and well-equipped people acquire the disease. Recent examples are the NBC cameraman doing journalism in Liberia, a doctor (I lost the reference) who had a reputation for diligently following careful practices, and now a nurse’s aide in Spain who followed strict protocols in a well-supplied, well-funded hospital during the treatment of a known Ebola patient. In each of these cases, the normal transmission of the disease by direct contact with infected body fluids is very unlikely. But in each case, I learn of after-the-fact constructions of stories that usually suggest some accident or minor mistake. The cameraman might have had some water splashed on him when disinfecting something, and the nurse might have touched her face while removing her protective garments. I haven’t heard a reasonable explanation for the doctor, but the theory is probably that such doctors are likely very fatigued, to the point where some kind of mistake could have happened. In each case, we fill in the missing data with a story that there must have been direct contact. Once we identify the story, it becomes the data point that confirms our theory that only direct contact with symptomatic patients can transmit the disease.
I recognize that this is a reasonable theory backed up with much research about the fragility of this virus to survive outside of the body. However, from a data-science point of view, this assumption-derived story-telling to explain specific modes of transmission offers no additional information. Instead it reinforces our preconceptions by adding to the number of cases confirming the mode of transmission.
When I read about the later explanations from the NBC cameraman and the nurse in Spain, I was reminded of the kind of admissions obtained after lengthy, uncomfortable criminal interrogations. Initially, their story was that they followed careful practices and made no mistake, but later we learned that they might have recalled something that might have been a mistake. That change in story evokes the imagery of criminal interrogations and mild tortures to obtain a useful admission that confirms our suspicion of direct contact with still-viable Ebola virus. Neither of these stories makes much sense to me, because the virus should be dead by the time suggested.
I’m inclined to believe the doctors and journalists who work in known hot zones are extremely careful. They are likely to be even more careful because they have firsthand observations of the severity of this disease. Careless mistakes can always happen, but in these circumstances such explanations should not be our first presumption, one that places the burden of proof on finding some other explanation. However, the reconstructed stories satisfy the experts and there is no alternative observation to contradict their judgement.
My observation is that these interrogation admissions are still dark data under my definition: model-generated data substituted for an absent direct observation. Much more satisfying evidence would have been direct objective observation of the disinfectant spraying by the cameraman, or of the actual instance of the nurse touching her face with the protective garment. I hope even the investigators into these cases would prefer an unambiguous objective record of the actual act that caused the transmission.
In all of the above cases, we have Western-trained professionals deliberately entering an Ebola hot zone. They were trained in the dangers of becoming infected and took reasonable precautions to avoid transmission. However, they did not make an attempt to collect observations of their own activities in case they contracted the disease. Just as we have protective garments and disinfectants, we have technology to video-record their actions for the entire period they are in the hot zone. High-quality video-recording technology is cheap and sold to consumers. In each of these cases, and especially in the case of the nurse treating the Ebola patient in a hospital in Spain, we could easily have video-recorded the person’s actions. We should make a standard practice of continuous video recording of all persons acting in an emergency response role in an Ebola hot zone.
It appears to me that this concept was not even considered. Our confidence in our theory assures us of two things: that the person in the hot zone will take appropriate precautions, and that if an infection does occur it must be due to some accident or mistake. That confidence saves us the extra cost of video-recording the very dynamic activities of the persons operating in the hot zone. It is not trivial to keep a camera focused on the contact-prone activities of caregivers, journalists, and those who disinfect the premises or dispose of contaminated material. However, such recordings can be invaluable for reconstructing what happened and for learning more about the hardiness of the virus.
It is especially disappointing to learn that the nurse in a well-funded hospital was not filmed for the entire period from when she put on the protective garments to the point when she was fully disinfected. That video would be invaluable for telling us whether she really did touch her unprotected face with something that could have been infected. If that did happen, we could trace back to when that part of the garment became infected and measure the amount of time elapsed and the virus’s exposure to dehydration and damaging light. This objective data could tell us more about where the virus may be found (such as in sweat instead of blood or feces) and how long it can tolerate dehydrating and cold conditions. It is equally possible that the nurse’s original testimony is accurate and that she followed the established protocols and still contracted the disease.
In addition, if we had a routine practice of recording everyone operating in a hot zone, we could compare the nurse’s actions with those of others who might have done something comparable and not become infected. We could contrast the cases to see what made the nurse’s action (if she made a mistake at all) more dangerous than the others. We might even have sufficient observations to give us doubts that this action could have been the cause, because we would have abundant other examples where the same thing happened without transmission.
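The comparison described above can be sketched in a few lines of Python, assuming each recording has been reviewed and reduced to two facts: whether the suspected action was seen on video, and whether the person later became infected. The records below are invented placeholders, not real cases, and `infection_rate_by_action` is a hypothetical helper named only for this illustration.

```python
from collections import Counter

# Invented placeholder records, NOT real cases:
# (suspected_action_seen_on_video, later_infected)
recordings = [
    (True, False),
    (True, False),
    (True, True),
    (False, False),
    (False, True),
]

def infection_rate_by_action(rows):
    """Return the fraction infected among those who did and did not perform the action."""
    totals = Counter()
    infected = Counter()
    for action_seen, later_infected in rows:
        totals[action_seen] += 1
        if later_infected:
            infected[action_seen] += 1
    return {seen: infected[seen] / totals[seen] for seen in totals}

rates = infection_rate_by_action(recordings)
# rates[True]  -> rate among people recorded performing the suspected action
# rates[False] -> rate among people never recorded performing it
```

If the two rates turned out to be similar across many real recordings, that would be a reason to doubt the suspected action as the cause; a handful of records like these would of course prove nothing either way.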
My past complaints about using model-generated data were based on the fact that model-generated data prevents us from discovering something new occurring in the world. The first goal of data science is to discover new hypotheses (that require subsequent testing) about the world. We cannot discover new things about the world when we substitute preconceptions for missing data. My practical experience was working with machine-generated data. Humans designed and built the machines, and humans are responsible for deploying and operating these machines. In this environment we should expect no surprises at all, and yet I encountered surprises on a weekly basis. In some cases, the surprise was an unexpected failure condition, but in most cases the problem was a simple matter of slow communication: I obtained observations before I obtained explanations for those observations. The latter examples allowed me to form at least the hypothesis that I should have had the explanation of the human-planned change before I obtained the observation. My point here is that hypothesis discovery is valuable even in the well-controlled world of operating a known inventory of high-quality manufactured machines. Even in this environment, observations of the world will surprise us and teach us something we did not previously understand, or at least something we did not previously think was as important as it turned out to be.
In the case of a viral epidemic, we should expect surprises. Ebola virus occurs over a vast territory and survives in a wide variety of animal hosts. In this circumstance, we can expect that there may be multiple varieties of the virus, including new varieties we have not yet studied in laboratory conditions. Even if we knew all of the varieties, we should expect new mutations to occur that could act differently. The current outbreak is remarkable because it appears more virulent than others. Clearly something has changed, either in the virus or in its encountering populations whose practices are more prone to transmitting it.
Assuming that the virus is the same one we know, the affected population is different from that of previous outbreaks. In particular, this outbreak is occurring in urban areas with better access to Western practices and education. I suspect it is at least possible that modern practices and education may be contributing to the spread of the virus.
It is possible that our understanding of the virus may in fact be exacerbating the spread of the disease. Our recommendations for avoiding transmission may unexpectedly be spreading the disease. Our attempts to avoid the disease may increase the odds that we will get the disease. The evidence of highly skilled doctors and nurses getting the disease strongly suggests a hypothesis that our presumed best practices may in some ways be worse than traditional or ignorant practices.
An example occurs to me in the documentary of an Ebola victim I introduced in an earlier post. In particular, one of the pictures shows the burning of the victim’s bedding. This burning is a modern practice informed by research that says heat destroys the virus. I’m not sure it is a traditional practice, and I suspect many traditional practices would preserve such belongings without burning (perhaps isolating the material for a period of time that incidentally would also kill the virus). What bothers me about this photo is the meagerness of the fire. The fire is small and fueled by light material that will not burn hot or long. Even if all of the infected material is burned, the burning is slow and not very hot. While heat may kill the virus, the smoke from the fire may include material that was not burned or not very highly heated. This recommendation to burn infected belongings may provide a path to launch viable viruses into the air in microscopic droplets that can settle on someone’s body. This can put the entire community at risk, or even neighboring communities that are unfortunately downwind of the fire. While heat may kill the virus, fires like the one depicted may not be the best source of heat: perhaps leaving the bedding out in the sun for a few days would have been more effective.
I very much appreciate this documentary because it provides a wealth of information that gives a more complete understanding of what happens when disease occurs. We need more documentary evidence just like this. Every interaction with an infected patient (or the community with an infected patient) should be documented closely so we can observe common conditions that lead and do not lead to infection.
We need the opportunity to discover a hypothesis that practices informed by our best theories may actually be making matters worse. We can never learn that lesson if we keep substituting theory-based explanations for otherwise mysterious transmissions. The theory itself may be the problem.
Update 10/13/2014: Yesterday brought news of the first transmission of Ebola within the US, to a health professional who was treating the first US Ebola patient (who contracted the disease in Africa). For now, I note that there was an immediate official declaration that this must have been due to a breach in protocols (in other words, this health worker made some mistake), but later officials admitted that they do not know that this happened: “At a subsequent media briefing, Frieden said the healthcare worker had not been able to identify a breach in protocol during his or her contact with Duncan.” (quoted from article). It was irresponsible to jump to an official explanation of a breach of good protocols without any evidence to back that up. Much more important is the fact that they do not have evidence because they did not extensively video-record the health provider during the entire periods when that provider was in the zone where he or she could come in contact with the virus. The lack of such recording suggests a poor protocol. The entire case suggests that we are not taking seriously the possibility that the protocol itself may be helping the virus propagate.