In a recent post, I criticized the reluctance of the big data community to be more proactive in identifying ways to employ big data techniques to help control the Ebola outbreak in western Africa. Instead, the bulk of their attention goes to selling their wares as added value for business gains. When it comes to saving the world, they are more likely to look for ways to address some remote danger, such as finding asteroids at high risk of hitting Earth, something that occurs every few thousand years or so.
Currently, we have a very immediate and very certain crisis: an out-of-control epidemic of a deadly virus for which we have neither a practical cure nor a vaccine. Given this circumstance, the most effective defense is to protect the healthy population from contracting a virus that is now being spread by afflicted persons. The epidemic is still in the early stage, when there is a huge population of healthy people who are at risk of contracting the disease. The urgent need is to stop the spread of the epidemic to healthy individuals.
The population of healthy individuals is huge. If this population were in a more affluent area, there would be abundant opportunities to obtain meta-data that can support analytics to identify useful recommendations. Some possible big data discoveries may include:
- communities that are at high risk of contracting the disease
- local customs or traditional practices that exacerbate the spread by making contraction simpler and more widespread
- identification of the path of transmission from animals to humans
- identification of the next outbreak so that emergency care centers may be set up in advance to handle the expected cases
The excuse for the lack of big data participation is that the available data about the population is not enough to constitute big data. In the affected areas, only a small minority has access to social media applications that can provide a source for big data. The data that is available is biased because it comes from affluent urban dwellers who are mostly distant from where the disease is spreading. Analytics for effective responses to this epidemic need data on the entire population, not just the tiny minority of people who are currently active on mobile or social media applications.
Although big data cannot help if there is no data to analyze, data scientists can get involved by identifying opportunities that would be possible if the data were available, and by committing to delivering those results if the data were to become available. With that information and commitment, there will likely be motivation to supply the emergency infrastructure to collect this data.
So far, it appears that most efforts to apply data science help governments treat the sick and safely handle the deaths and hazardous waste. Certainly, there is a lot of value in saving costs or optimally allocating very limited resources to best handle the outbreak. However, I believe there is much more potential value in finding ways to keep the much larger healthy population from contracting the disease.
For example, one of the severe challenges of this particular disease is that it aggressively spreads and attacks the body’s organs in a way that does not exhibit any recognizable symptoms. Apparently healthy people may in fact be infected, with the disease already causing great damage to their bodies. Often by the time the symptoms appear, it is too late to reverse the damage already done by the virus. Big data can help by identifying likely disease transmissions before the symptoms appear.
An ideal application of the highly promoted predictive analytics is in identifying the afflicted before they develop their symptoms and become contagious. Big data on healthy populations has the potential to provide that kind of predictive result. Given the advertised efficacy of predictive analytics, this is the ideal opportunity to use it for social good.
The recent emergence of the first Ebola case in the US illustrates this lost opportunity. While the exact details of this specific case are still emerging, it has demonstrated our planned response for handling such cases. The CDC response relies heavily on something called contact tracing, which identifies individuals who may have been exposed to the risk of contracting the disease because of contact with the patient. From my remote perspective, reading only news accounts, I see a major problem with this plan: contact tracing appears to be very labor intensive. For a single case of Ebola, we hear of teams of investigators being dispatched to trace the contacts and then of a team committed to monitoring, for the next 21 days, those identified as being possibly infected. I observed officials strongly asserting very precise conditions required to be identified as possibly infected. While I recognize this was partly to calm the population’s fears, I also heard that a broader definition would overwhelm the contact tracing response. Between the time this individual became infected and the time Ebola was diagnosed, the chain of contacts could easily reach millions of people across all continents due to his use of multiple airline flights and terminals.
CDC’s claim that contact tracing must be used judiciously implies that it must be rationed. The current approach for contact tracing appears to be very labor intensive from a very small pool of trained investigators.
In this era of big data, it is clear we should be able to query existing data to identify every transit and facility this infected patient used and every individual who used the same facilities after the patient used them. There are not enough health care resources to monitor this population for the next 21 days to detect possible transmission of the disease.
Unfortunately, I fear this is going to happen. In an earlier post, I discussed the phenomenon of customers demanding direct access to the 3-Vs of big data to make their own decisions. People are increasingly aware of the abundance of big data in modern society and the availability of technology to rapidly query that data. At the same time, people are observing a disintegration of accountability from leadership, especially when those leaders explicitly excuse themselves by being overwhelmed by the data. The inevitable consequence is that people will demand access to this data to query for their own purposes. In this example, there is no reason why everyone cannot learn the minimum number of contact separations between them and this patient. There are millions of people who are now connected to this patient, and each of them has some probability of contracting this virus. I readily admit that this probability is minuscule, but it is certainly not zero.
The government fears what will happen when millions of people suddenly demand check-up appointments because of their impression of an unacceptably high risk of catching the disease, or what will happen when this population starts showing up in emergency rooms at the first sign of body discomfort.
We live in the age of big data, and there is no escaping that people will learn from big data their surprisingly close degree of contact with an infected patient. This alone calls for a counter-balance of predictive analytics to provide a more credible prediction of risk than the current human approaches using written checklists more appropriate for early 20th century medicine than today.
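To make the idea of a minimum number of contact separations concrete, here is a minimal sketch, assuming contact records were available as a simple mapping from each person to the people they came in contact with. The names and the tiny contact table are made up purely for illustration; nothing here reflects real data or any agency's actual method. The sketch uses an ordinary breadth-first search, which finds the fewest contact links between the patient and everyone reachable from him.

```python
from collections import deque


def degrees_of_separation(contacts, source):
    """Return the minimum number of contact links from `source` to each reachable person.

    `contacts` maps a person to an iterable of people they had contact with.
    Breadth-first search visits people in order of increasing distance, so the
    first time a person is reached is along a shortest chain of contacts.
    """
    distance = {source: 0}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        for other in contacts.get(person, ()):
            if other not in distance:
                distance[other] = distance[person] + 1
                queue.append(other)
    return distance


# Hypothetical contact records, invented for this example only.
contacts = {
    "patient": ["seatmate", "nurse"],
    "seatmate": ["patient", "coworker"],
    "nurse": ["patient"],
    "coworker": ["seatmate", "neighbor"],
    "neighbor": ["coworker"],
}

separation = degrees_of_separation(contacts, "patient")
for person, degree in sorted(separation.items(), key=lambda item: item[1]):
    print(person, degree)
```

The point of the sketch is how little effort the query takes once the data exists: the hard part is the data, not the computation.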
The CDC’s modern response resembles the depictions of some family-practice doctor making house calls in the 1940s. It is really no wonder that the above patient was sent home due to mild flu symptoms after his first visit to the emergency room. Medicine still respects the proverb that if you hear sounds of hooves, think horses instead of zebras. The expectation that most people have what most people have does not apply when facing an epidemic where most people will have what most people will have.
As noted, current big data opportunities will inform people of their degree of contact separation from the infected person before symptoms appeared. What we lack is modern, big-data-justified predictive analytics to provide reassuring numeric probabilities and confidence levels for the risk of contracting the disease. In this day of big data, it is hard to accept the CDC’s assertions of zero risk based on heuristic checklists of subjective observations. We need hard statistical predictions of risk based on real data that combines our degree of separation from the patient with historical observations of similar degrees of separation when infection does occur.
Personally, I accept the CDC argument that their heuristic approach does in fact capture this prediction based on actual experience with this disease. If I develop flu-like symptoms tomorrow, I will not conclude I have Ebola. I accept this because I trust the CDC and it will be held accountable if it gets it wrong. However, I have some doubts in my trust. Given the data available to the CDC and the subjective observations that go into their heuristic risk model, their confidence in a risk of zero seems too high.
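As a rough sketch of what such a numeric prediction could look like, the following estimates an infection rate with a 95% confidence interval for each degree of separation. It assumes we had historical counts of exposed and infected people at each degree, which is exactly the data we lack. The counts below are placeholders I made up so the code runs, not observations. I use the Wilson score interval only because it behaves sensibly for small counts and for zero observed infections; it is one reasonable choice, not the established method for this disease.

```python
import math


def wilson_interval(infected, exposed, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% confidence)."""
    if exposed <= 0:
        raise ValueError("exposed must be positive")
    p = infected / exposed
    denom = 1 + z ** 2 / exposed
    center = (p + z ** 2 / (2 * exposed)) / denom
    half = z * math.sqrt(p * (1 - p) / exposed + z ** 2 / (4 * exposed ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# PLACEHOLDER counts, invented for illustration: degree -> (infected, exposed).
hypothetical_history = {1: (3, 40), 2: (1, 400), 3: (0, 500)}

risk_by_degree = {
    degree: wilson_interval(infected, exposed)
    for degree, (infected, exposed) in hypothetical_history.items()
}

for degree, (low, high) in risk_by_degree.items():
    print(f"degree {degree}: risk between {low:.4f} and {high:.4f}")
```

Note that even a degree with zero observed infections gets an upper bound above zero, which is the kind of small but nonzero statement I would find more credible than a flat assertion of zero risk.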
We need better predictive analytics that can account for real objective data such as who used the same facilities as the infected person, how close they were to touching the same surfaces he touched, or who may have contacted people who had those contacts. In order to satisfy the goal of the prediction to identify likely transmissions for the disease, we need data from where the disease is actively spreading. We need data from Africa. That data does not exist.
Another appeal for calm from government is that the US (and similarly affluent societies) is not like the affected areas in Africa. Certainly, the combinations of customs and lack of wealth create unique circumstances that are not replicated here. On the other hand, the affected areas are very diverse with many customs where each may have some overlap with the conditions in affluent nations. For example, one consequence of the diversity is the animosity between competing traditions that may result in examples of a patient’s handling involving better sanitary practices (not necessarily for that patient’s benefit) that may suggest unexpected routes of transmission.
My earlier Ebola post presented a challenge to the data science community to direct their valuable tools toward the problem of containing this epidemic. That post implied that this participation is optional. With this post, I think the employment of big data predictive analytics is not optional. This disease will spread to affluent areas where people will learn of their degree of contact separation from the infected individual. We urgently need predictive analytics to inform these people of the data-verified quantitative (small but nonzero) risk of contracting the disease given that degree of contact separation.
The problem is that Big Data is a paper tiger when it comes to the crisis of keeping healthy people from getting Ebola in Western Africa. The linked Wikipedia article references a quote by Mao Zedong:
In appearance it is very powerful but in reality it is nothing to be afraid of; it is a paper tiger. Outwardly a tiger, it is made of paper, unable to withstand the wind and the rain. I believe that it is nothing but a paper tiger.
He was directing his comparison at the United States at the time, but replace United States with Big Data and replace “nothing to be afraid of” with “nothing to count on in a crisis”. Big data is a paper tiger when it comes to an urgent humanitarian crisis like Ebola that involves a huge population, supposedly the kind of problem that Big Data is best suited to address. The quote suggests the paper tiger can be blown away by wind or disintegrated by rain, but I’d add that a fire would turn the tiger into an inferno. The fire option captures what is happening with the first reported case of Ebola emerging in the United States: big data facilitated the awareness of the degrees of separation from an infected individual, thus raising alarm and risking panic (which is already turning into a minor inferno in terms of airline stock valuations), but big data offers nothing of constructive value to counterbalance the panic.
We don’t have abundant information on healthy populations in the region that is most exposed to the epidemic. As a result, we do not have a lot of observations of exactly how people acquire the disease. In earlier posts (such as this one), I spent some time distinguishing types of data. The best data are direct observations that are well documented and controlled. The worst data (for observing the world) is when we replace missing observations with model-generated substitute data. I named these two extremes bright data and dark data, respectively, to illustrate this spectrum. Big data could help in a crisis like Ebola if we had bright data of actual observations of healthy populations that over time have some members acquire the disease. Instead, all we have is dark data from the CDC assuring us that the virus can only be transmitted through direct contact with body excretions of a patient already exhibiting symptoms that may at first appear indistinguishable from a flu. In addition, the CDC assures us that sanitary practices of keeping a safe distance and frequent hand washing can prevent the spread. This is all dark data based on scientific understanding of the disease learned in the laboratory or from a few case studies. We are making decisions about Ebola based on our potentially incomplete understanding of how it should behave.
Model-generated data is data that applies our preconceptions where we have no data. The preconceptions may be based on scientific analysis, but that analysis requires meeting statistical tests that could lead to overly optimistic perceptions. For example, confirming a mode of transmission of a virus may require meeting a statistical test that shows that the observations did not come from chance. In this case we may simply lack sufficient evidence of other modes of transmission to pass the statistical test. It is fair to say science fails to prove other forms of transmission, while at the same time it may be unfair to use this information to replace missing data.
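The gap between "not proven" and "ruled out" can be shown with a little arithmetic. The sketch below assumes each exposure is independent and that some alternate mode of transmission has a fixed true rate; both are simplifying assumptions of mine, and the rate of 1% and the 50 observed exposures are arbitrary numbers chosen only to illustrate the point. The first function gives the chance that we would observe no such transmissions at all even though the mode is real. The second gives the largest rate that zero observed events still cannot exclude at a chosen confidence.

```python
def chance_of_seeing_nothing(true_rate, exposures):
    """Probability of zero events in `exposures` independent trials at `true_rate`."""
    return (1.0 - true_rate) ** exposures


def upper_bound_given_zero_events(exposures, confidence=0.95):
    """Largest rate still consistent, at `confidence`, with zero events in `exposures` trials.

    Solves (1 - rate) ** exposures = 1 - confidence for rate.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / exposures)


# Arbitrary illustrative numbers: a 1% alternate route and 50 documented exposures.
missed = chance_of_seeing_nothing(0.01, 50)
bound = upper_bound_given_zero_events(50)
print(f"chance of observing no alternate transmissions: {missed:.3f}")
print(f"rate not excluded by zero observations: {bound:.3f}")
```

With only a few case studies, a real but uncommon route of transmission would more likely than not leave no trace in the data, and the absence of evidence would still be compatible with a rate of several percent.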
Unfortunately, that is exactly what we are doing today. We see a patient with confirmed Ebola infection and assume based on our science that that patient must have come in direct contact with some excretion from an infected patient. In fact, we have no direct observation of the exact mode of transmission. Just yesterday, a news report announced that a news cameraman has tested positive for Ebola. I don’t know the exact details, but I assume that because the crew is being sent back to the USA, they are citizens of the USA and they have a western education. They probably understand the science of Ebola spread in addition to their general education on hygiene. This was a cameraman, a person whose job requires distance from the filmed subject. While it is possible he performed some physical work in setting up a scene for a good camera angle (which would be dishonest for a news report), good journalism practice would suggest he would keep a distance from the subject being reported. I am very skeptical that this cameraman picked up the disease through the unhygienic local traditions usually blamed for its spread. It is dark data (model-generated data) to assume that he must have contacted sweat, vomit, or feces of an infected patient. The bright data is that a western cameraman as part of a western journalism team acquired the disease. However the transmission occurred, it did not occur as a result of the cameraman’s ignorance.
This one report of a western-trained journalist cameraman acquiring a disease is a bright data point. Even if he came in close proximity to an afflicted patient, he most likely avoided doing anything that would put him at risk. Again, my information is limited; perhaps he did do something, either foolishly or accidentally, that put him at risk. In any case, this is a bright data point.
To enjoy the benefits of pattern discovery and predictive analytics of big data, we need many more bright observations like this one, where we know the conditions of the patient before he was infected and what he was doing when the transmission may have occurred. The truth is that we do not have this data, so we instead fall back on scientifically proven modes of transmission and assume that one of them must have happened in this case.
Assume, for example, the cameraman was actually involved in an activity that involved direct contact with an Ebola patient. The science would assert that this must have been the path for transmission even if the actual transmission was something unexpected, such as touching a dry surface that the Ebola patient previously touched. We cannot tell from this one case that this happened. However, if we had data on a large number of cases where a larger number of people came in close proximity to an Ebola patient, we might observe patterns that show many incidents of transmission involving individuals with less contact and many incidents of no transmission where there was direct contact.
Big data analytics offers its greatest benefit by finding patterns in actual observed data that is not contaminated by preconceived or scientifically validated assumptions. Although science gives us confidence that direct contact with infected symptomatic patients is a path for transmission, it does not give us similar confidence in ruling out all other paths. All that science can tell us is that we lack data for other paths that would pass our statistical tests that the results did not come by chance. With big data, if that data were available, we could begin to observe patterns across multiple scenarios where the patterns are inconsistent with the established theories. At a minimum, this data could raise doubts about how well we understand the infectious nature of this disease.
The impediment to enjoying the benefits of big data analytics is not only that we do not have the data, but that we will never have that data. We are in the middle of an immediate humanitarian crisis involving large, diverse populations. Confronted with this crisis, big data shows up as a paper tiger, unable to help because there is no helpful data and, worse, contributing to the crisis by exacerbating a panic with degrees-of-separation contact data that is abundantly available. The paper tiger doesn’t bite, but it does burn.
We will never obtain the data needed for big data to deliver its highly advocated benefits. In the above case of the cameraman, we are not certain the patient will admit to doing something foolish (if that occurred), especially if it implicates him in journalistic dishonesty or malpractice. In the earlier case of the first reported Ebola patient to become symptomatic in the US, there is good evidence to suggest he did not fully disclose his contact with a gravely ill patient when he boarded the flight or when he first visited the hospital.
Although these two cases may involve some withholding of information, the real problem is that these are just two cases. The benefit of big data comes when we collect data over large populations. Even if we assume pessimistically that every individual will be dishonest, they will be dishonest in different ways because they have different objectives for their dishonesty. A journalist may be dishonest to hide professional misconduct, while a person wishing to fulfill his vacation plans may be dishonest to assure passage on an airline trip. Combined over a large population, the individual motivations will tend to cancel each other out. Truly big data on the entire population can permit us to discover new patterns that suggest other ways of transmission or how the transmission changes depending on the severity of the patient’s symptoms.
Again, big data cannot help with this crisis because we will never have this data, not today nor any time in this century. In my earlier post, I suggested a philanthropic approach of distributing solar-recharging smart phones with low-earth orbiting cell stations to connect most of the population of Western Africa to social media sites that can begin to collect data that we may use to learn about the progression of the disease. Technically, it is feasible to flood the area with technology and begin analyzing the data.
The problem is that the population will not participate. If they do not participate, we will not get the data we need. After writing the earlier post, I realized I had made an error by assuming that it is only the lack of wealth that discourages participating in social media on mobile phones. That same post listed some cultural problems of distrust where they suspect malevolence in our generosity to help them treat patients and stop the spread. This suspicion is an extension of the justifiable suspicion they have of malevolence from anyone outside of their communities or tribes. This is a region of unstable governments with frequent rebellions or internal fighting.
The people most prone to acquiring and spreading the disease are also the people most prone to encountering violence or abuse from their neighboring communities. Even if they had access to social media technologies and saw a benefit for their own needs, they would understand that this data will be available outside of their communities. There is a risk that the data can fall into the hands of their local enemies.
The entire goal of big data in this context would be to identify practices that separate different communities and that have different outcomes in terms of the spread of this disease. Even anonymous aggregated data about communities could expose vulnerabilities or transgressions that the community needs to keep secret. Also, this data will persist forever, and thus may be abused long after this crisis has subsided.
We have a hard enough time convincing people to not attack or impede our benevolent charity in treating patients and stopping the spread of the disease. I don’t see how we can convince them of the technically sophisticated ways that will prevent this data from being used against them sometime in the future and for purposes having nothing to do with an epidemic. We don’t even have that confidence here.
The populations we need data on may be poor, but they are intelligent. They have secrets and justifiable reasons (on local terms) to protect these secrets. They will not voluntarily divulge information, as in the case of the first Ebola patient in the US not immediately disclosing recent contact with a sick individual. They will also go out of their way to avoid exchanges of information, as suggested by the physical attacks on hospital aid workers, or the raids to return sick family members to their homes.
The current offering of big data is that we have plenty of data to fuel a growing panic and no data to assist in slowing the spread of the disease. When it comes to the Ebola epidemic, big data is a paper tiger.
This article provides some details about the village left by the man who brought Ebola to the US.
All the cases, including Duncan’s, appear to have started with Williams, though some wondered how a pregnant woman who stayed at home could have contracted Ebola. Maybe it was her boyfriend, who hasn’t been seen in weeks, they said. Or could it have been her close friend known as Baby D, who has since died herself?
And also this quote,
“Does anybody know the taxi number or the license plate?” one man called into the crowd. “We need to find this vehicle!”
These are the kinds of observations that we are lacking: observations that suggest we do not yet understand all modes of transmission for the disease.
Update 10/27/2014: A BBC News article presents several examples of where data scientists are getting involved. These are mostly about leveraging existing data instead of identifying potential analysis that would require new investments for better data. For a disease like Ebola, we need data that reaches deeper into poorer areas, which are less likely to have things that leave data trails. The article concludes:
So it’s probably too early to say whether big data analytics is having a meaningful impact on the rate and spread of the disease, but at least it is helping us decide where to allocate our resources.
I agree that it is probably not having any impact beyond helping to allocate resources. The analytics ideas are hampered by the lack of data, and the ideas presented are not compelling enough to identify what kinds of data would be worth the expense to obtain.