In previous posts, I described dark data (which I define as model-generated data) as a form of ignorance. We substitute dark data for missing observational data. The missing data may be occasional gaps resulting from a sensor failing or from there being no sensor at a particular location. Often the missing data is an entire variable that lacks any practical form of sensor. Our theories or understanding of the situation may convince us that such a variable must exist even though we have no way to measure it. With that certainty, we compute predictions of its values based on other measurements plus our understanding of, or assumptions about, how the desired variable relates to those measurements.
I distinguish all forms of model-generated data from direct observations. Even minor data smoothing (in an attempt to remove noise or random errors) substitutes model-generated data for observational data. Another minor form of model-generated data is a simple interpolation between two observations to estimate an intermediate observation. These operations occur so frequently that a major part of our source data consists of model-generated substitutions for observations. While I recognize the need for models to massage data in preparation for analysis, I prefer to have at least one source of data that contains only the actual observations, uncontaminated by any assumptions. That data store is the closest representation we have of what is happening in the real world.
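To see why even a mild smoothing pass counts as model-generated data, consider this minimal sketch (the readings are invented for illustration). Every smoothed value blends neighboring observations, so each output is a small model's estimate rather than anything a sensor reported:

```python
def moving_average(values, window=3):
    """Centered moving average; each output blends neighboring observations."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

raw = [20.0, 21.0, 95.0, 22.0, 20.0]   # one genuine spike in the observations
print(moving_average(raw))             # the spike is diluted across its neighbors
```

After smoothing, the genuine 95-degree spike no longer appears anywhere in the series; the "observation" at that position is now an average that assumes the spike was noise. That assumption may be right, but it is an assumption, and keeping the raw series preserves the option to revisit it.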
I want to distinguish direct sensor observations from observations that we force to conform with our assumptions. I want access to a data store that is free from any prior human assumption. I recognize that the very construction of sensors involves some degree of assumption about how the real world maps to measurements. This is even true of our natural senses, which evolved to make physiological signals correspond to the real world. My goal for a pure observation data set is at least to prevent further contamination of a valid observation (or a valid missing observation) with assumptions about the world. Those assumptions should come later.
This demand for unmodified observational data is especially important when considering big data projects involving a large variety of data. In simple studies of a single variable, there is little harm in applying some smoothing or interpolation. The same operation can become a problem in big data. For a large variety of different types of data, we need some reliable way to conform the data by identifying ways to match corresponding data from different sources. When the goal is to find patterns among different types of observations, we need to be certain we are matching the observations correctly.
If I have a measurement of the temperature of a cooking surface and a separate image of a pot of boiling water, the observation that the cooking surface caused the water to boil requires assurance that the temperature was measured at the same place and time as the image of the pot. Even for direct sensor observations, this conformity of different data sources is a challenge. In a previous post (here), I described the challenge of conforming even the time stamps of different measurements: the time stamp may mark the start of the observation, the end, or the time when the observation was delivered to the data store. Simple operations such as smoothing or interpolation confuse the time stamp because the recorded value actually includes information from past and possibly future observations.
For the example of interpolation, the observations occurred at different times, but we conform the interpolated time stamp to some other observation. In the boiling-water example, perhaps I had measurements of the heating surface at 10:00 and at 10:30, both indicating cold temperatures, while the pot's image was taken at 10:15. We may interpolate a cold surface temperature at 10:15 in order to conform to the image's time stamp, and then conclude that cold surfaces can boil water. Given the increasingly precise analytics we attempt with big data, this example is not an exaggeration. We need to know the precise time of each measurement so that we match the correct observations, or else recognize the uncertainty that results from a missing observation. Big data analytics (over a large variety of data) can easily be misled by model-generated data that superficially makes each single variable look better when examined alone.
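The boiling-water pitfall can be made concrete with a toy sketch (all values here are invented). The interpolation itself is arithmetically correct; the error is in treating its output as an observation contemporaneous with the image:

```python
def interpolate(t0, v0, t1, v1, t):
    """Linear interpolation between two timed observations."""
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

# Two genuine readings of the cooking surface, in minutes past 10:00.
surface_readings = [(0, 25.0), (30, 24.0)]   # both readings are cold (Celsius)
image_time = 15                               # pot photographed boiling at 10:15

t0, v0 = surface_readings[0]
t1, v1 = surface_readings[1]
estimate = interpolate(t0, v0, t1, v1, image_time)
print(estimate)   # a "cold" surface value now conformed to the image's time stamp,
                  # inviting the false conclusion that cold surfaces boil water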
My central complaint about model-generated dark data is that it substitutes our assumptions for what we failed to observe in the real world. This substitution defeats our goal of discovering new hypotheses about the real world. When given model-generated data, the correct conclusion of the analytic algorithm is to confirm our assumptions without adding anything to our understanding. To discover new truths about the real world, we need to avoid contamination by our assumptions.
The above argument about contemporaneous data is really a separate complaint about dark data. In contrast to confirming our assumptions, mismatched time stamps can lead to conclusions that both contradict our assumptions and mislead us about the real world.
In the simple single-variable smoothing or interpolation examples, the confusion of time stamps is minor. These errors are manageable as long as we are aware of their effects. Either we design algorithms to take them into consideration, or we recognize some loss of confidence in the analytic results. Many cases of dark data involve much larger-scale time blending. This occurs when dark data is the result of computer simulations. The simulations take time to set up and to run, so the simulation results are presented long after the input data were observed.
While I appreciate the value of simulation results for planning-type decision-making, I do not want simulation data to contaminate the observational data. Simulation results are strongly influenced by our assumptions. The motivation for simulation is to apply our understanding to some recent data to see what more we can predict. Simulation results need to stay out of the observational data because they will only reinforce our prior assumptions, which may in fact be wrong.
In my last post, I gave some examples where recent observations about Ebola appear to contradict earlier assumptions about this disease. These observations would be obscured if we allowed the prior assumptions to smooth them. This obscuring may be happening now in explanations of recent reports of a declining or stabilizing number of new cases in Liberia as the result of a collapse in the health-care capacity needed to report those cases. A major justification for this explanation is the contradiction with the models, which predicted the number of cases should be much higher. I also described this tendency to trust the model despite observations when I discussed (here) the attribution of human error in following protocols to explain how nurses at first-world hospitals became infected with the virus. In those cases we also lacked observations, but we concluded human error primarily based on our confidence in health-care protocols.
For this post, I want to focus on the problem of conforming the time stamps of simulated data with observational data. In an earlier post (here), I distinguished the useful role of simulation for planning purposes, predicting future outcomes, from the disruptive role of simulation in providing substitute data for missing observations. In that post, I distinguished the two as simulation results and dark data. Both come from models, but the distinction is in how we use the results: simulation informs planners of future possibilities, while dark data provides data where we lack observations.
An inherent problem of dark data is that it inserts, as a current observation, a result computed from much older information. This introduces the problem of data that is not contemporaneous with other observations. This mismatch of time interpretation makes conformity with other observations very difficult, or as misleading as my example above of interpolating a cold cooking surface under a pot of boiling water.
In my own experience, I recall many cases where a simulation effort based on observations from two weeks earlier would be challenged to show relevance to something that happened more recently. Often, current events have made the old observations irrelevant. Asserting the relevance of the simulation results implicitly claims that those results represent what is happening right now. If a simulation predicts the future in two-week iterations, then the first iteration would be the current time at the moment the results are presented. The information that the simulation claims as a current observation will need to be reconciled with other observations. In this case, the simulation's claim of relevance should be rejected: even if its assumptions are sound, the data it uses is old and no longer relevant.
The latest real-world observations take priority over the simulation results. Or at least, that is my opinion. In practice, it appears we frequently assign priority to the simulation results over the observations, so that we suspect something is wrong with the observation instead. Sometimes we implement this formally as part of a data-cleansing process that rejects forbidden data because it doesn't meet expectations. This priority of models over observations is how we can declare human error for unintentionally violating protocols despite the individual's own observation that they followed the protocols correctly.
A recent example of simulation data concerns the Ebola crisis. This announcement presents simulation results showing that more resources are needed in Liberia to prevent a rapid expansion in the number of Ebola cases and deaths in the coming months. Predicting the quantity of resources and their optimal placement is a valid use of simulation and modeling. We know we need more beds and sanitary supplies such as sterile gloves and aprons. We know we will have to allocate limited resources to the locations where they can best head off the growing epidemic. Simulation results like these can help in that planning.
I object to the article’s claim of current facts. The article starts with the assertion:
Without a massive scaleup of aid, the Ebola epidemic in Liberia will explode, according to a computer simulation published Friday of the nation’s most populous county.
I didn’t read the study, so perhaps this is the opinion of the journalist rather than the study’s authors. But it presents the simulation results as a fact: there will be an explosion of new Ebola cases without massive new investment in aid. This uses the simulation as a source of data for a missing observation; we do not actually know whether the epidemic will rapidly grow or is starting to stabilize. The simulation model intentionally assumed a rapid increase in cases in order to project how many supplies would be needed. It is incorrect to use this result to claim, as a fact, that the increase will occur without that aid. We don’t know what is happening.
The following statement by Alison Galvani restates the message with more uncertainty while still asserting the urgency for immediate action:
Although we might still be within the midst of what will ultimately be viewed as the early phase of the current outbreak, the possibility of averting calamitous repercussions from an initially delayed and insufficient response is quickly eroding.
We might be in the midst of an early phase of a much larger outbreak, or we might be seeing the end of an outbreak. From what I can tell, we do not know this answer.
The call for more aid to acquire resources is a valid suggestion for planning purposes. But this is distinct from establishing as fact that we are currently in an early phase of a much larger epidemic. The early phase of a larger outbreak is an assumption required for running the model, not a given fact. The study makes a point of estimating a high-side R0 of 2.49 (instead of 2 or fewer) and of assuming that conditions observed in one location in mid-September are typical and will not improve without aid.
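The R0 assumption alone moves the projections dramatically. As a deliberately simplified sketch (this is not the study's model; it treats each serial interval as one generation of purely multiplicative growth, and all numbers are illustrative), compare cumulative cases under R0 = 2.0 versus the study's high-side 2.49:

```python
def projected_cases(initial_cases, r0, generations):
    """Cumulative cases after N generations of simple multiplicative growth."""
    total, current = initial_cases, initial_cases
    for _ in range(generations):
        current *= r0
        total += current
    return total

# Starting from 100 cases and projecting six generations forward:
for r0 in (2.0, 2.49):
    print(r0, round(projected_cases(100, r0, 6)))
```

Even in this crude model, the high-side R0 roughly triples the six-generation cumulative total relative to R0 = 2.0. A projection that sensitive to an assumed parameter is a planning scenario, not an observation of what is currently happening.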
As I noted in earlier posts, I see evidence that the local government and population are improvising potentially effective strategies. At least in the more urban areas, the people are educated and motivated to do what is necessary to avoid the spread of the disease. It appears this local effort is paying off because, as noted in the article:
“Reality seems to have already made this paper outdated, as the numbers of new cases seem to have plateaued in this area in the last few weeks and may even be declining,” Ferguson said in comments to Britain’s Science Media Centre.
There are multiple reports that the epidemic is not spreading as quickly as predicted earlier. These observations deserve more credibility than the simulation results. However, I have seen other reports dismiss these observations as a failure of data collection. This dismissal implies a preference for the simulation results over direct observations. This use of simulation results in competition with actual observations is what I call dark data. I trust the observations more than I trust simulated data for the same facts.
Although I prefer observations over the model results for data about what is currently happening, I still value the simulation results to inform us what further investment might be needed to be prepared if the outbreak does get worse. Clearly more resources are needed, and these require more foreign aid. There may be some argument about the model's assumptions about the rate of spread, and those arguments can refine the recommended aid levels. In any case, simulation is a useful tool for preparing for the future.
What is happening right at the moment is critical to understand in order to better prepare for the future. We need good real-world observations. The available observations hint that local conditions and practices are improving, so that the rate of spread may be slowing down. We need more data from real observations in the coming weeks to better predict how much aid will be needed. If in fact the outbreak is nearing its end, we can cut back our projections for future aid. The foreign aid might be better used for purposes other than supplies for an epidemic that no longer exists.
I am concerned that we will continue to use simulation results as a proxy for actual observations. Actual observations are very difficult to obtain, and even when they are possible, they are expensive. It is cheaper to use models to extrapolate current conditions from a few observations or from observations made some time ago. It appears we will continue to use this approach in part because we expect conditions to degrade our ability to collect good observations.
I fear this opens up a kind of circular logic where models predict conditions are so bad that it is impractical to obtain good data, so we should continue to use model-generated data to estimate what is happening. Even when field reports contradict the model's projections, we will dismiss the field reports as faulty because the conditions are so bad. In a few weeks there will be 10,000 new Ebola cases per week, because that's the best possible data: data from the models.
Even as we rely on model-generated data to estimate what is happening at the moment, we must acknowledge that these models require some input data. While there is a scientific justification for extrapolating from a statistical sample, the degraded conditions expected of a larger epidemic are unlikely to produce well-controlled random samples. Also, the sampled data will be older data that may quickly become irrelevant in a rapidly changing epidemic, particularly if the epidemic becomes as bad as predicted. Even as we continue to trust model-generated data, we still need to improve our ability to collect new sample data to keep the models relevant to actual conditions. If we had enough reliable data, we would not need to rely so much on model-generated results to tell us what is happening.
We need more data. I noted in an earlier post (here) that the big data community could become more actively involved in setting up data solutions to collect better data. Ideally, we can collect sufficient data to support big-data analytics, but at a minimum we can obtain better data for our simulations. The big-data technologies can at least improve our ability to collect new data in the areas where Ebola is currently active. From the above article on the Ebola model, there is this objection:
“I’m afraid this is an example of a study performed in too much haste and with too little attention to the epidemiological data being collected in the field,” said Neil Ferguson, a professor of mathematical biology at Imperial College London.
I suspect one reason the study is not using the most recent epidemiological data is that this data is not in a form readily available to the simulation scientists. Assuming that is the case, there may be an opportunity to upgrade this data collection to allow the data to be shared more quickly. The simulation results should incorporate the very latest data, and those results would become obsolete once newer data is obtained.
That is like the anecdote I shared above, where simulation results based on two-week-old data were dismissed because they didn't incorporate some newer observation. To be relevant in describing current conditions, the simulation results need to use the most current data known to the decision-maker audience.
In the case of Ebola data, I suspect the bottleneck is in the data flow from field reports to the form required to run the simulation. This suspicion is consistent with my own experience where preparing data in the form needed for simulations is very time-consuming. The simulation technology itself may be able to run fast enough to present new results in a short period of time. The problem is getting the current field observations into a form that is compatible with the simulation. This kind of technical problem may have a solution with the big-data technologies to promote higher velocity, meaning the faster movement of data from sensors to visualization of analytic results.
I don’t see much evidence that we are employing state-of-the-art big-data technologies in the Ebola crisis. Even in the USA, we are using inefficient techniques, such as relying on voluntary verbal reporting of analog thermometer readings from those under observation as a result of contact tracing. There are many opportunities for big-data technologies to engage in the Ebola crisis.
We are relying on model-generated data to substitute for observations that are (or will soon become) impractical to obtain. The resulting models inform current responses based on outdated information. This model-generated, or dark, data conflicts with recent field observations, and this conflict causes confusion, such as the proposition that the field observations are unreliable. To avoid this confusion, we need better mechanisms to collect and retrieve the latest data to feed those models, or even to obviate the need for models to generate substitute data for current conditions at all.