In earlier posts I proposed a division of human inquiry in terms of the age of the data involved. My point was to emphasize a fundamental change in our approach to data when the data changes from present-tense operational data to historical data.
I suggested the approach taken with present-tense operational data is best exemplified by scientific experimentation. The experimenter carefully controls and documents the measurements that become data. I suggested that this is applicable broadly to all aspects of our interaction with the physical world in the present. A more general way to describe this type of data is to call it operational data. Even in day-to-day operations, we are concerned about the fidelity of the data with the immediate real world.
Measurement data has a limited period of time when it is relevant for operations, or for interacting with the physical world. At some point, the measurement data is replaced with fresher measurements or we recognize that the measured data is no longer current. Expired data are still valuable as records of what occurred at a particular time, but they are no longer relevant for the current conditions.
Much like the very diligent experimental scientist, we generally invest a lot into having high confidence that the data is representative of the real-world phenomena being measured. The systems built around this data allow us to interact with the world in a way that benefits us. However, in most operational systems we tolerate more ambiguity and lower quality of data in order to optimize cost effectiveness. We are willing to accept some error or ambiguity to lessen human workload (simplifying data entry, for example) or to allow use of less expensive sensors.
The point I want to make in this post is that operational systems are designed to tolerate these less ideal data sources. These systems include feedback mechanisms that constantly cross check the data with the physical world. These cross checks occur in a variety of ways that allow for corrective actions to occur. Feedback involves comparing a calculated prediction with a new measurement. The system uses this difference, or error, to make corrections in the measurements. My background in electrical engineering included the study of control systems, a field that has many mathematical models for this type of feedback.
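To make the idea concrete, here is a minimal sketch of that kind of feedback in Python. It is only an illustration under my own assumptions: the function names, the `gain` parameter, and the sample numbers are hypothetical, and a simple proportional correction stands in for the richer models found in control theory.

```python
def feedback_correct(estimate, measurement, gain=0.5):
    """Nudge the recorded estimate toward a fresh measurement.

    The error is the difference between the new measurement and what
    the system currently believes. The gain (a hypothetical tuning
    value between 0 and 1) sets how much of that error is corrected.
    """
    error = measurement - estimate
    return estimate + gain * error


def track(initial_estimate, measurements, gain=0.5):
    """Apply the correction once per new measurement and keep the history."""
    estimate = initial_estimate
    history = []
    for measurement in measurements:
        estimate = feedback_correct(estimate, measurement, gain)
        history.append(estimate)
    return history


if __name__ == "__main__":
    # The recorded value starts out wrong (say, a data-entry error of 50.0),
    # while repeated fresh observations of the subject all read 70.0.
    print(track(50.0, [70.0, 70.0, 70.0, 70.0]))
    # prints [60.0, 65.0, 67.5, 68.75]
```

The bad starting value does not need to be prevented in advance. As long as the subject remains available for new measurements, each pass through the loop shrinks the error.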
Most operational systems have error-prone measurements with some type of feedback to obtain corrective adjustments. Feedback loops are not necessarily described as the mathematical feedback models used in electronics.
This concept occurred to me during a recent discussion about bad data in health care. The discussion focused on technical solutions that create ever more rigorous controls or rules to prevent bad data from being introduced at all. I observed that while this is definitely an important project, I doubt it is ever possible to eliminate all possibilities of bad data.
Although there has been a lot of recent debate about improving health care, modern health care is very successful in delivery of services. This success suggests that the relevant data must be pretty good most of the time. My observation is that feedback mechanisms are partly responsible for the success. Delivery of health care to a current patient has the benefit of the patient being immediately available to monitor with fresh observations. As a patient moves through their service, there is a continuous rechecking of recorded data against the actual patient.
An example I recall was the very last moment before being sedated for a medical procedure. Despite all of the prior preparation, they asked me to state my name and the reason why I was there. This is a feedback loop. They had all of this information. I was in their control from the moment I was admitted, and yet there was still this very simple check. I’m sure there are rare instances where something goes wrong and the wrong person is presented to the wrong doctor for the wrong procedure. I’m sure the error gets recorded and evaluated when it is caught. The caught error does not interfere with the patient getting successful treatment. The patient got good health care despite bad data.
The data systems are always improving. The discovered error is corrected so that the final record is correct. Successful delivery of health care service strongly suggests the data ultimately is good data. This applies even when the patient’s condition is not helped by the procedure. The data is reliable.
In earlier posts, I described a taxonomy of different families of data quality. One of the data families is what I called unlit data. This is available data that is not operationally relevant. When data has some operational relevance, there is an implication that there is some kind of feedback involved to check that the data matches reality. In contrast, unlit data is a measurement that has no such check. Unlit data is available just like other data, but it deserves additional suspicion because it has not been verified.
Consider again my experience with the medical procedure where they asked me my name before being sedated. Also included in their medical record was a whole lot of other data, such as my assertion that I never had chicken pox as a child. This data is included in the same record as my name and the reason for my procedure. The problem is that there was no independent verification or check about whether I had a history of a certain childhood disease. There was no need for that verification because it was not relevant to the procedure. The procedure was successful, so all of the data involving my particular case gets recorded as a successful outcome.
The data ceased to be operationally relevant when the procedure was completed and I was released. Assuming there is no need for followup, the record is now a historical record. It includes the fact that the procedure was successful for a patient with a certain name (confirmed at point of delivery) and with an assertion of not encountering chicken pox as a child.
The earlier mentioned discussion about the problem of bad data was in the context of big data for health care. In this context, big data refers to the compilation of all records of past procedures. These historical records are no longer operationally relevant. The patient has long since been released and is no longer available to cross check the data.
This is the distinction I’m making between present-tense data and past-tense data. Present-tense data is effective because the subject remains available for new measurements to confirm older data. In contrast, past-tense data no longer has the opportunity to obtain new measurements. The data in a big data repository is historical data.
The promise of big data for health care is that it may lead to opportunities to discover new ways to improve health care or optimize allocation of scarce health care resources. Analysts can query the big data for anything in that data to find new patterns that suggest new outcomes.
In earlier posts about predictive analytics, I described this rapidly developing technology as being able to handle massive datasets and find patterns among many variables that would be impossible for humans to find without those algorithms. The point I tried to make in those posts is that these patterns may involve a large number of contributing variables where no single one of those variables alone would exhibit the pattern. In my example procedure, perhaps an analysis would discover a prediction that patients of a certain age, sex, weight, height, skin pigment, and history of chicken pox do not benefit from the procedure. Perhaps that last item is what was necessary to make the pattern significant. That last item was unchecked data. The data had nothing to do with the operational delivery of the service and yet became part of the big data for discovering a pattern.
The distinction between present data and past data is that with present data we still have access to the subject to obtain clarifying or cross-checking information. This access is not available to past data. The point of the above example is that past data will include a wide variety of information types that will include data that was cross-checked at the time of entry and data that was not cross-checked (unlit data in my terminology).
Once we admit that historical data can include both types of data, we lose trust in all data. The fact that some data may not have been verified raises the possibility that any data in the historical data may not have been adequately verified. We need to know more about the data than just its measurement. We need additional information about the cross-checking that occurred. Whether the cross-checking resulted in a correction or not, we need a record that the cross-checking occurred. Lacking that record, we have reason to suspect the validity of that data.
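One way to picture what such a record might look like is the sketch below. It is purely illustrative and not a description of any real medical records system: the class names, field names, and sample values are all hypothetical. The point is only that the evidence of cross-checking travels with the measurement, so that unlit data can be told apart later.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Verification:
    """Evidence that a value was cross-checked against the subject."""
    checked_at: datetime   # when the check happened
    checked_by: str        # who signed off on the check (hypothetical field)
    corrected: bool        # whether the check led to a correction


@dataclass
class Observation:
    """A recorded value plus the optional record of its cross-check."""
    name: str
    value: str
    verification: Optional[Verification] = None

    @property
    def is_unlit(self) -> bool:
        # Unlit data: recorded, but never checked against reality.
        return self.verification is None


def defensible(observations):
    """Keep only observations that carry evidence of a cross-check."""
    return [obs for obs in observations if not obs.is_unlit]


if __name__ == "__main__":
    # Hypothetical sample values for illustration only.
    name = Observation(
        "patient_name",
        "Jane Doe",
        Verification(datetime(2014, 1, 1, 8, 30), "attending nurse", False),
    )
    chicken_pox = Observation("childhood_chicken_pox", "no")
    for obs in defensible([name, chicken_pox]):
        print(obs.name)
```

In this sketch the name confirmed at the point of delivery survives the filter, while the unverified childhood history does not, even though both sit in the same record.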
The issue of bad data is not merely that the data is wrong. Correct data may still be bad data if we cannot trust it. I tried to clarify this in earlier posts by alluding to the professions that work with historical data: historians, investigators, auditors, etc. A common trait of these professions is the debating of the strengths and weaknesses of a case. There is almost always an opposing party that will identify weaknesses in the case. The opposing party will exploit any opportunity to cast doubt on any data used to support the case.
Bad data is any data that cannot be adequately defended for its trustworthiness. In my above example, how can we be certain that the patient involved was really me? That trust needs much more than a record with my personally identifiable information. That trust needs the additional evidence that my identity was verified right up to the point of service delivery. We need some signature to confirm that the verification did occur. Without that evidence, even basic information can be bad data. This is data that is vulnerable to being discredited in argument.
In an earlier post, I described the need to approach historical data in the same way that law approaches presenting a case to a court. Both share the necessity of eliminating doubt to a certain standard. Unfortunately, current practice ignores this reality. We assume that since historical data propagates from operational data, then there is no room for reasonable doubt. We get away with that assumption because that data has not yet been challenged.
I made a point earlier that any decisions based on historical data will be subject to legal or civil challenges. When that happens, the data will indeed be subjected to the same standards of evidence as any other case. A lot of data in big data projects will not stand that kind of scrutiny.
Consider for example a decision based on big data to allocate resources in a way that denies a service to an individual. That individual can challenge the decision by showing that the relevant data is not trustworthy.
There is a fundamental change that occurs when data moves from operational to historical. That change involves how we can challenge the data. Operational systems challenge data with some form of feedback from continued access to the subject. Lacking such feedback options, historical systems challenge the trust in the data. That trust deteriorates as soon as the data becomes historical data. Without adequate supporting data, historical data is vulnerable to attacks on its trustworthiness. Data with vulnerable trustworthiness is bad data.