I have been discussing the need for routine examination of data being fed into historical data stores (such as big data) because of the various ways that recorded data may mislead us or obscure the reality we intend to observe. Over several posts, I identified different categories of data quality issues that deserve this routine attention. This post is a list of the different categories as I have described them so far: Bright Data, Dim Data, Unlit Data, Forbidden Data, and Dark Data.
I defined bright data as observations that are well documented and well controlled. The documentation and control assure us that the observation is precisely of the desired phenomenon and that phenomenon alone. This is the ideal version of data we use to design our data storage systems. All the effort to assure the quality of the data is done up front before offering the data to the data store. As a result, there is little need to periodically review this data. This is analogous to accessing data from high quality scientific experiments with rigorous standards and thorough peer review. This is highly trusted observation data.
Realistically, observation data will have some flaws in documentation and control. This leads to some concern that the recorded observation may not be exactly what we want or may be biased by confounding information. Most bright data is dim to some extent. This is especially true for routine recurring systems to collect data. Over time, these systems will degrade in some way, whether through sensory apparatus wearing out, routine upgrades of the hardware or software, or operational configuration changes that alter the quality of the measurements.
Because the data collection path is still very well documented, it is easy to design processes and procedures to periodically check for possible degradation or to report on the quality of the observations. These quality controls can be highly automated with only a small amount of periodic routine monitoring with occasional spot actions to trigger troubleshooting tasks. Even these occasional troubleshooting tasks are generally routine at least to the point where the schedule and budget can be predicted with reasonable accuracy.
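As a rough illustration of the kind of automated check described above, here is a minimal Python sketch. The function name check_feed, the seven-day staleness window, and the three-sigma drift limit are all assumptions chosen for the example, not values taken from any particular system.

```python
from datetime import datetime, timedelta
from statistics import mean


def check_feed(timestamps, values, baseline_mean, baseline_stdev, now,
               max_gap=timedelta(days=7), max_shift_sigmas=3.0):
    """Return a list of warnings for one well-documented data feed.

    The thresholds are illustrative assumptions; a real feed would take
    them from its own documentation.
    """
    if not timestamps:
        return ["no observations received"]
    warnings = []
    # Staleness: has the source stopped reporting recently?
    if now - max(timestamps) > max_gap:
        warnings.append("feed is stale")
    # Drift: has the average moved away from the documented baseline?
    shift = abs(mean(values) - baseline_mean)
    if baseline_stdev > 0 and shift / baseline_stdev > max_shift_sigmas:
        warnings.append("mean has drifted from baseline")
    return warnings
```

A scheduled job could run a check like this on every feed and hand only the non-empty warning lists to a human analyst, which keeps the routine labor small.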
Dim data is simply bright data as it is realistically available. I described bright data separately as a way to establish an ideal form of data that can be completely automated with no need for routine or recurring human labor. Dim data is realistic bright data that needs a modest amount of routine data quality checks by human data analysts.
I’d like to relate a personal story. During my most recent trip to the doctor, he inquired about my home blood pressure readings and I said they were around ideal. He then asked when was the most recent time I measured it. I bluffed and said not too recently. He then asked how frequently I measured it, and I said just a couple times a year. When I did measure it, it was fine. That measurement was an acceptable measurement such as it was. The problem is that it was not recent nor was it frequent enough to see trends. The doctor performed the job of the data scientist to review the data collection. In earlier visits, I provided written logs of very frequent measurements so he could have assumed I was still doing this. He still inquired and found out that I was no longer giving him the information he wanted: measurements a couple times a week. The data had dimmed and he caught it. This is analogous to the need for periodic review of any dim data.
The remaining types of data are very different.
Unlit data refers to data available in the observations where the particular data items lack any documentation and control. Unlit data is delivered to the input of the data store and thus is a candidate for addition to the data store. In fact, most raw big data stores of unstructured data will automatically ingest this data.
Discarded Unlit data
In my last post, I made an analogy that unlit data is like packing material in a shipping container. It is not the product we want to extract from the container but it does take up the space inside the container that is not occupied by that desired product. Unlit data is meant to be thrown away. Except for the suck-in-everything unstructured data stores, a structured data store will ultimately have no locations to record this unlit data so this data will get discarded.
At this time, I don’t see a need to scrutinize discarded unlit data. This is data that was not requested and it lacks documentation for why it might be useful. Given those conditions, it seems perfectly legitimate to simply ignore it and discard it. Scrutiny is required if we decide to use it.
Unintentionally Added Unlit Data
There is an inescapable possibility of unlit information hiding within structured data. One example is a fixed size field that is used to capture a free text entry from a human operator. This entry may have some flexibility to allow for human inputs. This flexibility will permit typographic errors, format errors, or grammatical errors that may be exploited. For example, this information may be used to identify problems with the data entry system or the training of the operators. Often, unlit data takes the form of comments intended for communicating between operators and not meant for long term storage. Being embedded in a data field that is accepted into the data store, such unlit data is unlikely to be filtered out. Commonly this type of information opportunity is only recognized long after the data has already been entered into the data store.
Often a new question comes up in analysis that asks to find information in the past data where the intended information has never been expressly built into the system. An analysis of the data fields may reveal a clue available within a data field that can be useful in answering a question. The analyst will use this information to answer the question, but we presume he will make the effort of verifying the correctness of this interpretation or provide documentation to alert everyone of the unconventional approach.
A possible example may be to see whether a certain type of operator error is more common for males or for females. There is no field for the sex of the operator but the operator’s name is available. One may infer the sex of the operator from the first name. This would be unintentionally added unlit data and the analyst will need to assume full risk of this assertion. In other words, the analyst accepts an additional research burden for using this type of information compared with his normal duties using the documented data fields. That doesn’t mean he will.
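One way the analyst might carry that extra burden is to keep the inference visibly separate from documented fields. The sketch below is only an illustration: the lookup table NAME_TO_SEX, the provenance labels, and the function names are hypothetical, and a real name list would need its own validation.

```python
# Hypothetical lookup table, invented for this example only.
NAME_TO_SEX = {"mary": "F", "linda": "F", "james": "M", "robert": "M"}


def infer_sex(operator_name):
    """Guess sex from the first name and say where the guess came from."""
    parts = operator_name.strip().split()
    if parts and parts[0].lower() in NAME_TO_SEX:
        return (NAME_TO_SEX[parts[0].lower()], "inferred-from-name")
    # Never force a guess: an unknown stays unknown.
    return (None, "unknown")


def coverage(operator_names):
    """Fraction of names for which the inference produced a value."""
    if not operator_names:
        return 0.0
    hits = sum(1 for n in operator_names if infer_sex(n)[0] is not None)
    return hits / len(operator_names)
```

Reporting the coverage alongside any result tells readers how much of the conclusion rests on the undocumented inference.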
Intentionally added unlit data
Unlit data may be seen as useful at the design stage of a data store. Although the unlit data is undocumented or not well understood, empirical analysis of data samples may identify possibilities of using the unlit data to fill in a gap left by unavailable better documented data. In contrast to the unintentionally inserted data, this research labor burden is placed on the data science designer instead of the end-user analyst. The designer will derive a possible use of the unlit data and test it against the available sample data. The designer also will design algorithms to quantify how well new data conforms to his assumptions and add functionality that will raise alarms or reject this data when there is evidence the assumptions were wrong.
An example is finding a piece of information available in a product’s configuration file where that file was meant only for internal use of that product. The vendor informs users that the file is unsupported and they reserve the right to change any format or content in that file without any prior notice. However, the product has been around for some time and is very stable so there is a reasonable risk-benefit trade off to try to exploit this information while adding tests to be sure that something hasn’t changed. The fact that the data is undocumented and unsupported makes it somewhat unlit. However, if the value is traceable to an operational variable, then it is somewhat lit because the system’s behavior depended on that value.
The example I had in mind came from when I was on the vendor side and I was providing support for a customer who was trying to use our configuration file. They were using it to extract some information that was useful to them but that our product did not yet use. This record provided some information but the system ignored it. Taking advantage of undocumented, unsupported, and unused data gets closer to the concept of unlit data. This unlit data was exploited on the design side, but in this case by the customer’s data project.
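The protective tests a designer adds around such a file might look like the following sketch. It assumes a simple key=value text format, which is purely an assumption for illustration; the exception name UnlitDataAlarm, the function name, and the key poll_interval used in the test are also hypothetical.

```python
import re


class UnlitDataAlarm(Exception):
    """Raised when an undocumented field stops matching our assumptions."""


def read_unsupported_value(config_text, key, expected_pattern):
    """Extract one value from key=value text, or raise an alarm.

    expected_pattern is a regular expression capturing what the designer
    observed in sample files; it is an assumption, not a vendor guarantee.
    """
    for line in config_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            value = value.strip()
            if re.fullmatch(expected_pattern, value) is None:
                raise UnlitDataAlarm(f"{key!r} has unexpected value {value!r}")
            return value
    raise UnlitDataAlarm(f"{key!r} not found; the file format may have changed")
```

Raising an alarm instead of silently storing a default keeps a vendor's unannounced format change from quietly corrupting the historical record.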
I described forbidden data as observed data that falls outside of reasonable bounds and thus is discarded. One example is outlier data that is outright discarded because it falls so far outside of reasonable limits.
A more frequent example is the replacement of observation data with a smoothed version of that data. The forbidden data is that difference between the actual observation and the submitted smooth version.
Take the example of a histogram of observations. We may use a normal distribution that fits the mean and standard deviation of the observed data. But the actually observed histogram will have bins with counts above or below the normal curve. That difference is forbidden data in this model. There may be very good reason to assume a normal curve, but we should be aware that we are rejecting the differences between the observations and the smoothed value. Those differences might be very important.
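To make the histogram case concrete, the sketch below computes the per-bin difference between the observed counts and the counts a fitted normal curve would predict. It is only an illustration using Python's standard statistics module; the function name and the half-open binning convention are my own choices, and it assumes the values are not all identical (the fit needs a nonzero standard deviation).

```python
from statistics import NormalDist, mean, pstdev


def histogram_residuals(values, bin_edges):
    """Observed count minus normal-expected count for each bin.

    Bins are half-open [lo, hi). The normal curve is fitted to the mean
    and population standard deviation of the same values.
    """
    fitted = NormalDist(mean(values), pstdev(values))
    n = len(values)
    residuals = []
    for lo, hi in zip(bin_edges, bin_edges[1:]):
        observed = sum(1 for v in values if lo <= v < hi)
        expected = n * (fitted.cdf(hi) - fitted.cdf(lo))
        residuals.append(observed - expected)
    return residuals
```

Storing these residuals next to the fitted mean and standard deviation would preserve exactly the information that the smoothed version throws away.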
Forbidden data is actual observed data that we decide should not be added to the historical data store. This may be either an intentional choice (rejecting outliers typically is by design) or unintentional (replacing observations with smoothed versions is often unintentional). In either case we used some assumption or model to set the limits of allowed data. Very often, these models are very highly trusted and reasonable to use. The problem is that of imposing a presumed model on observed data. If our goal is to find something new about the real world, the model of our prior expectations is distorting what we can study. We can end up confirming our assumptions because our assumptions are throwing out the contradictory information.
I argue that these models to set thresholds or to smooth the observations deserve continuous scrutiny and suspicion. One case to be made is that the world is always changing so that some models may become obsolete. Models can also fail when expanding data sets: the more data that is available, the more likely it is that old models will no longer be valid, or the more likely another model may be more appropriate.
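One possible way to keep intentionally rejected observations open to later scrutiny is to set them aside rather than delete them. The following is a minimal sketch under that assumption; the function name and the record layout are hypothetical.

```python
def split_by_limits(observations, lower, upper):
    """Separate observations into accepted values and quarantined records.

    Each quarantined record keeps the rejected value together with the
    limits that rejected it, so the limits themselves can be reviewed.
    """
    accepted, quarantined = [], []
    for value in observations:
        if lower <= value <= upper:
            accepted.append(value)
        else:
            quarantined.append({"value": value, "limits": (lower, upper)})
    return accepted, quarantined
```

If the quarantine grows faster than expected, that is itself evidence that the model setting the limits may have gone out of date.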
Having input processes that can reject observations or replace observations with smoothed versions should imply a need for routine labor of specialized data science analysts. This is a cost that we don’t expect for bright or dim data and often we don’t adequately budget for this if we budget for it at all. This routine and recurring detailed data analysis occurs at the data input end far removed from the end-user analyst that provides the value-added products.
Model generated data (Dark Data)
The bulk of my concerns are with artificial data generated by models to fill in for gaps in data, either missing data points or entire data types. These are prevalent in data systems that try to present a human understandable interpretation of the data. We know something must be present where there is a missing observation or a missing type of data. To make the reports complete, we fill in this data with invented data that our models, hypotheses, or assumptions tell us must have occurred. Frequently, model-generated data is introduced at the final report stage in order to complete the picture in a way that emphasizes our confidence in the results.
I spent many prior posts exploring this problem and undoubtedly I will return to this topic many times in the future. I use the term dark data to make the analogy to astronomy’s dark matter to explain motions of galaxies or dark energy to explain motion of the universe as a whole. These are extreme cases where these assumptions have survived an extensive amount of peer review and criticism so there is wide acceptance that the assumed data exists even though we lack any direct observations. Most real world counterparts don’t enjoy that level of scrutiny.
Another important lesson is in the astronomy example. Even though these assumed values are widely accepted, the community continues to invest in highly specialized analysis to test these assumptions. The lesson for data science is that making these assumptions should also imply a commitment to a significant recurring cost. We should never be satisfied with model generated data until we can find bright-data type of observations to confirm it.
This lesson is often lost in everyday historical data projects.
Dark data is a necessary component to a historical data project. But using it should necessarily commit us to routine rigorous analysis to improve the model and to seek out bright-data observations to obviate the need to invent data.
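A small step toward that commitment is to mark every invented value so it can be found, reviewed, and replaced later. The sketch below uses linear interpolation as a stand-in for whatever model a project actually uses; the function name and the (value, is_modeled) pair layout are assumptions for illustration.

```python
def fill_gaps(series):
    """Fill interior None values by linear interpolation.

    Returns (value, is_modeled) pairs. Gaps at either end are left as
    None because there is no observation on both sides to model from.
    """
    known = [i for i, v in enumerate(series) if v is not None]
    filled = []
    for i, v in enumerate(series):
        if v is not None:
            filled.append((v, False))  # a real observation
            continue
        before = [k for k in known if k < i]
        after = [k for k in known if k > i]
        if not before or not after:
            filled.append((None, False))  # nothing invented here
            continue
        a, b = before[-1], after[0]
        estimate = series[a] + (series[b] - series[a]) * (i - a) / (b - a)
        filled.append((estimate, True))  # model-generated (dark) value
    return filled
```

With the flag stored alongside the value, a report can show how much of its picture is observed and how much is dark, and a later bright observation can simply overwrite the flagged entry.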
Dividing data in this way is useful in the planning and design phase of historical data systems. Identifying different species of data can inform us of additional resources needed for less trusted data and that includes budgeting for routine skilled labor to scrutinize whether new data is still within our original design assumptions.