In many earlier posts I described my concern that we do not budget enough labor for routine data-science scrutiny of data during the operational phase of big data solutions. I described multiple ways that data can be suspect, each requiring different skills and diligence.
In other posts, I discussed my concerns about using big data at the individual level where some of that data may be less than bright data. I was specifically concerned about model-generated data applied to gaps in information. Effectively, these are assumptions turned into data points that intermingle with real observations. Once in the data store, we are likely to treat them as equal to observations. I noted that the models involved are often highly trusted, so this is not entirely unreasonable, but model outputs are not identical to actually observed data.
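One way to keep model-generated gap fills from silently blending into real observations is to tag each data point with its provenance at ingestion time. The sketch below is a minimal illustration of that idea; the `Reading` record, the `Source` tag, and the `fill_gaps` helper are all hypothetical names invented for this example, not part of any particular system.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical provenance tag: was this value actually observed,
# or produced by a model to fill a gap?
class Source(Enum):
    OBSERVED = "observed"
    MODELED = "modeled"

@dataclass
class Reading:
    value: float
    source: Source

def fill_gaps(raw_values, model_estimate):
    """Replace missing values (None) with a model's output, but keep
    the provenance tag so downstream analysis can still weight,
    audit, or exclude the modeled points."""
    filled = []
    for v in raw_values:
        if v is None:
            filled.append(Reading(model_estimate(), Source.MODELED))
        else:
            filled.append(Reading(v, Source.OBSERVED))
    return filled

# One gap in three readings; the model supplies 1.3 for it.
readings = fill_gaps([1.2, None, 1.4], lambda: 1.3)
modeled = [r for r in readings if r.source is Source.MODELED]
```

With the tag preserved, an analyst can always ask what fraction of a result rests on assumptions rather than observations, which is exactly the distinction that gets lost once everything sits in the store untagged.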
In recent posts I discussed my concerns about using big data at the individual level for law enforcement and, most recently, for determining the provisioning of health care to patients. Both cases have very strong and legitimate motivations to leverage big data technologies to realize efficient and cost-effective solutions.
My objections primarily focus on the data science issues of assuring high quality data during the operational phase of the big data project. The operational phase comes after the initial engineering and testing: the project continues to accept new data from established sources and feeds that data into the big data store.
My observation is that all but the best documented and controlled (bright) data are subject to unexpected changes in quality. There is a constant need for regular evaluation, or scrutiny, of all data to assure that they remain representative of the real world.
Compared with the big data solutions used and being developed for law enforcement and health care, the project I worked on was very small, involving only a few terabytes of data from about a dozen different sources. Despite the small size, I routinely found, on a weekly basis, significant and unexpected qualitative changes in data that needed immediate attention. These problems occurred even though all of the systems were working properly. The data itself degraded.
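The kind of weekly scrutiny described above can be partly mechanized with simple statistical checks on each incoming batch. The sketch below is one crude, assumed approach (not the method used on that project): compare a batch's mean against a baseline and flag drifts beyond a chosen tolerance; the function name and threshold are invented for illustration.

```python
import statistics

def drift_report(baseline, batch, tolerance=3.0):
    """Flag a new batch whose mean drifts more than `tolerance`
    baseline standard deviations from the baseline mean -- a crude
    stand-in for routine human scrutiny of a data feed."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    z = abs(statistics.mean(batch) - mu) / sigma
    return {"z_score": z, "flagged": z > tolerance}

# A stable baseline feed, then a batch that has quietly shifted.
baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0]
report = drift_report(baseline, [13.0, 13.2, 12.8])
```

A check like this only catches gross numeric shifts; the qualitative degradation the post describes still needs a data scientist to interpret why the feed changed, which is exactly the recurring labor being argued for.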
Based on that experience, I would expect much larger projects to need correspondingly larger data science teams to manage the incoming data. This is behind-the-scenes labor, separate from the primary mission of the project represented by analysts running reports. It is the careful analysis needed to be sure the supposed observations continue to relate precisely to the real world.
As I mentioned, most of the data available in big data stores have some kind of problem associated with them. The best possible data (what I call bright data) is actually very rare.
An example of what I would consider to be very bright data is the police radar used to enforce speed limits. Radar is a very mature technology with a long track record of precise measurement. In addition, its components have improved with solid-state technologies offering very tight tolerances and accuracy. Using radar to measure vehicles traveling at relatively constant speeds would seem very simple compared with its successful applications in far more challenging scenarios. At least to me, this would appear to be very reliable data.
Yet radar data is frequently challenged in court as possibly being wrong. The challenges are often successful if there is a lack of proof that the equipment was recently calibrated and that the operator was recently certified in its use.
In courts we have very high standards for the quality of data admissible as evidence. If something as robust as radar can be shown to have a reasonable doubt, then certainly most other types of data are subject to even more doubt.
The court standard is applicable to the above examples because they involve individuals. Using big data for law enforcement or for provisioning health care to individuals should subject all of the big data to the same high standards as the court demands for simple radar data used in prosecuting traffic violations.
To be usable in these contexts, all of the data in the big data store should be subject to the same rigorous and recurring scrutiny to confirm that the data continues to accurately represent reality.
Just like the radar example, this requires an investment in recurring labor analogous to calibrating equipment and training operators. And the radar data for a single traffic stop is tiny in comparison to the scale of big data systems.
I very much doubt these systems invest in routine data scrutiny at a level of diligence comparable to what we require of traffic radars.