Big data is often characterized by three V-words: volume, variety, and velocity. While all three present technology problems for storage, the latter two, variety and velocity, imply something more than storage. Predictive analytics is often described as a supplementary technology to big data, but more specifically it leverages variety and velocity. The goal of analytics and its sub-discipline of machine learning is to allow algorithms to interpret data in ways that are beyond the capacity of the human mind. The information is coming too fast, or the data has too many dimensions, for the human mind to process. We seek algorithms to leverage opportunities that big data presents but that are too big for humans to handle.
In my experience working with large, diverse datasets arriving very quickly, I encountered a similar scenario of automating a learning algorithm. This algorithm had to refine a map that placed individual observations into one of several thousand categorical bins. It turned out that we had a prior map available that was good most of the time, but it occasionally had ambiguous choices where something could be in one of two or more bins. The machine learning used information from the current set of data to resolve the ambiguity. The result was a refined map that was very accurate most of the time.
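A minimal sketch of that kind of refinement, under assumptions of my own: the prior map lists one or more candidate bins per key, and the current data supplies hints as (key, bin) pairs. The names `refine_map`, `prior_map`, and `evidence` are hypothetical and the voting rule is only one simple way to resolve ambiguity, not the method the project actually used.

```python
from collections import Counter

def refine_map(prior_map, evidence):
    """Resolve ambiguous entries of a prior map using today's data.

    prior_map: dict of key -> list of candidate bins (assumed structure).
    evidence:  iterable of (key, bin) hints seen in the current data (assumed).
    """
    # Tally how often the current data points each key at each bin.
    votes = {}
    for key, bin_id in evidence:
        votes.setdefault(key, Counter())[bin_id] += 1

    refined = {}
    for key, candidates in prior_map.items():
        counts = votes.get(key, Counter())
        # Only bins the prior map allows are considered. On a tie, or with no
        # evidence at all, max() keeps the first candidate listed in the prior.
        refined[key] = max(candidates, key=lambda b: counts[b])
    return refined

prior = {"a": ["bin1"], "b": ["bin2", "bin3"]}
hints = [("b", "bin3"), ("b", "bin3"), ("b", "bin2")]
print(refine_map(prior, hints))  # {'a': 'bin1', 'b': 'bin3'}
```

Restricting the choice to the prior's candidates keeps the refinement small: the current data can only settle an existing ambiguity, never invent a new bin.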
This algorithm ran every day and made subtle changes in this refined map as the reality changed. Once a week, we replaced the default map so that the refinements would remain few in number. Occasionally, the refinement process would encounter data that made the refined map worse than the original default map. In some cases, the data was accurate, clean data, but there was an unexpected change in the real world that confused the algorithm. To address this possibility, we had a labor-intensive quality control step to check the map quality and then intervene when we detected a degradation. The intervention might have been to isolate the confusing situation and recompute the map, but this required re-running the predictive algorithms on all the data again.
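The comparison at the heart of that check can be sketched as follows. This is an illustration under my own assumptions, meant to support the analyst rather than replace him: it assumes a small sample of keys whose correct bins were verified by hand, and the names `accuracy`, `needs_review`, and `checked_sample` are hypothetical.

```python
def accuracy(mapping, checked_sample):
    """Fraction of hand-verified (key, true_bin) pairs the mapping gets right."""
    if not checked_sample:
        raise ValueError("checked_sample must not be empty")
    hits = sum(1 for key, true_bin in checked_sample if mapping.get(key) == true_bin)
    return hits / len(checked_sample)

def needs_review(default_map, refined_map, checked_sample, tolerance=0.0):
    """True when the refined map scores worse than the default map."""
    refined_score = accuracy(refined_map, checked_sample)
    default_score = accuracy(default_map, checked_sample)
    return refined_score < default_score - tolerance

default_map = {"a": "bin1", "b": "bin2"}
refined_map = {"a": "bin1", "b": "bin3"}   # today's refinement moved "b"
sample = [("a", "bin1"), ("b", "bin2")]    # hand-verified by the analyst
print(needs_review(default_map, refined_map, sample))  # True
```

A check like this only catches degradations that show up in the verified sample, which is why the analyst still has to look at the data itself.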
We learned that the algorithm could be fooled by the data. As a result, we accepted a quality control labor component in the project. The confusing scenarios described above occurred without warning, so we had to apply this labor every day for a project with a daily reporting deliverable. We needed an operator in the loop to at least approve the machine-created map and the result of its application, even though the application of the machine-refined map was automated.
I was not paying much attention to the trends of big data analytics with end-to-end labor-free automation. I didn’t think anything of needing a labor-intensive quality control step.
My thinking was biased by the earlier lessons about why we don’t have automated airline flights. For decades now, we have had the technologies of auto-pilots and instrument landings that could permit pilot-less flights. Recently, there has been a rapid adoption of pilot-less (or at least remotely piloted) drones that started off being very small but are increasing to sizes substantial enough to carry meaningful payloads. It is not hard to conceive of this technology being used to carry human passengers, but we draw the line there. Even in a largely automated flight, we want humans on board to make decisions and to approve the automated decisions.
I didn’t think much about building into the design a need for a daily effort of an analyst to review the actual data to confirm appropriate operation of the algorithms. This data science effort was more intense than a typical operator role of monitoring the system in case alarms go off. The data analyst reviewed the data for anomalies that could hint that the algorithms did something inappropriate.
Later, I was criticized for designing into the system an expectation for human labor. The alternative idea is that the design could have included more automated safeguards had I been motivated to avoid the daily cost of expensive analysis labor. I agree that the criticism has some merit, but when I started I didn’t think it was controversial to include an intelligent analyst to approve the daily automatically generated data. We were going to use this data for something constructive and that effort would be scrutinized for quality. It just made sense to me that we would invest up front to be sure the data was good (or to identify where the data may be weak) before the next stage used that data.
Even though I recognize the validity of the expectation of fully automated analytics, and that the velocity goals of large data projects cannot be hampered by human labor to approve the results, I still believe human approval is a necessary step for even high-velocity analytics. Sometimes the world can present an unexpected scenario with good, clean data, and that scenario would confuse the algorithms.
I thought of an everyday analogy from a recent shopping experience where there was a sale table piled with socks from brand X. The table had a sign that announced the sale as buy 3 brand X socks for the price of 2. Assume these normally sold for $5 each. With the sale, three would cost $10. I picked up three from the table, but when I checked out I was presented with a bill for $18. The sale worked so that the first 2 would be full price and the third would be free. It turned out that I had mistakenly picked up a brand Y sock that sold for $8.
In this example, the data (in the form of merchandise presented to the cashier) was good, clean data. The problem was that the world was not operating as it should have. I imagine what happened was that someone picked up an $8 brand Y sock and then saw the sale that said he could get two more socks for just $2 more. He dropped the brand Y sock on the sale table and picked out three brand X socks. Later, I mindlessly just picked up three socks from the sale table.
This case was resolved because I realized my mistake and managed to correct the items to match the sale.
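The arithmetic of the anecdote can be written out as a small sketch. The prices come from the story above; the function name `checkout_total` and the way a basket is represented are my own assumptions for illustration.

```python
# Prices from the anecdote: brand X is $5, brand Y is $8.
PRICES = {"X": 5, "Y": 8}
SALE_BRAND = "X"   # only brand X takes part in the "buy 3 for the price of 2" sale
GROUP_SIZE = 3     # every third sale-brand sock is free

def checkout_total(basket):
    """Return the bill for a list of brand labels such as ["X", "X", "Y"]."""
    full_price = sum(PRICES[brand] for brand in basket)
    free_items = basket.count(SALE_BRAND) // GROUP_SIZE
    return full_price - free_items * PRICES[SALE_BRAND]

print(checkout_total(["X", "X", "X"]))  # 10, the bill I expected
print(checkout_total(["X", "X", "Y"]))  # 18, the bill I was presented with
```

Both results are correct applications of the pricing rule. Nothing in the calculation signals that the second basket is a mistake.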
Imagine if this were a machine learning scenario. The machine would observe the tables in the store and note that picking various numbers of socks from this particular table resulted in a price break at multiples of three. It learns that there is a sweet spot of maximum benefit when the quantity is a multiple of three and the items are picked from this particular table. The learning is applied automatically and the operations begin making automated transactions in multiples of three. One of those transactions results in a charge of $18 instead of $10. Because the algorithm is statistical in nature, we may not find this unexpected, since we expect the result to average out over the long run. The automated approach may never object to that real-world scenario of mixing brand Y with X.
A human analyst demonstrates his value in this kind of scenario. It is an ability to recognize that this just can’t be right. One could object that this particular scenario is easy to automate by refining the algorithm. I agree, and that is what our job would be. Once we discover the possibility of a mistake, our tasks are first to correct the problem (in this case swapping the errant brand Y for the third brand X) and second to find a way to modify the algorithm so that it does not make that mistake again. Part of this human labor investment is to revise (or suggest revisions to) the algorithm to make it more robust.
The human is there because the error was not anticipated. Modern software design practices appear to accept that there is no amount of prior research that can anticipate all possible errors. One of the driving motivations for agile software development, which produces small increments of capability quickly, is to begin gathering real-world experience of operating those smaller capabilities so that we can more quickly discover the possibilities of what can go wrong (or surprise us in a beneficial way, for that matter).
My complaint about the modern agile approach is that it tends to isolate the developers (who are most aware of how things should work) from the operational environment (where things can work unexpectedly). This deliberate isolation gives the developers the opportunity to maximize their productivity in producing new software increments. The product owners or stakeholders are tasked with evaluating the acceptability of the final result.
The old-timer in me appreciates the value of an individual who participates in all aspects of the project life cycle. A developer of an algorithm can very quickly spot an errant behavior of that algorithm even though it could be explained away as “that’s just the way the algorithm works”. I don’t think most customers of products will recognize something that isn’t quite right. They are more likely to dismiss such observations as just the way it is supposed to work.
I imagine this happens very often for consumers of machine-learning algorithms. They are more likely to marvel at the cleverness of doing something unexpected than to recognize that this result is not right.
Any algorithm needs a human approval step before the algorithm’s results are applied to the next step. This is consistent with my earlier posts where I present the analogy of a chain of suppliers that successively refine the data before handing it to the next level. Like a supply chain in manufacturing, each supplier owns responsibility for delivering a quality product to its customers further down the chain. The supplier or the customer, or sometimes both, employ inspectors to approve the delivery of the intermediate product. The same concept could apply to data projects and provide a similar advantage in being able to catch the unexpected scenario that presents an unacceptable product.
In my project, I later identified a compromise that could restore velocity while still retaining a human analyst to check the results.
The original design recognized a benefit of using the very latest data to improve the map. The best way to improve the map for the current instant is to see how that data actually is used in the current instant. However, this approach meant applying a freshly computed map to process the data. As a result, the human intervention would occur after the data was processed. If the data needed reprocessing, this would be a huge disadvantage for data velocity.
The compromise was to note that the refined map created with day-old data was still very accurate when applied to the following day. Allowing this lag allowed for human approval of the generated map before the map would be used. Any processing of fresh data would always use the pre-approved map, so there would never be a need to redo the processing because we did not approve the map refinement. This particular compromise was justified by the unique circumstances of this project. We still had the opportunity to flag a degradation that could have been improved with an automatically generated map, but we agreed that the improvement was not worth the loss of velocity or the increased resource costs of reprocessing the data.
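The shape of that compromise can be sketched in a few lines. This is only an illustration under my own assumptions: the class `MapRegistry` and its method names are hypothetical, and the real system was certainly more involved.

```python
class MapRegistry:
    """Holds the approved map used for processing and one candidate awaiting review."""

    def __init__(self, default_map):
        self.approved = dict(default_map)  # the only map that fresh data ever sees
        self.pending = None                # yesterday's refinement, not yet reviewed

    def submit(self, candidate_map):
        """Store a freshly refined map for the analyst to review."""
        self.pending = dict(candidate_map)

    def review(self, approve):
        """Record the analyst's decision; a rejected candidate is simply dropped."""
        if self.pending is None:
            return
        if approve:
            self.approved = self.pending
        self.pending = None

    def classify(self, key):
        """Process fresh data with the pre-approved map only."""
        return self.approved.get(key)

registry = MapRegistry({"b": "bin2"})
registry.submit({"b": "bin3"})     # refinement computed from day-old data
print(registry.classify("b"))      # bin2, because the candidate is not approved yet
registry.review(approve=True)      # the analyst signs off
print(registry.classify("b"))      # bin3
```

Because processing never touches the pending map, a rejected refinement costs nothing downstream: there is no processed data to redo.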
Despite this compromise, there remained a fundamental disagreement about whether it is possible to eliminate the human element altogether. Ideally, the production environment should be free of interference (or cost) of skilled analyst labor. The only need for labor at the production level should be for operators watching and responding to very high-level displays of alarms of something automatically detected as being unacceptable. This production ideal demands that all of the analyst labor be segregated in the development environment fully isolated from the production environment. We demand that the analysts in development comprehend all possible failure conditions either to build better algorithms or to catch those unexpected events to raise alarms to the minimally available production operators.
I’m not convinced this is possible.
In order to achieve the three V-words of big data, we need to insist it is possible to restrict data science capabilities to the developer enclave. This high expectation is driving the presumed shortage of skilled data scientists who are capable of dreaming up fully robust and alarmed algorithms that can handle anything the world can throw at them.
I do not doubt there are analysts who are far more capable than I am in conceiving of robust designs. However, I have experienced awe at some of the ways the world finds to surprise us. I respect the natural world more than I respect human intelligence or its artifacts. I learned that we need someone there in production who is paying deep attention to that world beyond the dashboard alarms. Data science labor properly deserves investment in the production phase.
Data science labor in the production phase is what motivated my term dedomenologist: a naturalist of the datum. A naturalist observes his subject in its natural environment. We value naturalists because they’re the ones to tell us that our theories are wrong.