During my previous assignments, I was challenged to describe exactly what my product did. I would begin by describing how I used queries to categorize data and summarize the categories in order to populate reports, complete with hyperlinks to other reports that answered the questions a person would likely have about the data. The description inevitably involved some kind of demonstration. People who were aware of existing capabilities such as data marts, business intelligence tools, or Hadoop and MapReduce would immediately equate my attempts with these alternatives. In fact, my last task on the project was to show how these technologies could replace what I had created from scratch earlier. For this post, I count even myself among those who may have oversimplified the project.
Some aspect of the project I had been doing for over a decade was fundamentally different from the usual approach of accumulating a huge store of raw data and then querying that raw data in one step (map the data to categories, then summarize the categories) to populate a rich report. To clarify the distinction, I suggested the concept of a data life cycle in which the data goes through successive steps toward an ultimate goal instead of leaping from raw data to a final report in one step. Although my project included some single-leap reporting, that was a special case of the more general multi-step approach.
To make my point, I offered my own definition of big data. The definition is inherently recursive: big data is data where mapping the data into categories and summarizing those categories results in big data. The point where a mapping-and-summarizing pass can directly feed a human-readable report is the point where the source data ceases to be big data.
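The recursive definition can be rendered as a short sketch. Everything here is illustrative: the `summarize` helper is a hypothetical stand-in for one categorize-and-summarize pass, and `report_limit` is an arbitrary threshold for "small enough to feed a human-readable report."

```python
from collections import Counter

def summarize(records):
    """Hypothetical stand-in for one categorize-and-summarize pass:
    map each (category, value) record to its category and count them."""
    return list(Counter(category for category, _ in records).items())

def is_big_data(records, report_limit=100):
    """Toy rendering of the recursive definition: data is big when one
    categorize-and-summarize pass still yields big data."""
    summary = summarize(records)
    if len(summary) <= report_limit:
        return False   # small enough to feed a human-readable report
    if len(summary) >= len(records):
        return True    # summarizing no longer shrinks it; still big
    return is_big_data(summary, report_limit)
```

In this toy, a handful of records with a few categories is not big data, while half a million records that summarize into half a million distinct categories still is, so another coarser level of categorization would be needed.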
As I mentioned in earlier posts, I approached data projects with an inherent assumption that data requires extensive scrutiny. To facilitate this scrutiny, I divided the project into many stages with intermediate checkpoints to check for problems at different levels of abstraction. Each checkpoint involved persistent storage of the intermediate summarized data. The checkpoints and intermediate storage permitted a rollback to the most recent stage without having to start all over from the raw data.
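A minimal sketch of this checkpoint-and-resume idea, assuming JSON files on disk as the persistent store (the stage names, file layout, and three-stage shape are my own illustration, not the original project's design):

```python
import json
import os

CHECKPOINT_DIR = "checkpoints"  # illustrative location for persisted stage output

def save_checkpoint(stage, data):
    """Persist a stage's output so later runs need not recompute it."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    with open(os.path.join(CHECKPOINT_DIR, f"{stage}.json"), "w") as f:
        json.dump(data, f)

def load_checkpoint(stage):
    """Load a previously persisted stage output; raises FileNotFoundError
    if the stage has never completed."""
    with open(os.path.join(CHECKPOINT_DIR, f"{stage}.json")) as f:
        return json.load(f)

def run_pipeline(raw, stages):
    """Run (name, function) stages in order. A stage whose checkpoint
    already exists is skipped, so a failure downstream rolls back only
    to the most recent completed stage, never to the raw data."""
    data = raw
    for name, fn in stages:
        try:
            data = load_checkpoint(name)   # already done: resume from here
        except FileNotFoundError:
            data = fn(data)
            save_checkpoint(name, data)    # persist before moving on
    return data
```

A rerun after a crash in a late stage would reload the earlier checkpoints from disk instead of reprocessing the raw data, which is the whole point of the intermediate storage.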
In hindsight, data life cycle was a poor choice of words to describe the process. Perhaps a better term would be an assembly line of information. In contrast to the documentation of data design as a subset of the software design discipline, my focus here is on the information content of the freshly arriving data instead of the data structures that carry that information.
An analogy for this concept is the process of building some retail product in distinct steps with intermediate results transferred to intermediate factories:
- A process to extract or harvest raw materials from the Earth into shippable units
- A process to refine that raw material into bulk stock
- A process to mill that stock into a form needed for manufacturing
- A process to shape that prepared stock into a product
- A process to finish the product
- A process to offer that product to the customer
In that analogy, steps 1 through 5 correspond to my big data steps, and step 6 corresponds to presenting a report. There may be any number of intermediate steps. A special case may be a single step of direct retail delivery of extracted material: for example, bottled water (oversimplified as pouring spring water into jugs). In general, most products require multiple steps, each with its own set of handling procedures and quality checkpoints.
I think this staged refinement approach is very important for large data projects because there are so many ways that information can fail or become corrupted. I found it easier to separate the data scrutiny into different levels of abstraction that match the intermediate levels of categorization and summary.
To repeat an earlier point, I tackled the project this way from the start. I never took seriously the notion that robust and highly trusted algorithms could be created to go straight from raw data to a finished report in a single step. I do recognize that many projects take this approach and appear to be successful, effectively cutting out all of the labor-intensive checkpoint checking inherent in my approach. Despite their successes, I am not convinced I can trust their results. Lucky for them, I'm not in a position where my trust needs to be earned.
In many prior posts, I described ways that data should be suspected. Direct observations can be degraded by incomplete documentation or control. Model data can bias the information with presupposed theories that interfere with our ability to recognize a changing understanding of reality. Some data may exist that has no operational purpose and thus is untested by the world. There are many problems I have not yet discussed involving the handling of data after it is observed but before it reaches my project.
In real life, it is my nature to be very anxious and jumpy. I naturally seek out potential risks, either by preparing tests for them in advance or by looking out for any I may have missed earlier. A nervous approach to data is to envision a multi-step checkpoint-and-rollback assembly line with labor involved in each step.
To a great extent, the assembly process can be automated with only occasional intervention by a human analyst/operator. But inherent in the multi-step design is a requirement for far more processing and storage resources than a single leap from raw data to a finished report would need. I'm challenged to defend my resource-expensive approach against a simple single-leap approach.
My assembly-line approach is a hard position to defend against a single-leap approach. It is hard because the single-leap approach never presents the opportunity to observe the intermediate results in bulk. In contrast, when I see an entire intermediate summary in bulk, I can't help but find patterns that make no sense and suggest something is wrong. In other words, in order to see problems at intermediate stages, you have to be able to look at that level of abstraction with its own reporting tools.
The best way to be sure that successive abstractions use the best-vetted information from the intermediate abstraction is to have the successive abstraction work from the intermediate data store. The alternative of going back to the raw data for a more summarized report could unexpectedly introduce its own intermediate error.
Look back at my hypothetical material-world manufacturing process. Each of the steps was meant to imply a geographically separate factory with its own capital and labor resources. I see the data assembly line in similar terms. There are multiple teams (in my case, one individual who did different tasks at different times of the day or week) working with separate databases to store the intermediate results.
For each database, the job was narrowed to what was necessary to get from one intermediate result to another. A particular stage had its dedicated database and its own rich suite of interconnected reporting tools to perform routine checks in order to approve the delivery of its summary data to the next stage.
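The approval gate between stages can be sketched as a handful of routine checks that must all pass before the summary is released downstream. The specific checks and thresholds here are my own illustrations, not the original project's rules:

```python
def check_nonempty(summary):
    """An empty summary usually means an upstream feed failed."""
    return len(summary) > 0

def check_no_negative_counts(summary):
    """Category counts can never legitimately be negative."""
    return all(count >= 0 for _, count in summary)

def check_total_within_bounds(summary, expected, tolerance=0.05):
    """The grand total should land near an expected volume (here, 5%)."""
    total = sum(count for _, count in summary)
    return abs(total - expected) <= tolerance * expected

def approve_for_next_stage(summary, expected_total):
    """Run the stage's routine checks; raise rather than silently pass
    suspect data to the next stage's store."""
    checks = [
        ("non-empty", check_nonempty(summary)),
        ("no negative counts", check_no_negative_counts(summary)),
        ("total near expected", check_total_within_bounds(summary, expected_total)),
    ]
    failures = [name for name, ok in checks if not ok]
    if failures:
        raise ValueError(f"summary held back, failed checks: {failures}")
    return summary  # approved: safe to deliver to the next stage
```

Raising on failure is the point: the data is held back for a human operator rather than quietly flowing into downstream reports.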
A single-leap approach with a single data store promises substantial cost savings over my approach. I still stand by the multi-step approach, but I admit it is hard to defend.
The first argument is that since I claim to know how to spot certain problems in order to invent intermediate steps, I should be able to spot the same problems with a single-leap approach: it just requires another single-leap report. Although different reports are used for different questions, good practice can assure reuse of algorithms. I concede that this is true. If I were the one tasked to produce the finished report, I would know how to spot problems and look for problems lurking behind the scenes. But the point of the project is that a team of analysts can use the same data. Even if all of the analysts were well skilled at spotting and investigating problems, it is a waste of labor to have each of them tracking down the same underlying problem. That investigation could be done in advance by a single analyst, who could resolve the issue (fixing the problem or alerting everyone to what is wrong with the data) before the regular analysts start their jobs.
Realistically, most final-data analysts are experts in their domains instead of experts in the fine points of the life of data. They are most efficient in their duties when they can be assured that the underlying data is trusted data. Most analysts lack either the patience or luxury to spend time scrutinizing the data itself. Their work is facilitated by accessing stored data that was previously summarized and verified.
Also, I object to the notion that reusing the same intermediate algorithms (through code reuse) will produce identical intermediate results for both an intermediate-level report and a final presentation report. In a single-leap approach, both reports run against the base data at different times, and the data could change between the two runs. If the intermediate report runs first, the following final report may include an error not previously present. Conversely, if the final report runs first, it may encounter a problem (such as missing data) that gets resolved before the intermediate report runs. Having the successive data stores assures that the later queries use the exact same intermediate data that was checked.
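The timing hazard is easy to demonstrate in a toy script. Here the raw store, the categorize-and-summarize pass, and the "change between runs" are all contrived for illustration:

```python
from collections import Counter

def summarize(raw):
    """One categorize-and-summarize pass: count records per category."""
    return Counter(raw)

raw_store = ["a", "a", "b"]

# Single-leap style: each report re-summarizes the live raw store at run time.
intermediate_report = summarize(raw_store)
raw_store.append("b")                 # raw data changes between the two runs
final_report = summarize(raw_store)
assert intermediate_report != final_report  # same algorithm, different answers

# Staged style: both downstream reports read the persisted, checked summary.
checked_summary = dict(intermediate_report)  # frozen after its checkpoint passed
report_one = dict(checked_summary)
report_two = dict(checked_summary)
assert report_one == report_two              # identical by construction
```

With the frozen intermediate store, agreement between reports is guaranteed by construction rather than by hoping the raw data held still.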
A second criticism is whether I’m being just too paranoid. Most of the problems I seek with intermediate steps are rare. Many that do occur are eventually self-correcting. Although my multi-step approach of scrutinizing intermediate results can speed up that correction, an argument can be made that this advantage doesn’t justify the added cost. Often the errors are minor and do not significantly impact the mission: we learn to tolerate or even expect transient periods of faulty data.
The second criticism is a strong argument. Big data projects are growing rapidly because they are limited only by the technology, and the technology continues to grow rapidly in its capacity to handle even more data. In several earlier posts, I've asserted that the limitation of big data projects is labor, not technology. If we agree that routine labor-intensive checks for data quality are needed, then the growth of big data will be severely hampered, not just by the relatively slow process of human scrutiny but by the lack of specialized labor who can perform those duties. The demand for ever-bigger data projects puts the burden on the data scientist to prove that this intermediate-step labor is necessary.
To that challenge, I offer the defense of the cost of regret. The problems may be rare, may be self-correcting, and usually are minor. But the credibility of the entire project is on the line every single day. There can be a huge cost penalty for making a mistake. That mistake may be missing something that should have been caught, or it may be making a mistake that should have been avoided (see, for example, a recent license-plate reader error). The cost of embarrassment, of having to defend a mistake, can be very steep. The cost could be the loss of respect of the stakeholders, and that could put the continued support of the entire project in jeopardy. The purpose of the project is to facilitate decision makers: to not miss opportunities and yet not make incorrect recommendations.
This data is being used for decision making where the decisions have real consequences. Real-world data is prone to errors in the information it carries. These errors may be occasional or rare, but they could be enough to influence a decision. If that decision were found to be wrong and the problem were traced back to bad data, the entire project would lose its credibility as being worth the investment.
That’s my take on it. I’m naturally an anxious person. I naturally worry that the information at my fingertips is not representative of the real world. I find comfort in a multi-step assembly-line approach of successive summary of categorized data with intermediate checkpoints overseen by operators.