Cleaning big data, what happens to the dirt

When I began this blog, I coined the phrase hypothesis-discovery to describe my experiences working with unruly data.   Our project gained a reputation for discovering things that others missed, and I attribute that to how we handled the data.

One of the things we did differently was to postpone data cleaning until after the data was in the database.   The cleaning process involved successive refinements of the data from one table to the next, while retaining the earlier tables.   This let us perform cleaning operations with the same effective query and reporting technologies the database already offered.   It also granted us perpetual access to the as-delivered data: the dirty data.

In many projects the data cleaning process occurs outside of the database, as if the entire concept of a database demands only the purest and cleanest of data.    While the database has various tools to insist on clean data, such as column constraints or triggers, these tools are optional.

In the SQL Server environment I am familiar with, Integration Services provides a variety of tools for managing data-cleaning workflows.   Although the tool can handle a range of data operations from external sources, it is perhaps easiest to use as a sequencer for data operations within a single database.   These operations adjust data from a less clean table so that it becomes appropriate for a cleaner table.    Eventually there is a final clean table that supports end-user reporting for an analytic database.
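The staged refinement described above can be sketched as an in-database cleaning step. This is a minimal illustration using Python's sqlite3 as a stand-in for the actual SQL Server environment; the table and column names (raw_orders, clean_orders, amount) are hypothetical:

```python
import sqlite3

# raw_orders holds the as-delivered (dirty) data; clean_orders is the
# refined stage with constraints.  Both stay in the same database, so the
# earlier, dirtier stage remains queryable at any time.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE raw_orders (order_id TEXT, amount TEXT);   -- as delivered
    CREATE TABLE clean_orders (order_id TEXT PRIMARY KEY,
                               amount REAL CHECK (amount >= 0));
    INSERT INTO raw_orders VALUES ('A1', '19.95'), ('A2', 'n/a'), ('A1', '19.95');
""")

# Refinement step: cast, deduplicate, and skip rows that cannot be parsed.
con.execute("""
    INSERT INTO clean_orders
    SELECT DISTINCT order_id, CAST(amount AS REAL)
    FROM raw_orders
    WHERE amount GLOB '[0-9]*'
""")

print(con.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0])    # 3
print(con.execute("SELECT COUNT(*) FROM clean_orders").fetchone()[0])  # 1
```

The point of the arrangement is that raw_orders is not discarded after the refinement: the unparseable and duplicate rows are still there to be queried.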

As I read about other projects, I realize that they often invest in cleaning before the data arrives in the database.   One of my challenges was communicating my approach, because I had no single database but instead several.    Typical projects, it appears, can confidently talk about “the database”: there is the database, and there is everything outside of it.

Their database is a fortress for holding clean data.   My approach was to treat the database system more like a small city of artisan businesses performing various tasks, where only one of the businesses was the purveyor of the clean data.

I didn’t treat the database as some temple that must be respected for its elegance and cleanliness.   I treated it like the streets of the city: not filthy, but certainly not polished so brightly that you have to take your shoes off to enter.

The result is that I had dirty data in various stages of refinement residing in the database.   This data remained available for ad hoc querying at any time.    We frequently turned to it to investigate how a final result was derived.   One way to describe this is the postponement of data cleaning, or of applying business rules, until those rules were actually needed.

It was this access to less clean data that provided the edge in discovering unexpected problems.    In earlier posts, I described the problem of imposing models on data, where we substitute assumptions about how the world works for what was actually observed or not observed.   I called these forbidden data and dark data, respectively.   Applying business rules frequently involved this kind of data.  To make the data clean, we need to impose constraints on what can enter the data, and we sometimes need to invent substitutes for missing data.

With the delayed business rule approach we used, we developed a culture of writing many queries to provide cross-checks on the data.   The cross-checks might test for cleanliness, such as whether the keys matched by any foreign key were unique, or whether there were any extreme values.
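Cross-checks of that sort can be sketched as plain queries. A minimal illustration, again using SQLite as a stand-in, with a hypothetical staging table and range thresholds:

```python
import sqlite3

# Two typical cleanliness cross-checks over a staging table: duplicate keys
# (which any foreign key reference would ambiguously match) and values
# outside a plausible range.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE staged_items (item_id TEXT, weight REAL);
    INSERT INTO staged_items VALUES ('K1', 12.0), ('K1', 12.0), ('K2', -5.0);
""")

# Keys that occur more than once.
dup_keys = con.execute("""
    SELECT item_id, COUNT(*) AS n
    FROM staged_items GROUP BY item_id HAVING n > 1
""").fetchall()

# Values outside an assumed plausible range.
extremes = con.execute("""
    SELECT item_id, weight FROM staged_items
    WHERE weight < 0 OR weight > 1000
""").fetchall()

print(dup_keys)  # [('K1', 2)]
print(extremes)  # [('K2', -5.0)]
```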

Alternatively, the cross-checks would implement the same algorithm in two different ways.   One query might produce a total sum of some quantity, while another query would list the components of that sum, with the additional burden of finding the matching keys to present a human-readable report.    Even when the two reports produced the same total value, the illustrative naming of the components could give clues that something was not correct.    It was like finding an ant in a cookie jar: it was not supposed to be there, but it had all along been adding to the reported weight of available cookies.
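The two-ways cross-check might look like the following sketch (SQLite as a stand-in; the sales and products tables are hypothetical). One query computes the grand total directly; the other sums named components, and the itemized report is where the “ant” can show itself even though the totals agree:

```python
import sqlite3

# Same figure computed two independent ways: a direct grand total, and a
# component breakdown joined out to human-readable names.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (product_id TEXT, revenue REAL);
    CREATE TABLE products (product_id TEXT PRIMARY KEY, name TEXT);
    INSERT INTO sales VALUES ('P1', 100.0), ('P2', 50.0), ('P9', 0.25);
    INSERT INTO products VALUES ('P1', 'Widget'), ('P2', 'Gadget'),
                                ('P9', 'Discontinued test item');
""")

# Way 1: one number, straight from the fact table.
total = con.execute("SELECT SUM(revenue) FROM sales").fetchone()[0]

# Way 2: the components of that number, with readable names.
components = con.execute("""
    SELECT p.name, SUM(s.revenue) AS rev
    FROM sales s JOIN products p ON p.product_id = s.product_id
    GROUP BY p.name ORDER BY rev DESC
""").fetchall()

# The totals agree...
assert abs(total - sum(rev for _, rev in components)) < 1e-9
# ...yet the itemized list shows 'Discontinued test item' contributing:
# the ant that was quietly adding to the reported weight all along.
```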

In the context of a cookie jar, especially one on a display counter for customers, an ant represents dirt that doesn’t belong there.    For a cookie store, this must be avoided or, if discovered, immediately resolved, perhaps by throwing out the entire jar and replacing it with a fresh batch.   The customer (or county inspectors) must never see ants in cookie jars.

I like that analogy because a proper business rule would be to remove any ants that entered the batch before getting the jar on display.    Since this removal would occur before any jars were involved, its occurrence would not be available for discovery by the customers or the store-front inspector.   The occurrence might be recorded in some manually entered log book, but that book would be hard to query.   Usually when it comes to logged data, we only check that the log book is being maintained rather than actually search it for something.

In contrast, a delayed business rule would at some point have a jar containing cookies plus an ant.   The presentation jar may include a step to remove that ant, but the earlier jar had the ant in it.   If we could somehow photograph each step, we would have a record that at one point there was a jar with cookies and an ant.   Alternatively, if we weighed the jar before and after, the weights would not be identical.

This ability to discover that an ant managed to get into the cookie batch resembles the kind of discoveries we would frequently make.   We had the ability to produce similar database-driven reports for both the clean tables and the less clean tables.   We could even combine the two tables and see the discrepancy in a single report.
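A combined report over the clean and less clean stages can be sketched as a single join, since both tables live in the same database (SQLite stand-in, hypothetical tables and counts):

```python
import sqlite3

# Side-by-side comparison of a raw stage and a clean stage in one query,
# surfacing exactly where the cleaning changed the numbers.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE raw_counts (region TEXT, n INTEGER);
    CREATE TABLE clean_counts (region TEXT, n INTEGER);
    INSERT INTO raw_counts VALUES ('North', 105), ('South', 90);
    INSERT INTO clean_counts VALUES ('North', 100), ('South', 90);
""")

diff = con.execute("""
    SELECT r.region, r.n AS raw_n, c.n AS clean_n, r.n - c.n AS dropped
    FROM raw_counts r JOIN clean_counts c ON c.region = r.region
    WHERE r.n <> c.n
""").fetchall()

print(diff)  # [('North', 105, 100, 5)] -- five rows were cleaned away
```

The discrepancy row is the starting point for tracing back to where the “ant” first entered.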

I admit it is a bad thing for a cookie retailer to have an ant get into a batch of cookies.   However, precisely because it is a bad thing, we should have a way to discover that it is happening.   We should have the same reporting access to the ant as we do to the clean data.    This is important because we can then observe that somehow an ant got into the batch of cookies, work back to figure out when it got there, and consider how it possibly could have gotten there.   Perhaps it was in the jar before the cookies were added, or perhaps it was on one cookie that could be traced to a particular table where the pan was resting.    The discovery of the ant, and the tracing back to where it may have first entered, is what gave our project, with its easy access to dirty data, its unique value.

In a later project, we employed Integration Services to produce an entire schema of nothing but clean data, with hard constraints on many columns.   This schema was mapped to a multi-dimensional database that further required everything to be fully and unambiguously populated.  When that project was complete, we had a multi-dimensional cube.

I recall my first reaction when I started to use the cube.   The first step was pulling out the measure, and it immediately displayed a total as a single cell.    This one number is the sum of every cell of the cube across all of its many dimensions.

This was exciting because, from that point, there was no way I could find a conflict.   Adding any dimension to the query would immediately and properly subdivide the measure into the different categories, which would add up to exactly the same number.   There is no way a properly constructed cube could come up with a contradiction.
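That cube property has a simple relational analogue, sketched here over a hypothetical fact table (SQLite stand-in): grouping by any dimension partitions the measure, so every breakdown sums back to the same grand total.

```python
import sqlite3

# A fully populated fact table with no NULLs: every GROUP BY partitions the
# rows, so subtotals by any dimension must reconcile with the grand total.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE facts (region TEXT NOT NULL, year INTEGER NOT NULL,
                        amount REAL NOT NULL);
    INSERT INTO facts VALUES ('North', 2013, 10.0), ('North', 2014, 20.0),
                             ('South', 2013, 30.0), ('South', 2014, 40.0);
""")

grand_total = con.execute("SELECT SUM(amount) FROM facts").fetchone()[0]
by_region = con.execute(
    "SELECT SUM(amount) FROM facts GROUP BY region").fetchall()
by_year = con.execute(
    "SELECT SUM(amount) FROM facts GROUP BY year").fetchall()

# Every breakdown adds back to the one number in the single cell.
assert sum(row[0] for row in by_region) == grand_total
assert sum(row[0] for row in by_year) == grand_total
```

This internal consistency is exactly what makes the cube reliable for reporting, and, as the next paragraph notes, exactly what makes it incapable of revealing a conflict with the outside world.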

This was also disappointing for the same reason.   The cube and the back-end relational data were completely free of any conflicts.   There was no way to query either of these databases and discover that something might be wrong with our understanding of the real world.

I have mentioned in earlier posts the concept of a clean database holding the single version of truth.   By basing all decisions on this single data source, the decisions will not conflict with each other.   There is real value in having a clean database of a single version of truth for making decisions.

However, this clean single version of truth also represents an idealization of what is really happening in the real world.   In this clean view of the world, there are no ants, and they never get into cookie jars.    In informal conversation, decision makers would admit they know that ants sometimes get into cookie jars, but that this is well handled by cleaning processes and so should not cloud their decision making.

There is much talk about combining predictive analytics with big, clean data.    I recognize there is power in exploiting large amounts of data, and that power is enhanced by the idealized cleanness of the data.   However, I wonder about the relative value of predictions based on idealized data compared with predictions based on dirty data.

In my experience, there can be tremendous value in discovering and tracking down that ant.
