Multiple versions of truth

Data warehouses provide a single source for sharing the same data across the various parts of an organization. One of the promoted advantages of such a system is that it offers a single version of truth that every department can use for its analysis. This single source of truth makes it more likely that reports created independently by different departments will be compatible. The results are compatible when the reports overlap, such as when two departments discuss sales figures. The results are also compatible when one report covers material that is related to the other, such as the number of shipments being related to the number of sales. The data warehouse is designed and maintained with the goal of making the available data as close to the truth as possible so that this truth can be distributed throughout the organization.

In my last job, I worked in an enterprise that had this type of organization-wide data warehouse. Although this was useful, the department I worked in had additional data that was not in that data warehouse. For multiple reasons it was not practical to add this data to the data warehouse. Also, at the time this data was relevant only to the mission of this single department. Our solution was to produce a separate data mart that handled this unique data but reached out to the central data warehouse to obtain the additional relevant data that was the accepted truth.

We confronted many problems in matching the observations in our unique data to the clean, accepted data of the corporate data warehouse. One of these problems arose when our unique raw observation data conflicted with the accepted data. In one case, it was a conflict over the definition of a customer. For good reasons, the central data warehouse defined a customer as someone who has fully entered into a service level agreement and has paid his bills up to date. The problem was that our data showed legitimate cases of delivering services to entities that were not yet official customers. These included entities that were in the process of becoming customers or that had terminated their contracts but had not yet fully left the service. From the point of view of the central data warehouse these were non-entities, but they were relevant for our purposes. Inevitably our solution ended up with a different version of truth, in the form of a refined definition of customer within our data mart. We needed to develop new ways to uniquely identify these phantom customers and yet still distinguish which entities were or were not recognized as customers within the central database.
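One way to picture the refined definition is a small classification that keeps the central warehouse's view alongside the department's own view. The sketch below is only an illustration: the status names, the fields of `Entity`, and the `classify` function are hypothetical and are not taken from the actual data mart.

```python
from dataclasses import dataclass
from enum import Enum


class MartStatus(Enum):
    CUSTOMER = "customer"      # recognized as a customer by the central warehouse
    ONBOARDING = "onboarding"  # receiving service, agreement not yet complete
    DEPARTING = "departing"    # contract terminated, service not yet removed
    UNKNOWN = "unknown"        # not a central customer and no service observed


@dataclass(frozen=True)
class Entity:
    entity_id: str
    in_central_warehouse: bool  # central definition: agreement signed, bills paid
    receiving_service: bool     # seen in the department's own observations
    contract_terminated: bool = False


def classify(e: Entity) -> MartStatus:
    """Assign the department's status while preserving the central one."""
    if e.in_central_warehouse:
        return MartStatus.CUSTOMER
    if e.receiving_service and e.contract_terminated:
        return MartStatus.DEPARTING
    if e.receiving_service:
        return MartStatus.ONBOARDING
    return MartStatus.UNKNOWN
```

Because `in_central_warehouse` is stored on every record, a report can still be filtered down to the centrally recognized customers whenever it has to agree with the corporate figures.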

This kind of problem can also occur in the central data warehouse. There it may be addressed in various ways, such as carefully modeling the data and using specific access privileges to make the right version of the data available to the right departments. We encountered it at the department level because we had data we could not hand over to the central data warehouse.

We encountered another problem in that our data provided redundant observations.   We would measure the same basic quantity but at different points.   These quantities must be identical in the real world.   There could only be one true value for any particular case.  However, we had observations from multiple sensors and they would not agree.    The sensors distributed over the globe had different model numbers, different vendors, and different configurations.    Sometimes the differences in measurements were minor and easy to explain as minor time-of-measurement errors.   Even these small errors needed to be reconciled to report the one underlying value accepted as truth.   We also encountered much larger errors due to different approaches for making measurements and even due to outright system errors involving faulty hardware or even incorrect implementations that needed upgrades to fix.

In the above example, there is a single version of truth but there are multiple observations.   The first question is identifying the single version of truth.   We approached this initially from the identified root cause of slight differences in time of measurement.   We selected algorithms that satisfactorily handled this root cause.
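The algorithms we used are not described here, so the following is only a minimal sketch of the general idea, assuming a median is an acceptable way to settle small disagreements. The function name `reconcile`, the sensor names, and the 2% tolerance are all hypothetical.

```python
from statistics import median


def reconcile(readings: dict[str, float], tolerance: float = 0.02):
    """Pick one accepted value from redundant readings of the same quantity.

    readings maps a sensor name to its reported value. Returns the accepted
    value (the median) and the readings that differ from it by more than the
    relative tolerance. The raw readings are not modified or discarded.
    """
    if not readings:
        raise ValueError("need at least one reading")
    accepted = median(readings.values())
    scale = max(abs(accepted), 1e-12)  # avoid dividing by zero
    outliers = {
        name: value
        for name, value in readings.items()
        if abs(value - accepted) / scale > tolerance
    }
    return accepted, outliers
```

Small time-of-measurement differences fall inside the tolerance and are simply absorbed into the accepted value, while anything larger is returned separately for investigation.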

Far more challenging were the cases where the discrepancies were due to faulty hardware, mistakes in configuration, or even software errors in the sensor that occasionally (but not always) generated incorrect results. An example of the last case was a fault that occasionally duplicated an observation, so that the reported value was exactly twice the real value. Adapting to these errors required investigations to find and characterize the root cause. If the root cause could not be corrected, such as by replacing faulty hardware, we needed to introduce new algorithms to select the correct value so that we would continue to report the true value.
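A check for the doubling fault might look something like the sketch below. It is not the algorithm we actually used; it assumes that at least one other sensor reports the same quantity, and the name `flag_doubled` and the 1% tolerance are hypothetical.

```python
from statistics import median


def flag_doubled(readings: dict[str, float], rel_tol: float = 0.01) -> list[str]:
    """Return sensors whose reading is about twice the median of the others."""
    suspects = []
    for name, value in readings.items():
        others = [v for n, v in readings.items() if n != name]
        if not others:
            continue  # nothing to compare against
        baseline = median(others)
        doubled = 2 * baseline
        # Flag the reading when it sits within rel_tol of twice the baseline.
        if baseline > 0 and abs(value - doubled) <= rel_tol * doubled:
            suspects.append(name)
    return suspects
```

A flagged reading is only a suspect. With just two sensors the check cannot tell a doubled reading from a second sensor that reports half the real value, which is why the root cause still has to be investigated.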

As things proceeded we recognized there were two categories of data. One category was the accepted true values that were available for analysis. The other category was the real observations that formed the basis for deriving those accepted true values. Sometimes the real observations disagreed wildly with the accepted value.

If this process were implemented in the centralized data warehouse, the department very likely would not have access to the rejected underlying observation data. The reason is that the department would define its requirements for valid data and the data warehouse group would commit to that specification by developing their own processes for deriving that truth from the raw observations. The raw observations would not be available to the department because the department would not be able to justify that need in terms of the department’s mission. Because we established our own data mart, we had access to all of the original observations, including the observations that were rejected.

This gave us an opportunity to examine the rejected data in comparison with the accepted data. For example, we had some sensors that could measure down to single digits while another would measure only to the nearest thousand. For our purposes, we would select the more precise sensor for the true value. But we still had access to the less precise measurement. That less precise value became a second version of the truth. That second version of truth often became valuable. For example, a configuration change (such as an upgrade) may change how the two versions of truth relate to each other. An analogy is that the less precise sensor may change its algorithm for rounding to the nearest thousand. The less precise value may still be rejected, but it may suggest that the accepted, more precise value is not accurate. This inaccuracy may be discovered by independently applying the new rounding algorithm to the more precise value and finding that the computed result does not match the value reported by the sensor that does the rounding automatically. We may discover a reason to suspect that the accepted truth may not be accurate.
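The rounding analogy can be sketched in a few lines. The rule below (round half up to the nearest thousand, for non-negative integer values) is an assumption made for the illustration, and both function names are hypothetical.

```python
def round_to_thousand(value: int) -> int:
    """Round a non-negative integer half up to the nearest thousand.

    This is the rule the coarse sensor is assumed to use in this sketch.
    """
    return ((value + 500) // 1000) * 1000


def coarse_disagrees(precise: int, coarse: int) -> bool:
    """True when the coarse sensor's report does not match the value we
    compute ourselves by rounding the precise sensor's reading."""
    return round_to_thousand(precise) != coarse
```

A disagreement does not say which sensor is wrong. It only shows that the two versions of truth no longer relate to each other in the expected way, which is a reason to look more closely at the accepted value.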

In fact, with our access to the actual observations in addition to the selected trusted value, we found many occasions that challenged the accepted value. This allowed us to identify improvements. In principle this could also occur where the data warehouse team takes responsibility for inserting the selected observations into a data warehouse. My experience, however, was that many of the problems we found could only be found by this department that had a high stake in this particular type of data. I doubt that this can be adequately explained to a centralized team that has to simultaneously manage the ingest of data from everywhere in the organization. Our department had a high stake in this particular subset of the truth, and we discovered a need for access to the original observations even though we probably would not have been able to justify it initially if we had had to submit a request for change to the centralized data warehouse.

After over a decade of experience with this data, the concepts have matured to the point where the data will probably be integrated into the central warehouse to bring the department in line with the organization's objective of a single version of truth. With the experience gained from working with this data for so long, the department has the information it needs to define requirements that are appropriate for its use of this data.

However, there is a recent development to build a parallel data warehouse solution for something termed SIEM (security information and event management). This is a big-data solution that parallels the central data warehouse, but its goals are the opposite. From what I have learned about SIEM, I would characterize its goal as presenting all of the possible observations contending for the status of being the truth. It pursues the same basic objective as our internal department solution: confronting multiple potentially conflicting observations of the same basic event or quantity.

The SIEM tools are targeting a completely different need.    In contrast to the operational needs for a single version of truth to be provided by a central data warehouse, there is also a need for multiple observations to track down and resolve problems.   The term security in the name of the technology refers in part to its exploiting detailed log data at different devices where that log data was often meant for internal monitoring purposes.   I recognize that the term also refers to the value of using this data for investigations to improve security of information systems, but its utility is broader than finding malefactors.

What is interesting is that the capacity of SIEM tools makes it possible to enable much more extensive logging of the operations of all equipment. This logging includes operationally relevant information. Inevitably, the SIEM tools will include multiple observations of the single truth value reported by the central data warehouse. The two projects overlap in scale and scope. Both projects have the ultimate goal of covering the entire organization. The difference is that the SIEM will have redundant, potentially conflicting observations while the data warehouse will settle on the single value accepted as truth. On the other hand, the data warehouse will contain human-generated information unavailable from automated log files.

The two technologies represent substantial parallel investments in license fees and hardware operational costs. However, there is a requirement for both because they serve two different essential missions of the organization. The central data warehouse, with its single version of truth, is essential for operating the business. The SIEM tools, with their access to all possible observations, are essential for investigating problems in performance, configuration, system faults, or security incidents.

In earlier posts I proposed a dichotomy that separates the sciences into present-tense operational sciences and past-tense historical sciences. The former deal with the challenges of effectively interacting with the real world. The latter deal with the challenges of reconstructing what really happened in the past. I think these two data solutions match this dichotomy. The central data warehouse for running the organization is present-tense operations. The SIEM tools for collecting all observations are past-tense analysis.

There is still an expectation that there should be just one single data store for the entire organization.   Perhaps the two requirements eventually will be combined into a single solution.   But at least from my limited perspective, it appears the two are supported independently for different objectives, one to operate the business and the other to manage and protect the infrastructure.   The two objectives are naturally distinct.

Data warehouses have a long history, growing out of database technologies that use data modeling in SQL databases to optimize schemas and technologies for specific data warehouse purposes. In contrast, the SIEM tools exploit the newer NoSQL approaches that are better suited to handling large quantities of unstructured data.

The careful structuring of data into older database technologies is ideal for the promise of the single version of truth.   In a highly normalized database, there will be only one value for any particular quantity.   Clearly, to operate a complex organization, there is a need for various departments to use the same data.   Database technologies are well suited to manage this kind of common data.

However, this very same structuring of data necessarily eliminates the redundant and conflicting data available from the raw observations. The NoSQL unstructured data approaches are much better able to manage this type of data.

There are legitimately two separate data requirements to support an organization. The organization needs consistent data to operate the business. The organization also needs access to all of the original observations to investigate historical events (maybe seconds old, but still lost to history) in order to make new discoveries. These discoveries may be to find faulty equipment or configurations, or to track down some malicious activity. The objective of making new discoveries places a premium on access to the original observations over the one selected value that fits the current definition of truth. The objective of the discovery is to challenge that truth.