Reverse data governance, protecting source integrity

A lot of talk about data governance (such as this article) concerns the problem of assuring that high-quality data delivered from external providers meets the standards defined by, or imposed on, the client of the data. The final user of the data needs to assure regulators, or satisfy other governing contracts, that the data it is using is accurate and was handled correctly. The user can assure this by imposing on data providers a contract to deliver data meeting specific terms, including being held accountable for any failure to meet the needs for the data.

This concept of data governance, applied at the end and propagated back to the source, does not address the data provider’s own need to defend the value of its data.

As demonstrated by the fact that a client contracts with the provider to obtain the necessary data, the provider’s data has inherent value. The provider may have multiple clients with varying contract terms, or perhaps it has standardized on a particular set of standards such as complying with particular government regulations. The flaw in policing governance at the data delivery point is that this requires the provider to release the valuable source data to that client. After releasing the data, the provider loses control over protecting the value of that data. The provider must rely on contracted assurances that the client will handle the data properly to prevent data corruption, contamination, misapplication, or improper distribution. Any such misuse of the provider’s data can result in real harm to the provider. The provider is at the mercy of the client to avoid damaging the reputation of the provider’s data when the client misrepresents its own handling error as traceable to the provider’s data.

While it is easy for the client to enforce its data governance standards on data it receives, it is much more difficult for the data provider to police the client’s adherence to the provider’s terms.

Part of data governance involves protecting private or otherwise sensitive data from unauthorized disclosure. If this sensitive information is at the data source and needs to be delivered to the client, then again the provider loses control over protecting the data. Although contractually the secure handling of sensitive data is handed off to the client, the provider loses direct control over protecting this information. A provider may have multiple clients. When a breach is discovered, the provider may be held accountable without an effective means to identify which client mishandled the data.

In an earlier post, I suggested a revised data supply model where the value-enriching data stays at the source and the clients instead send their data to the provider to be enriched. The client’s data will probably be stripped to the minimum necessary to allow for proper keying with the enrichment data. The data returned would be only the requested enrichment data, and only that data that matches the key.
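As a rough sketch of this exchange, assume the provider keeps its enrichment records in a simple keyed store and the client sends only keys and the names of the fields it wants. The store contents, the field names, and the `enrich` function are all hypothetical illustrations, not a description of any real service.

```python
from typing import Dict, Iterable, List

# Hypothetical provider-side store: key -> full enrichment record.
# The client never sees this structure, only what enrich() returns.
PROVIDER_DATA: Dict[str, Dict[str, object]] = {
    "cust-001": {"credit_band": "A", "region": "NE", "internal_score": 812},
    "cust-002": {"credit_band": "C", "region": "SW", "internal_score": 544},
}


def enrich(keys: Iterable[str], fields: List[str]) -> Dict[str, Dict[str, object]]:
    """Return only the requested fields, and only for keys the provider holds."""
    result: Dict[str, Dict[str, object]] = {}
    for key in keys:
        record = PROVIDER_DATA.get(key)
        if record is None:
            continue  # unmatched keys yield nothing at all
        result[key] = {f: record[f] for f in fields if f in record}
    return result


# The client sends stripped-down keys and asks for one field.
print(enrich(["cust-001", "cust-999"], ["credit_band"]))
```

The unmatched key and the unrequested fields never leave the provider, which is the point of keeping the enrichment data at the source.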

This model is identical to the old database model where the server holds the data and the client must supply precise select queries that specify constraints on the columns and rows that can be delivered. Database servers have long had security protections to grant different levels of access for individual tables, rows, columns, and even cells. The client would need to provide a correct query, and the results would be exactly what the query requested. This old database model had built-in support for data governance, especially from the point of view of the owner of the valued enrichment data. The data stays in the control of the provider and only the minimally necessary data is released to the client, who in turn will still agree to handle the data properly to protect the value of the source.
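To illustrate the kind of column and row level control described here, the following is a minimal sketch in plain Python rather than in any particular database product. The `Policy` class, the table contents, and the client name are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

Row = Dict[str, object]


@dataclass
class Policy:
    """Hypothetical per-client policy: which columns and which rows are visible."""
    allowed_columns: Set[str]
    row_filter: Callable[[Row], bool]


# Hypothetical provider table and policy registry.
TABLE: List[Row] = [
    {"id": 1, "region": "NE", "revenue": 120, "tax_id": "secret-1"},
    {"id": 2, "region": "SW", "revenue": 300, "tax_id": "secret-2"},
]
POLICIES: Dict[str, Policy] = {
    "client_a": Policy({"id", "region", "revenue"}, lambda r: r["region"] == "NE"),
}


def select(client: str, columns: List[str]) -> List[Row]:
    """Answer a query only within the limits of the client's policy."""
    policy = POLICIES.get(client)
    if policy is None:
        raise PermissionError(f"unknown client: {client}")
    denied = set(columns) - policy.allowed_columns
    if denied:
        raise PermissionError(f"columns not permitted: {sorted(denied)}")
    return [{c: row[c] for c in columns} for row in TABLE if policy.row_filter(row)]


print(select("client_a", ["id", "revenue"]))
```

A request for a column outside the policy is refused outright instead of being silently trimmed, so the client learns that its query was not honored as written.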

The idea of a data supply chain may be extended multiple levels deep.   This introduces a multiple tier query approach that historically presented significant performance challenges.   With the modern availability of robust cloud-based data storage, it is possible to have consortia that will coordinate deployment of their data to optimize multiple tier data enrichment for specific markets.    I used the analogy of the automobile industry where the parts providers were located conveniently to assure timely delivery of parts as they are needed.   A similar staged approach can occur with commonly combined data enrichment providers.

This supply chain approach keeps the data at the source and allows the source to police the necessary rules to protect the value of its data. In addition, the source has an opportunity to optimize the data and its processing using the history of queries, for example their performance or frequency. If the data were delivered to the client, the client would have the burden of optimizing this processing. And if there are multiple clients, then each one would have to reinvent the optimization processes even though their queries may be very similar. Just as in older database implementations, it is more efficient to have these optimizations performed by a centralized team skilled specifically in the data and the possible queries for that data.
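One simple way a source could use its query history is to count how often each query shape arrives and treat the most frequent shapes as candidates for precomputation or indexing. The `QueryLog` class below is a hypothetical sketch of that bookkeeping only; it does not perform any optimization itself.

```python
from collections import Counter
from typing import List, Tuple


class QueryLog:
    """Hypothetical provider-side log of requested field combinations."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, fields: List[str]) -> None:
        # Sort the fields so that the same request in a different order
        # counts as the same query shape.
        self._counts[tuple(sorted(fields))] += 1

    def top_shapes(self, n: int) -> List[Tuple[Tuple[str, ...], int]]:
        """Return the n most frequent query shapes with their counts."""
        return self._counts.most_common(n)


log = QueryLog()
log.record(["region", "credit_band"])   # from one client
log.record(["credit_band", "region"])   # same shape from another client
log.record(["internal_score"])
print(log.top_shapes(1))
```

Because the provider sees the queries of all of its clients, it can notice that two clients are asking for the same thing, which no single client could discover on its own.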

This approach may have broader value in the context of protecting privacy. Starting a process to build a data enrichment supply chain may permit individuals to manage their own data. If someone wants an individual’s private information, then the query request would have to go to the specific individual, who will retain ultimate control over whether to release information that he considers private or sensitive.

Cloud technologies can enable this by co-hosting this sensitive data in individual-level data stores that will individually authorize queries of their private data. The cloud service could provide algorithms or protocols to retrieve this information from arbitrary populations in an efficient manner even though each individual’s service will handle the actual query for this information. There can be communities set up to arrange for collective agreements with standardized terms so that downstream clients will have an opportunity for efficient implementations of widespread queries even though each individual retains control over granting access. Hosting this data in separate virtual stores within the same cloud service could offer the opportunity to develop efficient protocols for bulk querying even though each individual has the ultimate decision on whether to allow the query to access his data. Again, this is very similar to how row, column, or cell level access control works: each level has a final say on whether to allow access to proceed. This fine-level access control can occur in the cloud services of many virtual data stores.
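A bulk query over individually controlled stores might look something like the sketch below. The `PersonalStore` class, the idea of granting fields per stated purpose, and the sample data are all assumptions for illustration. Note that this simple version does not let the client distinguish a refusal from missing data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class PersonalStore:
    """Hypothetical per-individual store that authorizes each query itself."""
    owner: str
    data: Dict[str, object]
    # purpose -> set of field names the owner agrees to release for it
    grants: Dict[str, Set[str]] = field(default_factory=dict)

    def query(self, purpose: str, field_name: str) -> Optional[object]:
        if field_name in self.grants.get(purpose, set()):
            return self.data.get(field_name)
        return None  # the owner has not granted this access


def bulk_query(stores: List[PersonalStore], purpose: str, field_name: str) -> Dict[str, object]:
    """Ask every store; collect answers only from owners who allow the query."""
    results: Dict[str, object] = {}
    for store in stores:
        value = store.query(purpose, field_name)
        if value is not None:
            results[store.owner] = value
    return results


stores = [
    PersonalStore("alice", {"zip": "20001"}, {"census": {"zip"}}),
    PersonalStore("bob", {"zip": "94105"}),  # bob has granted nothing
]
print(bulk_query(stores, "census", "zip"))
```

The bulk protocol can be as efficient as the hosting service allows, but the decision in `query` always belongs to the individual store.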

This governance discussion starts from the perspective of the data source, which has a stake in protecting the value of the data that is being requested by the downstream clients. Ultimately, there remains a need for the client to demand data governance: if the source permits a query, then the source assures the client that the data is valid, timely, and appropriate to the query.

In the example of individually policed private data, an individual has the option of denying access to a particular query. However, if the individual agrees to release that information, he agrees to release valid data appropriate for the query.

Any data source that agrees to honor a query should not have the option of providing invalid data.   There would need to be some technical solution to assure that released private data otherwise only available to the individual is in fact accurate data.    This may involve delivering data with a digitally signed certification of a trusted authority that has confirmed the validity of the data or a legally binding agreement by the provider that the data is the best according to his ability to verify.

The technical performance issues for a distributed data enrichment supply chain may be addressed by optimized protocols for data hosted in cloud services. However, having individual signature data for each returned data value will increase the volume of information retrieved by the query. The query would return both the requested information and a chain of certifications in which each source affirms its commitment to the validity of its contribution to the data.
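A chain of certifications could be sketched as a list of links, where each source signs its own contribution together with the signature of the link before it. The example below uses HMAC with a per-source shared secret purely to stay within the Python standard library; that choice is an assumption for illustration, and a real deployment would more likely use public-key signatures so that clients never hold the signing keys. The function names and link layout are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Dict, List


def _payload(source: str, contribution: dict, prev: str) -> bytes:
    # Canonical JSON so signer and verifier hash identical bytes.
    body = {"source": source, "contribution": contribution, "prev": prev}
    return json.dumps(body, sort_keys=True).encode("utf-8")


def add_link(chain: List[dict], source: str, secret: bytes, contribution: dict) -> List[dict]:
    """Return a new chain with this source's signed contribution appended."""
    prev = chain[-1]["signature"] if chain else ""
    sig = hmac.new(secret, _payload(source, contribution, prev), hashlib.sha256).hexdigest()
    link = {"source": source, "contribution": contribution, "prev": prev, "signature": sig}
    return chain + [link]


def verify_chain(chain: List[dict], secrets: Dict[str, bytes]) -> bool:
    """Check every link's signature and that each link points at the one before it."""
    prev = ""
    for link in chain:
        secret = secrets.get(link["source"])
        if secret is None or link["prev"] != prev:
            return False
        expected = hmac.new(
            secret, _payload(link["source"], link["contribution"], link["prev"]), hashlib.sha256
        ).hexdigest()
        if not hmac.compare_digest(expected, link["signature"]):
            return False
        prev = link["signature"]
    return True


secrets = {"tier1": b"tier1-key", "tier2": b"tier2-key"}  # hypothetical keys
chain = add_link([], "tier1", secrets["tier1"], {"region": "NE"})
chain = add_link(chain, "tier2", secrets["tier2"], {"credit_band": "A"})
print(verify_chain(chain, secrets))
```

Each link adds a fixed-size signature to the response, which is the extra volume mentioned above.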

I believe this extra certification-chain information is a benefit and ultimately would be mandatory.

In an earlier post, I described a news report that displayed a copy of an email message printed in familiar email format and asked the viewers to react to the message in the email. From the evidence provided, I had no way to independently verify the assertion that the image represented what was actually sent by a particular person at that particular time.

This is true for all data arriving at the downstream client.  In general, it is true for nearly all data in most data warehouses.   For all of this data, I would prefer to have the actual data accompanied with a signature of authenticity that will become invalid if any aspect of that data is inconsistent with its content at the time of the signature.   In addition to confirming this historic data remains authentic, the signature offers the additional benefit of detecting intermediate handling errors such as what may happen when a data item becomes incorrectly associated with data from a different source.   The signature would fail to confirm that combination of data is valid.
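As a minimal sketch of such a signature, the example below binds a value to its source and its key, so that changing the value or attaching it to a different source makes verification fail. It uses HMAC from the Python standard library with a hypothetical shared secret; this is an assumption chosen to keep the example self-contained, and a real scheme would more likely rely on public-key signatures from the source or a third-party authority.

```python
import hashlib
import hmac
import json


def sign_item(secret: bytes, source: str, key: str, value: object) -> str:
    """Sign the combination of source, key, and value (not the value alone)."""
    payload = json.dumps(
        {"source": source, "key": key, "value": value}, sort_keys=True
    ).encode("utf-8")
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_item(secret: bytes, source: str, key: str, value: object, signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign_item(secret, source, key, value)
    return hmac.compare_digest(expected, signature)


secret = b"provider-a-key"  # hypothetical signing key
sig = sign_item(secret, "provider-a", "cust-001", 812)

print(verify_item(secret, "provider-a", "cust-001", 812, sig))  # data as signed
print(verify_item(secret, "provider-a", "cust-001", 813, sig))  # value altered
print(verify_item(secret, "provider-b", "cust-001", 812, sig))  # wrong source attached
```

Because the source and key are part of what is signed, the third check catches the case where a correct value has been associated with the wrong origin during intermediate handling.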

As we discuss data governance policies, we need to consider the governance needs of the entire supply chain of enrichment data. We could move to a model in which every request for enrichment data is delivered as close to the source as possible so that the source can apply its own access control policies to protect the data. In turn, each source will individually supply signed certifications of the validity of each data item returned so that the downstream client can verify the data was not mishandled in the chain. The signed certifications may come from third-party data authentication services.

Deep application of data governance can provide many benefits to both parties: the downstream client is assured delivery of valid data, and the upstream provider retains full control of access to the source data.
