Data Supply Chain: data enrichment close to the source

In an old post, I attempted to draw an analogy between today’s data and the earlier days of computing and networking, before we realized that some participants may not be benevolent players.

The early days of personal computing and Internet standards were based on an open approach that maximized adoption of the technology by keeping things simple, so that people did not need a lot of training or skill to begin participating, including contributing new value by writing new software. This approach was successful in popularizing these technologies. I imagine that today’s ubiquitous computing at the consumer level would not have developed without this initial openness. Underlying that openness was the presumption of civilized cooperative behavior. We assumed a contract that people would behave themselves.

Over a period of two decades we were repeatedly shocked that someone out there would not be so neighborly. In fact, we found that there was a substantial population of people who readily devoted their energies to exploiting the low barriers to impose unintended costs on the cooperative community. These exploiters were intent not only on gaining unfair advantage but also on inflicting outright harm on the gentler community.

In my post, I attempted to draw a parallel between our current fascination with freely flowing data and that earlier era of freely open technologies. The difficulty of conveying that analogy comes from the fact that the earlier vulnerabilities involved hardware: computers, processors, memory, network bandwidth, network hardware, etc. The vulnerabilities resided in the software but the targets were hardware.

With data, the nature of the vulnerability location and target is fundamentally different.   The vulnerability resides in data instead of software, and the target is data instead of hardware.    Data needs to be protected because there are highly motivated actors who will eventually find ways to exploit it to gain unfair advantage or to inflict widespread harm on others.

The difficulty with making this case is that we have become so accustomed to thinking about information security in terms of hardware and software.    We have developed elaborate systems for protecting computing centers, networks, and software with multiple levels of defense.   These defense systems attempt to assure us that the use of the infrastructure is limited to authenticated authorized users who will be traceable to non-repudiable log information so that we can quickly quarantine malevolent actors.    With all of this security in place, what can possibly go wrong with the data?

The data itself is what can go wrong. One of the points I made in this post is the possibility of spoofing sensor data to inject data that is not representative of the real world. One could claim that sensors fall within the security practices described above. A secure sensor can place a unique signature on each observation so that we can be sure the sensor is trusted, the observation is current, and the sensor is working properly. In that particular post, the sensors participate in a closed and isolated system with dedicated resources for operating a transportation system. It is conceivable that all of the sensors would have this level of security, although I am skeptical that this is the case.
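To make the idea of a signed observation concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the shared key, the field names, the 30-second freshness window, and the choice of HMAC-SHA256. A real sensor network would more likely use per-device asymmetric keys, and nothing here describes the transportation system mentioned above.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret provisioned to one sensor (assumption).
SENSOR_KEY = b"example-key-for-sensor-17"
MAX_AGE_SECONDS = 30  # assumed freshness window


def sign_observation(sensor_id, value, timestamp, healthy, key):
    """Wrap one observation in an envelope carrying an HMAC-SHA256 tag."""
    payload = {
        "sensor_id": sensor_id,
        "value": value,
        "timestamp": timestamp,
        "healthy": healthy,
    }
    message = json.dumps(payload, sort_keys=True).encode("utf-8")
    tag = hmac.new(key, message, hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def verify_observation(envelope, key, now):
    """True only if the tag matches, the reading is fresh, and the sensor reports healthy."""
    payload = envelope["payload"]
    message = json.dumps(payload, sort_keys=True).encode("utf-8")
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, envelope["tag"]):
        return False  # payload was altered or signed with another key
    if not 0 <= now - payload["timestamp"] <= MAX_AGE_SECONDS:
        return False  # observation is stale or dated in the future
    return payload["healthy"] is True
```

The three checks correspond to the three assurances: the sensor is trusted (the tag matches), the observation is current (the timestamp is within the window), and the sensor is working properly (the health flag). A spoofed value carried under a copied tag fails the first check.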

Much of the modern exploitation of data uses data that is not so well trusted. Even if an original observation is secure, the envelope assuring us of the authenticity of the data typically does not follow the data to its ultimate use. For example, a messaging service itself may confirm that a particular message came from a particular person’s computing device with their current credentials and location. However, data analysts querying that messaging service’s data warehouse will obtain query results without this backup information. The data analysts assume that the security of the messaging service is sufficient to ensure that only valid data was allowed into the data warehouse. We need to make this assumption because the data we receive lacks the source identification information that would assure its authenticity.

This kind of deferred trust that the other guy will take care of things is analogous to the trust we had in the early days of personal computing. As I got started with computing, I allowed myself the freedom to concentrate on writing some piece of software because I could trust that the computer would take care of details such as placing code on disk or executing the code in the order I intended. I didn’t concern myself with the possibility that someone might trick the computer into placing different code on disk, or executing the code in a different order than I intended or originally confirmed through testing. As an aside, I do recall imagining how such tricks could be accomplished, but I allowed myself to believe we lived in a civilized age where that would not happen. I was mistaken.

In my recent technical experience, I was working with large amounts of data from a large number of widely distributed sources.    My work was on what I called hand-me-down data.   This was data from operational systems that the operational systems no longer needed.    The operational system may have been some embedded system where the entire data life-cycle for operations resides inside a single piece of hardware.   For operational purposes, there was a high degree of trust that the data was authentic during the period of time the operational system required it.    But for my purpose, I received only the data without this safety information.   I had to trust that the data was authentic without any evidence to check that authenticity myself.

In my experience, much of my data went through multiple hand-me-down steps. I would receive data from someone who archived a set of data that he got from someone who retrieved it from across a firewall, where someone else had caught the data as it was being archived from the operational system. Starting at the first step where the data was released from the operational system, the data lacked any kind of authenticity envelope. I had to trust that the chain would not fail.

One of my larger lessons from a decade of handling daily data collections is that there seems to be no end to the new ways the supply chain can surprise us by failing.

My project involved a lot of data. Although each record was minimal, carrying just enough information to feed my algorithms, the volume of data challenged our networking and storage capacities. The individual observations did not have a separate envelope with a certificate of authenticity. Such envelopes could easily be several times the size of the actual observation. Nor was there an authenticity certification for any package of data delivered to me. The authenticity had to be assumed.
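A rough Python sketch shows how an envelope can outweigh the observation it protects. The formats are assumptions chosen for illustration, not the ones used in my project: an observation packed as a 4-byte timestamp plus a 4-byte float, and an envelope consisting of nothing more than an HMAC-SHA256 tag under a hypothetical key.

```python
import hashlib
import hmac
import struct

KEY = b"hypothetical-signing-key"  # assumption, for illustration only


def pack_observation(timestamp, value):
    """Minimal observation: 4-byte unsigned timestamp plus 4-byte float."""
    return struct.pack("<If", timestamp, value)


def envelope_for(observation, key):
    """Authenticity envelope: an HMAC-SHA256 tag over the packed observation."""
    return hmac.new(key, observation, hashlib.sha256).digest()


obs = pack_observation(1_700_000_000, 21.5)
env = envelope_for(obs, KEY)
overhead = len(env) / len(obs)  # tag bytes per observation byte
print(len(obs), len(env), overhead)  # prints: 8 32 4.0
```

Under these assumed formats the tag alone is four times the size of the observation, before adding any sensor identity, certificate chain, or per-hop signatures that a fuller envelope would carry.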

Initially, it is easy to accept this implicit trust of authenticity. The architecture, design goals, and implementation processes appeared to assure that the observations delivered to me after three or four hand-overs would be authentic observations of the real world. Too many times to count, experience proved this trust to be almost childishly naive.

As an aside, a frequent theme of recent news is the appearance of an image of some e-mail message, presented exactly how we expect emails to look. The message has some revelation that is supposed to inform us (usually to shock us). I laugh when I see that. There is nothing in that image that allows me to independently authenticate that email. I may as well be watching a CGI-animated movie. The information is worthless.

Unfortunately, nearly all information we work with at the data warehouse level is similarly worthless. The data comes from some source that assures us that it has verified that the data is representative of what actually happened. There is nothing in the query result itself that certifies this authenticity. As I mentioned with my experience, it would be prohibitively expensive to carry this authenticity certification throughout the entire lifespan of the data. Even with modern technology, it would be a huge burden for some predictive analytic algorithm operating at high velocity to take the time to check the certification of authenticity of each individual observation, even if that certification were available.

So how do we trust the data?   In many of my earlier posts such as this one on the nature of evidence, I raised my worry that much of our recent investments in data have been largely untested for trust.   By testing for trust, I mean being subjected to the type of scrutiny we encounter in legal court cases.

Imagine some business that relies on data provided by another service that in turn compiles data from multiple and diverse sources. Imagine that someone makes a legal case for some harm caused by that business. The business would have to defend itself. That defense will inevitably focus intense scrutiny on the authenticity of the data. The defense will be like the above-mentioned news reports showing an image of a printed email message. The evidence is worthless for establishing the legitimacy of someone trusting the data.

In my project, I began to develop an assembly-line approach for progressively refining data through a process of applying some categories to data and then summarizing those categories. My motivation for building this successive refinement approach was to follow the divide-and-conquer approach of solving a complicated problem in multiple easier-to-understand steps. Each step became its own little system with its own store of intermediate results. As I described in this post, the effect was like a supply chain for manufacturers where the next step would use a product prepared by a separate business in a different location. A later post elaborated on one of the unintended benefits of this approach in terms of compartmentalizing sensitive data.
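The categorize-then-summarize pattern can be sketched in a few lines of Python. This is only an illustration under assumed names: the `latency_band` rule, the sample values, and the in-memory lists standing in for each step’s store are all hypothetical and are not the actual stages of my project.

```python
from collections import Counter


def categorize(observations, rule):
    """Stage 1: attach a category to each raw observation."""
    return [{"value": obs, "category": rule(obs)} for obs in observations]


def summarize(categorized):
    """Stage 2: reduce categorized records to a count per category."""
    return Counter(rec["category"] for rec in categorized)


def latency_band(ms):
    """Hypothetical categorization rule for a latency measurement."""
    return "slow" if ms >= 100 else "fast"


raw = [12, 250, 40, 130, 8]
stage1_store = categorize(raw, latency_band)  # intermediate result, kept by stage 1
stage2_store = summarize(stage1_store)        # consumes only stage 1's product
print(dict(stage2_store))  # prints: {'fast': 3, 'slow': 2}
```

Stage 2 never touches the raw observations; it sees only what stage 1 chose to publish. That separation is what makes each step behave like a supplier to the next.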

Based on this experience and the above challenge of certifying the authenticity of hand-me-down data, there is an opportunity for a different business model for data that does not involve distributing data.   This business model would keep the data close enough to the source to assure authenticity to a level that can withstand the most diligent scrutiny.     Instead of distributing this data to paying customers, the customers would instead buy a service to have the source enrich the customer’s data with the source’s data.

The source’s data would never leave the source’s secure premises.   The customer can take whatever steps it needs to assure that its data was returned without modification except for the contracted enrichment.    This kind of business model involves commerce between different data-businesses.  The upstream data business does not sell this service to consumers.   Instead it sells to other businesses through contracted agreements for how data will be handled, enriched, and properly guarded.    The customer business will get only the enrichment that applies to his data with the assurance that this enrichment is authentic.
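A minimal Python sketch of this arrangement follows. All names and data in it are hypothetical: `SOURCE_DATA`, the `id` key field, the contracted field list, and the SHA-256 fingerprint check are assumptions meant to show the shape of the exchange, not a description of any existing service.

```python
import hashlib
import json

# Hypothetical source-side reference data; in this model it never leaves the source.
SOURCE_DATA = {
    "cust-001": {"region": "north", "risk": "low"},
    "cust-002": {"region": "south", "risk": "high"},
    "cust-003": {"region": "east", "risk": "low"},
}


def fingerprint(record):
    """Stable hash of a customer record, used to detect modification."""
    text = json.dumps(record, sort_keys=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def enrich_at_source(customer_records, key_field, contracted_fields):
    """Runs on the source's premises: return each record plus only the contracted fields."""
    enriched = []
    for record in customer_records:
        match = SOURCE_DATA.get(record[key_field], {})
        extra = {field: match.get(field) for field in contracted_fields}
        enriched.append({"original": dict(record), "enrichment": extra})
    return enriched


def customer_check(sent, returned):
    """Runs on the customer's side: confirm its own data came back unmodified."""
    return len(sent) == len(returned) and all(
        fingerprint(s) == fingerprint(r["original"]) for s, r in zip(sent, returned)
    )
```

In use, the customer submits its records, receives them back with only the contracted fields attached, and compares fingerprints of what it sent against what came back. Records the source cannot match come back with empty enrichment values, and fields outside the contract (here, `region` when only `risk` was contracted) are never returned.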

The customer business does not need to handle the more voluminous source data that includes not only irrelevant enrichment data, but also redundant or historical data that the source uses to validate the enrichment data. If the customer takes on the task of using bulk data from an external source to enrich his data, that customer will inevitably demand access to historical or redundant data to cross-check the enrichment. With a supply chain approach, this enrichment is performed local to the source to assure that even the historical and redundant data is authentic. The added benefit is that this extra data does not need to be transmitted, stored, and processed by the customer business.

I imagine that this approach must be occurring in certain markets. A business that generates source data must see its data as a valuable resource. The value of that resource depends highly on how well that resource can be trusted. The approach of selling data in bulk and letting external parties use that data for their private enrichment purposes means the upstream data business loses control over protecting the integrity of its data. Inevitably some customer of this data will misuse the data (for example, using out-of-date data, or data that has been corrupted after it left the source) in a way that will damage the reputation of that source data. It makes sense to me that a good way to protect the value of the data is to make sure that data never leaves the source. The source would take on the burden of enriching other companies’ data with its own data. The client company would only get the enrichment it contracted for, and the source will assure that the enrichment is the most authentic possible. Such a business arrangement will require some formal contract to assure trust that the customer’s data is properly safeguarded and that expected quality levels are maintained.

The above discussion describes a business-to-business supply chain model for data instead of the more popularly reported consumer based models where bulk data is delivered to individual companies to build their own independent implementation of how to use that data for enrichment.   The popular approach has high risks of some consumer company damaging the reputation of the source data because there was no way to assure that the data used was authentic data.


9 thoughts on “Data Supply Chain: data enrichment close to the source”

  1. Pingback: Big Feedback: system dynamics when big data meets big audience | kenneumeister

  2. Pingback: Information supply chain is source of intelligible data for analytics | kenneumeister

  3. Pingback: Exposing model generated information for public scrutiny | kenneumeister

  4. The government’s Freedom of Information Act (FOIA) may provide an existing model of enrichment at source, albeit using largely manual processes. In a FOIA request, the requester must be specific about what information to release. Ideally, the government responds by delivering everything that matches the specific request but redacts everything that is not relevant to that request. The returned data undergoes a quality control check to be sure that the released information matches the request and excludes everything not related to the request. That control also will exclude relevant data that is exempted from FOIA disclosure.

    The government’s FOIA model was conceived around a paper-based model of request and response (although the results may be returned electronically via scanned PDFs). A modernization of the FOIA around big data concepts would be similar to the above discussion of enriching at the source.

    Both my discussion above and the FOIA require the requester to be very specific about what data needs enrichment by the data owner. Also both allow iterations where future requests can be derived from information discovered in previous requests.

    The point is that both avoid the technically simpler solution now commonly employed in big data systems: bulk release of an entire source’s data.

    • The FOIA analogy is interesting. In government, the entity making the FOIA request must identify itself so that the government has an address to deliver the results. The nature of the requests and the identity of the requester usually are sufficient for the FOIA officer to have a pretty good idea of the intent of the queries. With subsequent iterations, the FOIA officer will learn where the research is going and what its conclusions will be.

      The FOIA model for protecting health care data would similarly expose the clinical researchers’ intent for using the data. The FOIA model demands specificity in order for the source to process the query. That specificity will dramatically narrow the possibilities for the intent of the researcher. Also, the source will have first view of that query result as he checks it for completeness and absence of privacy spillage. The source could conclude the research results before the researcher has access to the data. The probability of the source scooping the researcher increases with iterations or recurring queries.

      This exposes a hidden agenda in the debates about anonymizing of patient data. The researcher needs a broad bulk transfer of anonymous data in order to keep his research intentions and progress secret until he is ready to publish. Perhaps the reason why there has been less investigation into FOIA-type requests for the owner of the patient data to process is an implicit requirement that research secrecy trumps patient privacy.

  5. Pingback: Big data can re-identify de-identified data | kenneumeister

  6. Data enrichment at the source suggests employing big data technology at the source rather than central data warehouses. The original problem solved by data warehouses (later data lakes) was that the source lacked resources to retain their data indefinitely long after they had operational justification for the data. The data would be discarded. Data warehouses solved the problem by providing an alternative to total erasure of the data.

    The data warehouses took in salvaged data, data that is no longer relevant to the original purposes it was created and used for. Over time, analytics grew on the opportunity for mining this salvaged data, in particular to discover new questions and answers not previously anticipated when the data were first used. In my opinion the enthusiasm for big data is missing a key point: the analysts are working from salvaged, hand-me-down data for new purposes.

    Big data scientists are asking questions of this salvaged data instead of presenting their questions directly to the source. I think this represents a very significant bias. It is likely that if they asked the source directly for these answers, they would get different answers than they would get from studying the salvaged data.

    An analogy is someone trying to find the personality of a particular resident by studying the household dust in the residence. The household dust includes sloughed-off dead skin cells from that person. The skin cells contain DNA that could (eventually) be used to find out a lot about the person. But sloughed-off skin cell DNA is unlikely to provide eyewitness testimony for an event that occurred recently. Today’s data scientist would try very hard to uncover that testimony from the data he has (DNA of dead skin cells) instead of asking the person directly for his eyewitness account. The latter task is inconvenient because the person is not living in the data cluster. To get that kind of answer, one needs to arrange to meet the person and talk to him.

    Another example is this blog, which I’ve been using as a kind of diary to record my stream of thinking on whatever interests me over time. Generally what has interested me is what I wrote previously, so there is a kind of continuity in the entire body of the blog site. Reading this blog site may give an impression of what I am like as a person. I am certain there are bots that harvest the blog contents for purposes unrelated to building indexes for search engines. Many of my posts have been political in nature as I’ve explored a fictional dedomenocracy and Dedomenocratic Party. There is probably some big data project somewhere (probably many) that is combining all the content of all the blog sites to uncover the political leanings of bloggers and how they may influence the upcoming election. If they are interested in finding my political leaning they may attempt to answer that question by studying the content of this blog. The algorithms may determine that I’m attempting to disrupt the political debate with some new ideas. The only ones to come to that conclusion would be the algorithms and the associated analysts. The data they are working from is salvage data from a blog that is relatively reluctant to talk about present-day politics. In fact, this blog is an escape from talking about present-tense politics. My point is that if someone is interested in learning my politics, they need to talk to me and be very patient about building up trust to get me to open up. I’m pretty sure the answer from a direct query will be quite unlike what would be uncovered from analytics on my blog posts (especially this blog, which contains a lot of noise from excessively long posts).

    The above blog post suggests a new data ecosystem where the source retains possession of all its past data and instead offers query services to external entities. With this ecosystem, all questions will always go back to the source. Perhaps 99% of the questions may be answered by querying past data, but the source will have an opportunity to see 100% of the questions being asked. The source will be able to intercede and explain that past data will be irrelevant to that question. The source may offer new data not previously offered, or simply decline to answer the question.

    In this new ecosystem, the downstream analyst has no other choice but to accept the present answer from the data source. The data for that question always remains securely in possession of the source and the source is offering the best query results he is willing to provide. The result is more tedious and more frustrating to the analyst, but it has the benefit of obtaining the most relevant answers to his questions. If the source provides an answer that answer will have the benefit of coming directly from that source where the source had an opportunity to think about the actual question and deliver an answer that best meets the objectives of that question.

  7. Pingback: Economy of compensated opinions in a dedomenocracy | kenneumeister

  8. Pingback: Reverse data governance, protecting source integrity | kenneumeister
