In an older post, I attempted to draw an analogy between today's data practices and the early days of computing and networking, before we realized that some participants might not be benevolent players.
The early days of personal computing and Internet standards were built on an open approach meant to maximize adoption: keep things simple so that participation, including contributing new value by writing new software, required little training or skill. This approach succeeded in popularizing these technologies. I imagine that today's ubiquitous consumer-level computing would not have developed without that initial openness. Underlying that openness was a presumption of civilized, cooperative behavior. We assumed a contract that people would behave themselves.
Over a period of two decades, we were repeatedly shocked to find that someone out there would not be so neighborly. In fact, there was a substantial population of people who readily devoted their energies to exploiting the low barriers to impose unintended costs on the cooperative community. These exploiters were intent not only on gaining unfair advantage but also on inflicting outright harm on the gentler community.
In my post, I attempted to draw a parallel between our current fascination with freely flowing data and that earlier era of freely open technologies. The difficulty in conveying the analogy is that the earlier vulnerabilities involved hardware: computers, processors, memory, network bandwidth, network equipment, and so on. The vulnerabilities resided in the software, but the targets were hardware.
With data, the nature of the vulnerability location and target is fundamentally different. The vulnerability resides in data instead of software, and the target is data instead of hardware. Data needs to be protected because there are highly motivated actors who will eventually find ways to exploit it to gain unfair advantage or to inflict widespread harm on others.
The difficulty with making this case is that we have become so accustomed to thinking about information security in terms of hardware and software. We have developed elaborate systems for protecting computing centers, networks, and software with multiple levels of defense. These defense systems attempt to assure us that the use of the infrastructure is limited to authenticated authorized users who will be traceable to non-repudiable log information so that we can quickly quarantine malevolent actors. With all of this security in place, what can possibly go wrong with the data?
The data itself is what can go wrong. One of the points I made in that post was the possibility of spoofing sensor data to inject data that is not representative of the real world. One could claim that sensors fall within the security practices described above. A secure sensor can place a unique signature on each observation so that we can be sure the sensor is trusted, the observation is current, and the sensor is working properly. In that particular post, the sensors participate in a closed and isolated system with dedicated resources for operating a transportation system. It is conceivable that all of the sensors would have this level of security, although I am skeptical that this is the case.
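To make the idea concrete, here is a minimal sketch of what such a secure sensor might do: sign each observation with a per-sensor secret so a downstream consumer can verify both origin and freshness. The key, field names, and freshness window are all illustrative assumptions, not a real sensor API.

```python
import hmac
import hashlib
import json
import time

# Hypothetical per-sensor secret, provisioned when the sensor is installed.
SENSOR_KEY = b"per-sensor-secret-provisioned-at-install"

def sign_observation(sensor_id, value, key=SENSOR_KEY):
    """Wrap a raw reading in an authenticity envelope with a timestamp and HMAC."""
    payload = {"sensor_id": sensor_id, "value": value, "ts": time.time()}
    body = json.dumps(payload, sort_keys=True).encode()
    payload["sig"] = hmac.new(key, body, hashlib.sha256).hexdigest()
    return payload

def verify_observation(obs, key=SENSOR_KEY, max_age=60.0):
    """Check that the signature matches and the observation is current."""
    obs = dict(obs)
    sig = obs.pop("sig")
    body = json.dumps(obs, sort_keys=True).encode()
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and (time.time() - obs["ts"]) <= max_age
```

Any tampering with the value (or replaying a stale observation past the freshness window) makes verification fail. The catch, as the rest of this post argues, is that this envelope rarely travels with the data beyond the operational system.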
Much of the modern exploitation of data uses data that is not so well trusted. Even if an original observation is secure, the envelope assuring us of the data's authenticity typically does not follow the data to its ultimate use. For example, a messaging service itself may confirm that a particular message came from a particular person's computing device with their current credentials and location. However, data analysts querying that messaging service's data warehouse will obtain query results without this backing information. The analysts assume that the security of the messaging service is sufficient to ensure that only valid data was allowed into the data warehouse. We have to make this assumption because the data we receive lacks the source-identification information that would assure us of its authenticity.
This kind of deferred trust that the other guy will take care of things is analogous to the trust we had in the early days of personal computing. As I got started with computing, I allowed myself the freedom to concentrate on writing some piece of software because I could trust the computer to take care of details such as placing code on disk or executing the code in the order I intended. I didn't concern myself with the possibility that someone might trick the computer into placing different code on disk, or into executing the code in a different order than I intended or had originally confirmed through testing. As an aside, I do recall imagining how such tricks could be accomplished, but I allowed myself to believe we lived in a civilized age where that would not happen. I was mistaken.
In my recent technical experience, I worked with large amounts of data from a large number of widely distributed sources. My work was on what I called hand-me-down data: data from operational systems that those systems no longer needed. The operational system may have been some embedded system where the entire data life-cycle for operations resides inside a single piece of hardware. For operational purposes, there was a high degree of trust that the data was authentic during the period the operational system required it. But for my purposes, I received only the data, without this safety information. I had to trust that the data was authentic, with no evidence to check that authenticity myself.
In my experience, much of my data went through multiple hand-me-down steps. I would receive data from someone who archived a set of data that he got from someone who retrieved it from across a firewall, where someone else had captured the data being archived from the operational system. Starting at the first step, where the data was released from the operational system, the data lacked any kind of authenticity envelope. I had to trust that the chain would not fail.
One of my larger lessons from a decade of experience with daily data collections is that there seems to be no end to the ways this supply chain can surprise us with new failures.
My project involved a lot of data. Although the data was minimized to carry just enough information to feed my algorithms, its volume still challenged our networking and storage capacities. The individual observations did not have a separate envelope with a certificate of authenticity; such envelopes could easily be several times the size of the actual observation. Nor was there an authenticity certification for any package of data delivered to me. The authenticity had to be assumed.
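The size penalty is easy to see with a little illustrative arithmetic. The exact envelope fields below are assumptions (a sensor id, a timestamp, and a hex-encoded SHA-256 HMAC); the point is only that the envelope can dwarf a compact observation.

```python
import json
import hmac
import hashlib

# A compact raw observation: just the value.
raw = json.dumps({"v": 21.5}).encode()

# A plausible authenticity envelope around the same value.
# Field names and sizes are illustrative assumptions.
envelope = json.dumps({
    "v": 21.5,
    "sensor_id": "spd-0412",
    "ts": "2020-06-01T12:00:00Z",
    "sig": hmac.new(b"key", raw, hashlib.sha256).hexdigest(),  # 64 hex chars
}).encode()

print(len(raw), len(envelope))  # the envelope is several times the raw size
```

Multiply that overhead across billions of daily observations and the networking and storage cost of carrying per-observation certification quickly becomes prohibitive.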
Initially, it is easy to accept this implicit trust in authenticity. The architecture, design goals, and implementation processes appeared to assure that the observations delivered to me after three or four hand-overs would be authentic observations of the real world. Too many times to count, experience proved this to be almost childishly naive.
As an aside, a frequent theme of recent news is the appearance of some e-mail message presented exactly as we expect an email to look. The message contains some revelation that is supposed to inform us (usually to shock us). I laugh when I see that. There is nothing in that image that allows me to independently authenticate the email. I may as well be watching a CGI-animated movie. The information is worthless.
Unfortunately, nearly all the information we work with at the data-warehouse level is similarly worthless. The data comes from some source that assures us it has authenticated the data as representative of what actually happened, but there is nothing in the query result itself that certifies this authenticity. As I mentioned regarding my own experience, it would be prohibitively expensive to carry this authenticity certification throughout the entire lifespan of the data. Even with modern technology, it would be a huge burden for a predictive analytic algorithm operating at high velocity to take the time to check a certificate of authenticity for each individual observation, even if such certification were available.
So how do we trust the data? In many of my earlier posts, such as this one on the nature of evidence, I raised my worry that much of our recent investment in data has gone largely untested for trust. By testing for trust, I mean being subjected to the type of scrutiny we encounter in legal court cases.
Imagine some business that relies on data provided by another service, which in turn compiles data from multiple and diverse sources. Imagine that someone brings a legal case alleging some harm involving that business. The business would have to defend itself, and that defense will inevitably focus intense scrutiny on the authenticity of the data. The defense will fare no better than the above-mentioned news reports showing an image of a printed email message: the evidence is worthless for establishing the legitimacy of trusting the data.
In my project, I began to develop an assembly-line approach for progressively refining data by applying categories to the data and then summarizing those categories. My motivation for building this successive-refinement approach was divide and conquer: solve a complicated problem in multiple, easier-to-understand steps. Each step became its own little system with its own store of intermediate results. As I described in this post, the effect was like a manufacturer's supply chain, where the next step uses a product prepared by a separate business in a different location. A later post elaborated on one of the unintended benefits of this approach: compartmentalizing sensitive data.
Based on this experience and the above challenge of certifying the authenticity of hand-me-down data, I see an opportunity for a different business model for data, one that does not involve distributing data at all. This model would keep the data close enough to the source to assure authenticity at a level that can withstand the most diligent scrutiny. Instead of distributing the data to paying customers, the source would sell a service that enriches the customer's data with the source's data.
The source's data would never leave the source's secure premises. The customer can take whatever steps it needs to assure that its data was returned without modification except for the contracted enrichment. This business model involves commerce between different data businesses: the upstream data business sells not to consumers but to other businesses, through contracted agreements for how data will be handled, enriched, and properly guarded. The customer business gets only the enrichment that applies to its data, with the assurance that this enrichment is authentic.
The customer business does not need to handle the more voluminous source data, which includes not only irrelevant enrichment data but also the redundant or historical data the source uses to validate the enrichment. If the customer takes on the task of using bulk data from an external source to enrich its own data, that customer will inevitably demand access to historical or redundant data to cross-check the enrichment. With a supply-chain approach, the enrichment is performed local to the source, assuring that even the historical and redundant data is authentic. The added benefit is that this extra data does not need to be transmitted, stored, or processed by the customer business.
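A toy sketch may help show the shape of this enrichment-at-the-source model. The reference table, contracted fields, and record layout are all hypothetical; the essential property is that the source's bulk data never leaves its premises, and the customer's records come back unchanged except for the fields it contracted for.

```python
# The source's private reference data: it stays on the source's premises.
# Contents here are invented for illustration.
_SOURCE_REFERENCE = {
    "10001": {"region": "northeast", "risk_score": 0.12},
    "60601": {"region": "midwest", "risk_score": 0.08},
}

# The customer contracted only for the "region" enrichment;
# "risk_score" (and any historical data) is never exposed.
CONTRACTED_FIELDS = {"region"}

def enrich(customer_records):
    """Return the customer's records unchanged except for contracted enrichment."""
    out = []
    for rec in customer_records:
        enriched = dict(rec)  # customer data returned without modification
        ref = _SOURCE_REFERENCE.get(rec.get("zip"), {})
        for field in CONTRACTED_FIELDS:
            if field in ref:
                enriched[field] = ref[field]
        out.append(enriched)
    return out
```

Calling `enrich([{"id": 1, "zip": "10001"}])` would return the record with a `region` field added, while `risk_score` and the rest of the reference data remain behind the source's wall. In a real arrangement this function would sit behind the source's own secured service boundary rather than ship as code.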
I imagine this approach must already occur in certain markets. A business that generates source data must see that data as a valuable resource, and the value of that resource depends heavily on how well it can be trusted. Selling data in bulk and letting external parties use it for their private enrichment purposes means the upstream data business loses control over protecting the integrity of its data. Inevitably some customer will misuse the data (for example, using out-of-date data, or data corrupted after it left the source) in a way that damages the reputation of the source data. It makes sense to me that a good way to protect the value of the data is to make sure it never leaves the source. The source would take on the burden of enriching other companies' data with its own. The client company would get only the enrichment it contracted for, and the source would assure that the enrichment is the most authentic possible. Such a business arrangement requires a formal contract to assure trust that the customer's data is properly safeguarded and that expected quality levels are maintained.
The above discussion describes a business-to-business supply-chain model for data, in contrast to the more popularly reported consumer-based models where bulk data is delivered to individual companies to build their own independent implementations of how to use that data for enrichment. The popular approach carries a high risk that some consumer company will damage the reputation of the source data, because there was no way to assure that the data it used was authentic.