In my earlier post, I described a multiple-step approach to managing data through successive aggregation, where each aggregation had its own storage, processing, reporting, and labor resources. At first I called this a data life-cycle, or a data assembly line. After some more thought, I think a better name for the concept is an Information Supply Chain.
I made an analogy to a manufacturing chain: extracting raw materials from the earth, refining that material into bulk stock, milling that stock, manufacturing it into a product, finishing the product, and finally distributing and selling the product to a consumer. In that analogy, each step was its own business with its own capital resources, processes, regulations, and trained labor. The process works efficiently because each business can focus on its core competency.
This is an analogy to how I designed my project. Even though I had a defined set of data, I allocated different resources for storage, processing, reporting, and labor (tasks) to different steps of processing that data. Just like the manufacturing scenario, each step works with a different level of refinement of the data, where refinement means mapping data to categories and aggregating those categories. Each step would take the aggregated data from the previous step, prepare its own version of the aggregated data, and then deliver it to the next step. Each step would have its own set of tools to manage its process and to address issues unique to that step. Through scheduling of the work week, staff had specific time periods to work on particular steps, when they could focus on the specific concerns of that step.
From the start of the project, these steps used different resources, often completely different servers, to do their tasks. Coincidentally, this design proved to be very robust because a single fault would stop only one part of the chain; the other parts could continue to work, storing results until the fault was corrected. However, the design was driven more by a divide-and-conquer approach to solving complex data quality issues. It was simply easier to comprehend and solve the issues when the processing was segmented and isolated from the rest of the system.
As I mentioned in the previous post, I was challenged to defend this approach for its costs compared to a simpler architecture with a single storage and reporting system for all of the steps. I explained some of my arguments in that post. I could have added the ten-year track record of its robust design permitting it to coast through some major disruptions.
For today’s post, I want to discuss another argument concerning compartmentalizing the processes and the data.
In recent years there have been a number of cases of deliberate or accidental unauthorized disclosure of sensitive data. That has led to increased attention to protecting against the so-called insider threat. Any individual inside an organization with access to information could betray the organization’s trust by disclosing that information. Recent technological advances permitting consolidation of data into a single repository have expanded the population within the organization who potentially have access to this information. This larger population with potential access increases the likelihood of an unauthorized leak. In addition, if a leak does occur, there is a large population of potential suspects; the initial investigation of a leak may consider the entire organization suspect.
One of my early motivations for dedicating different resources to different processes was to help isolate disruptions due to information assurance and security issues. If a particular software or system configuration needed to be adjusted, the change could often be isolated to just the subset of servers that used that configuration. This also allowed configuring different operating systems to shut down services not needed on one server even though the same service might be required on another. At the time, we had separate physical machines. Now, with virtual or cloud-based computing, this benefit can be realized without dedicating separate hardware to each step. Many of the steps required only short periods of server utilization, and the steps are staggered in time, so they could effectively share the same physical resources.
However, my larger motivation was to compartmentalize the information. The more sensitive data can be isolated to earlier steps that provide less sensitive (and more relevant) aggregate data to later steps. Labor at each step can concentrate on just the aggregate information needed for that step. All irrelevant data is on completely different systems or data stores. As it turned out for this project, the processing of the most sensitive information required the least amount of labor. The remaining processes that required more labor used less sensitive, summarized categories of data.
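The step-to-step handoff described above can be sketched in a few lines of Python. This is a minimal illustration, not the project’s actual code; the record fields and categories are made up. The point is that the sensitive detail records stay on the early step’s own data store, and only the category aggregate is delivered downstream.

```python
from collections import Counter

def aggregate_step(detail_records, categorize):
    """Reduce sensitive detail records to category counts.

    Only the returned aggregate leaves this step; the detail
    records never travel to the next step's systems.
    """
    return dict(Counter(categorize(r) for r in detail_records))

# Illustrative (made-up) detail data: customer records with identities.
details = [
    {"customer": "A. Smith", "region": "North"},
    {"customer": "B. Jones", "region": "North"},
    {"customer": "C. Brown", "region": "South"},
]

# The downstream step receives only the non-sensitive aggregate.
handoff = aggregate_step(details, lambda r: r["region"])
print(handoff)  # {'North': 2, 'South': 1}
```

Each step in the chain repeats this pattern at its own level of refinement, consuming the previous step’s aggregate and producing a further aggregate of its own.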
This approach was not very innovative. I basically copied the old way of working with data across different departments of a business. It was done that way mostly because the limitations of earlier technologies required each department to build its own data system. Data would still need to be shared across the organization, and this sharing was done through prior agreements defining the summary reports to exchange.
For example, a marketing department may have its own systems to manage marketing activities but need information from the sales department. The sales department would generate summary reports from their system and deliver that summary to the marketing department. This approach offers two inherent compartmentalization benefits. First, the data delivered to marketing is customized to specifically meet the marketing needs. Second, the marketing department is not burdened with handling unneeded and sensitive sales information. The marketing department can do their job with data that is strictly for marketing, and the sales department with data that is strictly for sales. The two can share information through formal agreements about what needs to be shared.
In the past decade, many organizations have replaced these isolated solutions with consolidated solutions using a single data store, or enterprise data warehouse. This initiative promised a number of benefits: cost savings, faster data sharing, and more real-time access to data for decision makers. The downside was that this central data warehouse contained all possible data, including sensitive data that should be accessible to only a select few. Compared with older approaches that stored information on separate, isolated systems, sensitive data exposure became a larger problem when all the data was in a single system.
The technical solution to protecting this sensitive data is to use access control technologies on the data and on separate sets of reporting tools. The access control could limit access to specific tables, or to particular rows, columns, or cells within a table. The reports could be managed through an approval process that assures that the reports used by a certain group access only the data appropriate for that group.
With all these controls, it is still possible for someone to accidentally encounter sensitive data. As a result, it became necessary for the entire staff to have the training and authorization to be trusted not to mishandle this information. Even if their duties did not require use of that data, they could be exposed to it.
The consolidated data warehouse approach with data-level access controls has an additional problem: the need to access non-sensitive aggregations of sensitive data, as in my example of the marketing group needing a comprehensive summary of sales data. For example, the marketing group needs a precise count of customers in different geographic regions, but the addresses identifying certain customers are denied to users in the marketing group. They may run a query to count customers by region and get a result that is incomplete because they didn’t have read access to specific records. They didn’t need to know the identities of the customers, but they did need an accurate count of them. If they ran the query, they might not be aware that their results were incomplete. Alternatively, there may be confusion when sales and marketing reports present completely different numbers for the same query. There are technical solutions to this problem, but sometimes the problem is not discovered until after it has already caused confusion, when a technical solution is costly to implement.
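The silent undercount is easy to demonstrate. The sketch below simulates row-level security by filtering restricted rows before aggregation; the data and field names are invented for illustration. The sales user and the marketing user issue the same count, yet get different answers, and nothing in the marketing result signals that records were withheld.

```python
# Made-up customer records; "restricted" marks rows with sensitive identities.
customers = [
    {"region": "North", "restricted": False},
    {"region": "North", "restricted": True},   # denied to marketing users
    {"region": "South", "restricted": False},
]

def visible_rows(rows, can_see_restricted):
    """Simulate row-level security: drop rows the user may not read."""
    return [r for r in rows if can_see_restricted or not r["restricted"]]

def count_by_region(rows):
    counts = {}
    for r in rows:
        counts[r["region"]] = counts.get(r["region"], 0) + 1
    return counts

sales_view = count_by_region(visible_rows(customers, can_see_restricted=True))
marketing_view = count_by_region(visible_rows(customers, can_see_restricted=False))

print(sales_view)      # {'North': 2, 'South': 1}
print(marketing_view)  # {'North': 1, 'South': 1} -- silently incomplete
```

The supply-chain design avoids this trap: an earlier step that is allowed to read every record computes the counts and hands the complete aggregate downstream, so marketing never aggregates over a filtered view.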
There is a need for role-based access control to data, but defining roles for a large organization is complex. As in my example, sales and marketing may each have their own internal data, where they would divide their internal staff into data readers and data writers, for example. The problem comes with defining specialized roles for groups outside the department that need some limited access to its data. The marketing team needs some kind of read access to the sales data, but not the same kind of read access that the sales team has. The warehouse and manufacturing departments may also require access to sales data, but with different kinds of limitations. There can be an explosion of roles needed to keep each group constrained to just the information it needs.
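The explosion is multiplicative, not additive, which a quick back-of-the-envelope sketch makes concrete. The department names and access kinds below are illustrative assumptions, not a real role catalog; the point is only the arithmetic.

```python
from itertools import product

# Hypothetical departments and access variants for illustration.
departments = ["sales", "marketing", "warehouse", "manufacturing"]
access_kinds = ["reader", "writer", "limited-reader"]

# One candidate role per (data owner, consuming department, access kind),
# excluding each department's access to its own data, which its
# internal reader/writer roles already cover.
cross_roles = [
    (owner, consumer, kind)
    for owner, consumer, kind in product(departments, departments, access_kinds)
    if owner != consumer
]

# 4 owners x 3 other departments x 3 access kinds = 36 candidate roles,
# from only four departments and three access variants.
print(len(cross_roles))  # 36
```

Every new department or new flavor of limited access multiplies this count again, which is why an exhaustive role directory is workable only in the stable, well-understood enterprises described next.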
Some large enterprises are understood well enough, and are stable enough, to manage an exhaustive master directory of such roles. It becomes more difficult for smaller projects subject to more dynamic changes. For these projects, managing such extensively precise roles is impractical.
One approach is to consider the mission as a unit of compartmented information. All of the levels of abstraction or detail of the data are available to everyone in the mission, but this data is protected from access by anyone outside the mission. This allows for a very simple set of roles that relate only to the internal participants of the mission. But it does require everyone to be qualified to handle any of the mission’s data, even if their job duties don’t require handling the more sensitive data. Everyone in the mission needs the same level of trust no matter what their job role is. This could have the beneficial effect of building a stronger sense of shared teamwork. The downside is a higher eligibility requirement that increases costs or may result in lengthy vacancies.
Another approach is to break the project up the old-fashioned way, with separate systems to handle different levels of data, and to assign staff only to the systems that hold the data necessary for their job duties. This permits a small set of easily managed roles for access control on the specific subsystems. The isolation of different types of sensitivity is done by limiting log-in access to the separate systems containing that information.
With the recent availability of virtual machines and cloud computing, this kind of multiple-step approach can be cost-competitive with a single-system approach. The different virtual machines can share the same hardware.
I prefer a multiple-step approach for managing a data project, where the different steps are like a manufacturing supply chain with separate businesses handling specific steps. It allows for specialization at the different levels of data abstraction and for optimization of quality control and protection of information. I also think the better-defined and constrained jobs and duties are easier to staff and manage. This approach does add some burden because of the need for so many separate servers, and this was a significant concern in the past. But with modern virtual machines and cloud computing, it is less of an issue.