In my last post, I described a difference in big data approaches between data warehouses (including those implemented with NoSQL databases) and forensic data tools such as the SIEM (security information and event management) tools used for IT systems.
Data warehouses invest heavily in imposing business rules and constraints on data before admitting it into the data store. The data warehouse goal is to obtain the cleanest and most trusted data possible, data that is fully consistent with related data. The ideal end state of the data warehouse is to present a single version of truth, eliminating arguments about the data.
In contrast, large data projects for forensic purposes, such as SIEM tools, readily ingest all data as it becomes available. The resulting data store holds data as measured, with varying levels of credibility. The store will include overlapping or redundant observations that may be mutually incompatible. The data available for forensic analysis is cleaned only minimally, just enough to assure persistent storage, with few if any cross-checks against rules or relational constraints.
In a classic business scenario, a group of records may represent orders, customers, and shipments. For the data warehouse, each of these is validated and collectively cross-checked so that every shipment record is matched to its order and every order record is matched to its customer. For forensic data stores, all shipment, order, and customer data is admitted without cross-checks: shipments may not have matching order records, and orders may not have matching customer records.
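To make the cross-check concrete, a warehouse-side validation pass might look something like the sketch below. The orders, customers, and shipments tables and their columns are my own illustrative names, not any particular system; a forensic store would simply skip this step and admit the rows as they arrive.

```sql
-- Hypothetical warehouse-side cross-checks (assumed table and column names).
-- Shipments that reference no known order:
SELECT s.shipment_id
FROM shipments s
LEFT JOIN orders o ON o.order_id = s.order_id
WHERE o.order_id IS NULL;

-- Orders that reference no known customer:
SELECT o.order_id
FROM orders o
LEFT JOIN customers c ON c.customer_id = o.customer_id
WHERE c.customer_id IS NULL;
```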
The title of this blog site expresses my interest in the discovery of new hypotheses. In earlier posts, I described the introduction of models to supply missing data (my term is dark data) or to reject actual observations (forbidden data). I used the term dark data to allude to the cosmological concepts of dark matter and dark energy, where accepted theory suggests something exists despite the fact that we have no observations of it. I suggested this occurs frequently in the data cleaning process, where we invent data records despite their absence from the accepted source.
An example of dark data in a data warehouse could be the creation of a phantom customer record to establish referential integrity with an otherwise orphaned order record. An example of something forbidden in a data warehouse may be a shipment that has no record of an order. Data warehouse practice treats both of these scenarios through business rules: the rules manufacture the missing data or reject the forbidden data, given appropriate justification and approval for the rules. Both scenarios may also produce exception events that trigger investigations into why a certain piece of data had to be invented or rejected. But in either case, the data available in the data warehouse is clean and consistent with the rules. The data warehouse presents the single version of truth, which includes the single accepted solution for missing or unacceptable data.
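As a sketch of how such a rule might be implemented (the exception_log table and all column names are my own invention), a cleansing step could log the exception while the orphan is still visible and then manufacture the phantom customer in the same pass:

```sql
-- Sketch: log an exception for every orphaned order, then manufacture a
-- placeholder (dark data) customer so referential integrity holds.
-- Table and column names are illustrative, not from any real system.
INSERT INTO exception_log (entity, entity_id, rule_applied, logged_at)
SELECT 'order', o.order_id, 'phantom customer created', CURRENT_TIMESTAMP
FROM orders o
LEFT JOIN customers c ON c.customer_id = o.customer_id
WHERE c.customer_id IS NULL;

INSERT INTO customers (customer_id, customer_name, is_placeholder)
SELECT DISTINCT o.customer_id, 'UNKNOWN CUSTOMER', TRUE
FROM orders o
LEFT JOIN customers c ON c.customer_id = o.customer_id
WHERE c.customer_id IS NULL;
```

The forbidden case works the same way in reverse: a shipment with no matching order is diverted into the exception log instead of being admitted to the warehouse tables.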
In much earlier posts such as here, I explored how using models to invent substitute data for missing observations, or to reject real but implausible observations, could degrade the prospects for discovering new hypotheses. In effect, the prior assumptions bias the data itself to confirm those assumptions rather than to reveal something that may challenge them. The goal of hypothesis discovery is to challenge pre-existing assumptions with new theories based on observations.
In my experience, I had a case where we estimated the load on a particular component. The assumption was that the component could never carry more than 100% of its capacity, but in this case we regularly observed that it did. A reasonable business rule would be to replace the excessive measurement with one that is exactly 100% of the capacity and to generate an exception message that this occurred. With such a rule, the analysts would have good clean data to work with, data showing that the component never did anything supposedly impossible. For this project, we instead followed a principle of presenting all data as measured and thus confronted the analyst with this contradictory information. Eventually, this motivated an investigation that discovered a mechanism that enabled the component to carry a larger load under certain conditions. This was a discovered hypothesis: our assumptions were incorrect. The discovery would have been hampered if we had imposed the business rule to truncate the value to match the assumed limit. In this example, a parallel data project did apply this rule and denied its analysts the insight that motivated our investigation: their single version of truth from cleaned data turned out to be wrong.
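Expressed as a data-store rule, that cap might look something like the sketch below; the load_measurements table, its columns, and the exception_log table are illustrative assumptions, not the actual system.

```sql
-- Sketch of the "never more than 100% of capacity" business rule.
-- Table and column names are illustrative. The exception is logged first,
-- while the raw value is still visible, then the measurement is capped.
INSERT INTO exception_log (entity, entity_id, rule_applied, logged_at)
SELECT 'load_measurement', m.measurement_id,
       'measured load capped at rated capacity', CURRENT_TIMESTAMP
FROM load_measurements m
WHERE m.measured_load > m.rated_capacity;

UPDATE load_measurements
SET measured_load = rated_capacity
WHERE measured_load > rated_capacity;
```

On the project I described, the equivalent of the UPDATE was deliberately left out: the over-capacity readings stayed in the data the analysts saw, and that is what prompted the investigation.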
Recently there have been several reports of so-called disruptive companies that have found new business models that rapidly and effectively compete against long-established businesses. In the 1990s, we called the same disruptions thinking outside the box. The concept of disruption is not new: any discovery of a new hypothesis that later proves to be more powerful than the older hypotheses is inherently disruptive. I don't know all of the details of the genesis of these disruptive businesses, but I suspect some may be based on just lucky guesses. Even though my discussions focus on discovering hypotheses from observed data, a discovered hypothesis is still a guess that might get lucky. The availability of observed data to suggest a hypothesis simply improves the odds.
As an aside, I still treat a discovered hypothesis as a guess that needs testing either through a formal experiment designed specifically to test the hypothesis or through a gamble to put it into practice and see what happens. Even when supported by data, either of these two methods may prove the discovered hypothesis to be wrong.
While guesses can be wrong, the evidence of disruptive businesses shows that all businesses should pay attention to the value of making guesses that challenge assumptions. Businesses need tools that allow them to discover new hypotheses that may contradict their established business models.
The discovery may lead to adopting the changes before others do. For example, the recent exploitation of smart phones and automated billing for rides could have been discovered and exploited by established taxi companies to at least blunt the impact of newcomers such as Uber or Lyft.
Alternatively, the discovery may lead to improving the business model to be more attractive despite the disruptive alternative. In the same example, the taxi companies could also have lobbied earlier for laws to prevent the start-ups from gaining momentum in the first place. (This controversial response is a realistic part of the ride-sharing scenario, but it could have started much earlier.)
There is reason to respect the need for a capability to discover disruptive hypotheses, and a need to remove impediments to such efforts. As I mentioned earlier, the imposition of business rules on data hampers the hypothesis discovery process by biasing the data to conform with existing assumptions about the business.
Certainly, at some point we need to impose business rules on data. Key executive-level decisions need to be made on the best and cleanest data available. Those decisions need the rapid consensus made possible by the single version of truth that business rules and data governance provide. However, these decisions should ideally rely exclusively on tested hypotheses, not on discovered hypotheses or guesses.
At an operational or planning level, there is a need for discovering new hypotheses (guesses). These lower-level groups can then subject the hypotheses to tests in order to qualify them for executive decision making. But these groups need access to dirtier data to see observations that may contradict the business models. They need access to data that presents multiple versions of the truth.
For example, a business may start off with a model of selling directly to consumers. Over time, one of those customers may decide to resell the products. (One vendor described this as a grey market, to distinguish it from an illegal black market.) The imposition of a rule insisting that an order must have exactly one customer presents a problem when the company starts discovering complaints from non-customers who are clearly in legitimate possession of the product. A business rule for a help-desk call center, for instance, may be to reject any complaint from anyone who is not on the list of known customers. This rule may be satisfactory in a business sense, but it would blind the company to the fact that a grey market exists and may be growing. From a business sense, there is a need to restrict access to the help-desk in order to focus its resources on legitimate customer complaints and to avoid divulging potentially sensitive support issues to competitors. But also from a business sense, the company may miss the opportunity to coordinate more closely with the grey market and improve this channel in a mutually beneficial way.
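With the raw call records retained instead of rejected, spotting that grey market can be as simple as a query like the sketch below; the helpdesk_calls table and its columns are assumptions of mine for illustration.

```sql
-- Sketch: complaints from callers who are not known customers.
-- A strict help-desk rule would reject these calls; keeping them queryable
-- is what reveals that a grey market may exist and may be growing.
SELECT EXTRACT(YEAR FROM hc.call_date)  AS call_year,
       EXTRACT(MONTH FROM hc.call_date) AS call_month,
       COUNT(*)                         AS unknown_caller_complaints
FROM helpdesk_calls hc
LEFT JOIN customers c ON c.customer_id = hc.caller_customer_id
WHERE c.customer_id IS NULL
GROUP BY EXTRACT(YEAR FROM hc.call_date), EXTRACT(MONTH FROM hc.call_date)
ORDER BY call_year, call_month;
```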
For back-end data systems such as data warehouses, which receive data from remote sources closer to the observations, there is a need for access to both the cleaned and the uncleaned data. Over my experience, I developed what I later called a data supply chain or assembly-line model, where the ultimate goal is a data warehouse of clean data with all business rules applied. The supply-chain model, however, has intermediate stores of progressively cleaner data, where the very first store accepts all data as-is, similar to forensic tools such as those in the SIEM market.
In the data supply chain, each intermediate store uses data technologies that have state-of-the-art storage and retrieval capabilities. At each store, the analysts have comparable capabilities for querying and analyzing the data. They are not hampered by more primitive technologies simply because their efforts occur before the final cleaned data is available.
The analysts at intermediate steps of the supply chain have different objectives. In the middle of the supply chain, their focus is on cleaning the data and investigating any exceptions that occur. The exceptions are problems with the data, problems that make the data dirty. The problems may be missing data (needing substitute placeholders) or unrealistic data (needing modification to make it more realistic). However, the analysts at these intermediate stages will have the same efficient query access to the problematic data as the final data warehouse analysts will have for their clean data. The efficient investigation of problematic data can lead to hypothesis discovery.
The concept of a data supply chain of intermediate but well supported data stores incorporates a principle of delayed binding of business rules. The unclean data is immediately made available in some type of storage that enables efficient query of the data.
In my past project, that store was a database engine. The initial and intermediate steps resided in separate database schemas, which permitted isolating certain steps to certain users. The final step constituted the final data store of the best available data for the domain-specific analysis. The design of the system provided the same technologies for all of the preceding stages. For example, many implementations of ETL (extraction, transformation, and load) involved SQL queries from a table in one schema to a table in a different schema, so each stage had access to the same effective technology.
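A single stage of that pattern might look roughly like the sketch below, with raw and clean as assumed schema names and illustrative columns; each stage is just a query from one schema into the next.

```sql
-- Sketch of one supply-chain stage as schema-to-schema ETL (names assumed).
-- Rows that pass this stage's rules are promoted to the next schema; the
-- originals remain queryable in the raw schema with the same SQL tools.
INSERT INTO clean.orders (order_id, customer_id, order_date, order_total)
SELECT r.order_id, r.customer_id, r.order_date, r.order_total
FROM raw.orders r
JOIN clean.customers c ON c.customer_id = r.customer_id  -- referential rule
WHERE r.order_total >= 0;                                -- simple sanity rule
```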
A frequent complaint about this approach is that the databases contain dirty data. This is true. The data supply chain exploits effective database technologies to perform the data cleansing and business-rule operations. In contrast to writing custom software to perform the cleansing, the database approach avoids developing and optimizing algorithms for working with bulk data retrieved from a data store. More importantly, the database approach offers the opportunity to exploit readily available reporting and ad-hoc querying to monitor and adjust the data cleansing process. A database can efficiently cleanse dirty data within one of its schemas. The distinction is that there is one final schema, available to end users, that holds the best clean data they need to perform their duties investigating the single accepted version of truth.
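That monitoring can itself be plain ad-hoc SQL. For example, a quick check of how much raw data has not yet been promoted through the rules, again using the assumed raw and clean schemas:

```sql
-- Sketch: monitor the cleansing backlog with an ad-hoc query (names assumed).
-- How many raw orders have not yet passed the rules into the final schema?
SELECT COUNT(*) AS orders_not_yet_promoted
FROM raw.orders r
LEFT JOIN clean.orders c ON c.order_id = r.order_id
WHERE c.order_id IS NULL;
```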
In an earlier post, I proposed the term Crowd Data to describe a broad area of interest within Big Data. I proposed Crowd Data to describe the unstructured, as-is data that is collected from outside sources. Like crowds, the data are unruly and lack consistent interpretations. This is the data that enters the data supply chain. This raw Crowd Data is what makes disruptive discoveries of new hypotheses possible, because this data carries no bias from preconceived assumptions.
The data supply chain incrementally applies business rules and handles exceptions until the final data set is available for trusted application of automated decision making using machine learning or other forms of predictive analytics. The final data is large enough to earn the title of Big Data.
Crowd Data is as-is data received from various sources. Big Data is cleaned data, consistent with all business rules, and thus eligible for committing business decisions based on its analysis.
My previous post defended the parallel need for both data warehouses and forensic databases (such as network management SIEM databases). This parallel could be described as a need for both Crowd (dirty) Data and Big (clean) Data. Crowd Data is the easily accessible data before the imposition of business rules. Crowd Data presents opportunities for multiple versions of the truth. Crowd Data provides the opportunity to discover disruptive hypotheses.