In many of my posts, I have expressed a concern about dark data lurking in big data collections of historical data. I use the term dark data to describe model-generated data, or invented data we assume to be true, used to fill in gaps in the historical record. There is no observational basis for dark data; instead, some model projects its own values based on other observed data. The alternative to dark data is actual observed data, which I describe as bright data when it is well documented and carefully controlled. I admit that most observation data is not perfectly unambiguous, and I call that dimmed data. Even when compared to dimmed data, dark data deserves special scrutiny.
One of the main goals of analyzing big data is to discover new hypotheses. The most highly valued discoveries are hypotheses that surprise us, ideas we never considered before. One of the assumptions in using historical data is that it represents actual observations that will permit discovering new ideas. This is where dark data gets in the way. Dark data propagates old hypotheses and biases into the very data set we want to use to find new hypotheses. Dark data in historical data ends up confirming old hypotheses because it is data generated by those same hypotheses.
Dark data is inevitable; it just needs to be identified and scrutinized carefully to be sure we can continue to trust it.
In these posts, I’m describing a near obsession with this idea of dark data and why we should be concerned about it. I’m just describing my personal experience. I worked with large data sets consisting of both large volumes and multiple unrelated sources. And although the results of my work were frequently appreciated, I was criticized for not automating more of the labor-intensive processes. Talking about dark data is a way to explain why there is some lower limit of manual labor that cannot be automated. We always need to be wary that we can at any time be surprised by new data that invalidates old assumptions or models.
I have also described how this is not unique to data science. I claim this is actually a very old and highly sophisticated field of study that I describe as the history sciences (history, archaeology, anthropology, paleontology, geology, astronomy, etc.). Historical data is history. It has the same data problems as any other form of history: poor observations and invented observations. These disciplines are very labor intensive because they involve the practice of rhetoric: constructing, presenting, deconstructing, attacking, and defending arguments about both observations and theories. In these older disciplines, it is taken for granted that these arguments go on continuously. Big data is historical data is history. It is labor intensive.
But there is a different reason why I’m so suspicious about dark data. Before I was introduced to data science (even before I knew what SQL was), I practiced simulation and modeling. Simulation and modeling is all about inventing data in the context of imagining future scenarios to answer what-if questions. There can be no observational data to compete with this type of computer-generated data. Simulation and modeling generates data in large part because there are no possible observations from future scenarios.
I very much enjoyed my prior experience practicing the art of simulation and modeling. Within that experience, I saw two general attitudes among the advocates of simulation and modeling.
The dominant group felt that, if done properly, simulation and modeling can generate results you can trust, results you can take to the bank. This makes sense because the primary reason to budget for simulation and modeling is to support decision making. Decision makers expect confidence from their simulation and modeling advisers. What sets this dominant group apart from the minority is the insistence that truth is inevitable if the processes were correctly followed.
I saw myself in the minority who needed more than just rigorous adherence to protocol to become confident. I fully agree that the results of the simulations need to be presented confidently. However, I felt that confidence had to be earned case by case. Even if all of the models are constructed and individually tested to high standards, and even if all of the input statistics meet high standards, I still need to be convinced that the simulation as a whole produces something worthy of confidence. Simply adding confidence intervals does not help; it just transfers the asserted confidence to the width of the intervals. I approached simulation and modeling as something that can never automatically make decisions or even automatically advise a decision maker. There is an inevitable man in the loop. Even a rerun of an old, proven model begs for a specialist review of the results. More specifically, decision makers want human advisers who can give their word that the results are good enough for basing a decision.
This is not a claim of some higher ground or superiority. I was sufficiently humbled by the quality of work by the experts in the dominant group. Instead this is a description of my personal attitude toward simulation and modeling. It is an attitude that says simulation and modeling is always a work in progress. This is the self-justification of someone who enjoys the practice as if it were a form of entertainment or play. An attitude that demands to be convinced of the results each time is like the attitude of a person who replays a game repeatedly, trying to increase the score or reach another level. There is an element of fun in my attitude.
My entry into big data and analysis came from the simulation and modeling perspective. The job opportunity was to use my simulation and modeling experience to transform historical data into a form that could feed an existing simulation and modeling tool. Although I regretted not being on the simulation and modeling side, I found new challenges in the data. In particular, the data was far from perfect for what I knew a simulation model expects. Part of the play of scrutinizing simulation models is scrutinizing the input data. Well, now I had enough input data to keep me busy for over a decade.
Although I ended up developing custom tools that organized big data into navigable reports of aggregated multidimensional data, I did so from the perspective of preparing data for use in simulation and modeling to support high-visibility decision making. I was less interested in the technology of the data tools and much more interested in the data itself. The data was the challenge, not the tools.
The challenge was not obvious, because for the duration of the project neither the basic concepts of the simulation nor those of the data sources changed. Over the years, however, the definitions underlying those concepts did change. For example, even though the data was still received in the same old form and through the same channels, the content of that data might now be coming from a new type of device, or from an old device measuring a new phenomenon. My point is that the world keeps changing, and it keeps surprising us. The problem is not that there is a request to change the assumptions; the problem is that new data demands that old assumptions be changed.
So far, I’ve been describing data that could be considered observation data of various levels of brightness or dimness. At the other end, and on a different system, is a simulation tool that generates artificial data for what-if analysis.
Dark data is introduced into the input side of this equation, often without much consideration.
Observation data is troublesome stuff with all kinds of problems. Data may be missing. I want to relate a real debate, described here in analogy to make it easier to understand. Assume there is an in-store receipt with a product number, including the serial number, but the buyer is an anonymous person paying cash. Then this same product and serial number is observed much later on an auction site, but this time with some identification of the seller. Because it is at least highly likely that the seller is the same as the buyer, why not just fill in the buyer’s information with the identification of the seller? The actual debate concerned something more subtle, but this exaggeration shows why it is a bad idea: there is an alternative explanation of the seller selling second-hand or even stolen property. This is dark data, and it is often very tempting to use as a substitute for an observation. Yesterday’s post about who was tipping over smart cars was another example of eagerness to fill in the missing observation.
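To make the temptation concrete, here is a minimal Python sketch of that fill-in step. The record fields and names (serial, buyer, seller) are invented for illustration; the point is that any filled-in value is an inference, not an observation, and should be flagged as such.

```python
# Hypothetical sketch: the tempting "fill in the buyer from the seller" step.
# All field names and values are invented for illustration.

receipts = [
    {"serial": "SN-1001", "buyer": None},     # anonymous cash purchase
    {"serial": "SN-1002", "buyer": "carol"},  # identified buyer
]
auctions = [
    {"serial": "SN-1001", "seller": "alice"},  # same item, seen much later
]

seller_by_serial = {a["serial"]: a["seller"] for a in auctions}

for r in receipts:
    if r["buyer"] is None and r["serial"] in seller_by_serial:
        # This inference is dark data: the seller may be a second-hand
        # buyer, or the item may even be stolen. If we fill the gap at
        # all, the value must be marked as inferred, not observed.
        r["buyer"] = seller_by_serial[r["serial"]]
        r["provenance"] = "dark"
    else:
        r["provenance"] = "observed"
```

Carrying a provenance flag like this through every downstream aggregation is exactly the kind of manual scrutiny the fill-in shortcut tends to skip.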
There are other ways that observation data can introduce difficulties. The data may not be in a form that can be related or compared to other data. The data may overlap with different observations of the same property. The data may present multiple properties that are contradictory (for example, a deceased person may show up to vote).
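A contradiction like the deceased-voter example can be caught mechanically once records from different sources are merged. A small sketch, with hypothetical field names, of that kind of consistency check:

```python
# Hypothetical sketch of a contradiction check on one merged record.
# Field names are invented for illustration.
from datetime import date

person = {
    "name": "J. Doe",
    "date_of_death": date(2010, 5, 1),  # from one source
    "last_voted": date(2014, 11, 4),    # from another source
}

def contradictions(record):
    """Return human-readable conflicts among the record's properties."""
    issues = []
    dod = record.get("date_of_death")
    voted = record.get("last_voted")
    if dod and voted and voted > dod:
        issues.append("voted after recorded date of death")
    return issues

flags = contradictions(person)
```

The check only raises the question; deciding which source is wrong still takes the manual labor the post keeps coming back to.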
On the other hand, we have a lot of confidence in our simulation models, and that gives us confidence in the individual components of those models. We are encouraged to find a way to take advantage of that confidence by reusing those modeling components to transform or filter the observational data. This is often not even an explicit act. Data-handling software is built to specified requirements that come from careful consideration of realistic tests or transformations, and those requirements happen to draw on the same knowledge used to build the tools that create artificial data.
To make big data work, we transform observation data into artificial data that has observational data as its ancestors. This is exactly what simulation and modeling does. From my perspective, big data is just a different way to implement a simulation. It has the same implication in terms of a continuous demand for labor to evaluate the data to be sure it remains relevant to the real world. The danger is that unattended big data projects can drift into fantasy worlds that are useless or even dangerous for decision makers.
When I started building the data solution for my client, I didn’t think of it as a big data project. I thought of it in terms of preparing input data for a simulation model. Input data preparation is a subtopic within simulation and modeling. From the very earliest computer simulations, the biggest problem was that there was always a lot more observational data than a model could handle. Either the data was too voluminous and needed to be reduced, or the data was too messy and needed to be tamed. In the earliest days of simulation and modeling, we solved both problems by reducing observational data into statistical models. This replaced observational data with a representation of the observations that is easier to work with.
Now that we have technologies that do not demand such draconian data reduction, we rely much less on statistical models. But we still need to replace observational data with a representation that is easier to work with. That representation is a model of the observation, not the observation itself. Replacing observational data with an easier-to-use representation of those observations is the domain of simulation and modeling.
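The classic data reduction described above can be sketched in a few lines. This is a minimal illustration, not the author’s actual tooling: a sample of observed interarrival times is collapsed into a one-parameter exponential model, and the simulation then draws from the model instead of the observations.

```python
# A minimal sketch of classic simulation input modeling: replace raw
# observations with a fitted statistical model. The exponential choice
# and the numbers are illustrative assumptions.
import random
from statistics import mean

# Observed interarrival times (minutes) -- the voluminous, messy data.
observed = [2.1, 0.7, 3.4, 1.2, 0.9, 2.8, 1.5, 0.4, 2.2, 1.8]

# Data reduction: the entire sample collapses to a single parameter.
rate = 1.0 / mean(observed)

# The simulation now draws from the model, not from the observations.
random.seed(42)
synthetic = [random.expovariate(rate) for _ in range(1000)]

# The reduction preserves the mean but discards everything else about
# the sample (autocorrelation, trends, outliers) -- exactly the lost
# information regretted later in this post.
```

The fitted model is easier to work with, but every draw from it is model-generated data: dark data standing in for observations.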
Simulation and modeling is labor intensive just like the historical sciences, and for similar reasons. Since the project replaces observation data with representations of those observations, the product is still a representation of historical data. We reserve the word simulation for inventing possible future observations. However, a simulation project inherently involves a sub-discipline of transforming observational data into a representation that is easier to use. Both big data and simulation share that feature.
To use an evolution metaphor, big data and simulation have a common ancestor. I aspire to work in that common ancestor discipline.
As I mentioned already, I stumbled into the big data project through a progression out of simulation and modeling. There was more to this stumble than seeing a need to prepare data for simulation purposes. I was already disillusioned by what I thought was the unreasonable demand to make data fit a simulation. Because the simulation needed small and carefully prepared data, there was a need for severe data reduction. Data reduction means throwing away information. I accept that a particular simple statistical model can adequately describe the observations for the purposes of the simulation. But I regretted the lost opportunity of the information that was thrown away.
I entered the task of working on the data side with a motivation to find a way to reverse the game. Instead of making a model and then transforming the data to be friendly to the model, start with the data and then customize a model to be friendly with the data. Long before I saw the parallels of my project with the larger big data trends, I was thinking that I was customizing models to be friendly with the data.
I came up with a motto of “use all available data”. I asserted two simultaneous meanings for this phrase. Make models that use all available data, and don’t make models that demand data that is not available. This is an argument directed at the simulation and modeling perspective. It just happens to be an exotic restatement of the objectives of big data projects.
However, I did go a step further and exploited these data tools to produce new kinds of what-if analysis, using tables that can transform observational data into some projected future scenario. I set this apart from the observational data, but I was confident that this approach could compete directly with simulation tools for what-if analysis. I’m sure this is not unusual in big data projects, but there remains a conviction that simulation models that demand data reduction offer something that big data techniques can’t match. I tried to show that data technologies can replace older simulation practices in many ways.
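The table-driven what-if idea can be sketched simply. In this illustrative example (the regions and factors are invented, not from the actual project), observed data is joined with a scenario-factor table to produce a projected future, kept clearly separate from the observations:

```python
# Hypothetical sketch of "what-if via tables": project observed demand
# into a future scenario by joining it with a scenario-factor table.
# All names and numbers are invented for illustration.

observed_demand = {"east": 120.0, "west": 80.0, "north": 50.0}

# Scenario table: per-region growth assumptions for this what-if case.
scenario_factors = {"east": 1.10, "west": 0.95, "north": 1.25}

projected = {
    region: demand * scenario_factors.get(region, 1.0)
    for region, demand in observed_demand.items()
}
# `projected` is invented (dark) data by construction, so it lives in
# its own table, apart from the observed one.
```

Because the scenario lives in a plain table rather than inside a simulation model, the assumptions stay visible and easy to argue about, which is the point.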
In short, I keep dwelling on dark data because I recognize it as simulated data, and I come from a discipline that treats simulated data with a good deal of caution and suspicion. In other words, it is a personal thing.