I am finally getting around to following up on an earlier post, where I suggested I should elaborate on what I mean by exposing models for data-analysis use. On this blog, I have frequently discussed the distinction between model-generated data and observed data. I gave the latter the name bright (or dim) data to capture the idea that it exposes something objective about the current reality. In contrast, I assigned the name dark data to model-generated data.
Although my usage differs from the more accepted data-science definition of dark data, I continue to prefer my definition. Assigning the term dark data to model-generated data is very satisfying. Initially, I chose the term to correspond to cosmology’s use of dark in dark matter and dark energy. Both dark matter and dark energy are model-generated data that substitute for missing observations. Both are examples of dark data. I later began categorizing data into a taxonomy using the metaphor of light. In this taxonomy, dark nicely suggests the absence of light, matching the fact that we are using a model to substitute for what we failed to observe directly. I am also intrigued by the connotation that dark shares with the dark arts: the mystical or the supernatural world. Data from models is an observation from an imaginary world where our theories are unquestionably true.
I don’t appreciate having model-generated data in my data stores contaminating observational data. At the very least, I want to segregate the model data in a different schema so its non-observational status is always obvious.
The real problem with model-generated data is that it rarely gets populated in databases. Model-generated data can be computed quickly from mathematical equations, either in query formulas or in post-processing procedural code. Since the data is easy to compute, it has long been standard practice in data projects to use algorithms to compute the values only when needed so that they do not occupy space in the data store. This economy was essential in the early years when storage was expensive.
Lately, the cost of storage has become less of an issue. At the same time, security practices recommend keeping human-readable source code in secluded development tiers. This makes the models inaccessible from the production tier that hosts the observational data. The algorithms still produce the model-generated data as needed, but the production-side analyst has no access to study those models. Even with access to the models, it would still be tedious to reconstruct the model-generated data to make it available for comparison with observations.
I had these concerns in my last job. I solved the problem to my satisfaction by pre-computing all of the possibly needed values from the model and storing the results in a permanent table. Then, when I needed the model data, I joined the two tables instead of computing the values on the fly. Say, for example, I had a model that generated a result based on two variables X and Y. I would query the data for all possible combinations of X and Y that could be used in the calculation. I then stored these values in a table with additional columns representing the computed results. Whenever I needed the model data, I would join the two tables, matching the X and Y values appropriately.
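A minimal sketch of this materialize-then-join pattern, using an in-memory SQLite database. The model function, table names, and values are all invented stand-ins, since the original describes the design only in the abstract:

```python
import sqlite3

# Hypothetical model: the post does not give the actual formula,
# so this stands in for any deterministic function of X and Y.
def model(x, y):
    return 2.0 * x + 0.5 * y

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Observed data carrying the model's input variables X and Y.
cur.execute("CREATE TABLE observed (id INTEGER PRIMARY KEY, x REAL, y REAL)")
cur.executemany("INSERT INTO observed (x, y) VALUES (?, ?)",
                [(1.0, 10.0), (2.0, 20.0), (1.0, 20.0), (2.0, 10.0)])

# Materialize the model: precompute a result for every distinct (X, Y)
# combination seen in the observations and store it in its own table.
cur.execute("CREATE TABLE model_results (x REAL, y REAL, result REAL)")
combos = cur.execute("SELECT DISTINCT x, y FROM observed").fetchall()
cur.executemany("INSERT INTO model_results VALUES (?, ?, ?)",
                [(x, y, model(x, y)) for x, y in combos])

# At analysis time, join the two tables instead of computing on the fly.
rows = cur.execute("""
    SELECT o.id, o.x, o.y, m.result
    FROM observed o
    JOIN model_results m ON o.x = m.x AND o.y = m.y
    ORDER BY o.id
""").fetchall()
```

Because the model outputs now sit in an ordinary table, any production-side query can reach them with no access to the model’s source code.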
The design was unpopular. As mentioned above, it departs from historical practice that considers it wasteful to dedicate storage space to values that can be computed as needed. In addition, it is likely that only a tiny fraction of the combinations of X and Y would ever need calculation, because queries usually sample only a small part of the possible space. The precomputation of all observed parameters represented an additional CPU load. Even though I had sufficient storage space and processing capacity, this was hard to justify to the developers or even the product owners.
The advantage of this approach is apparent primarily to the middle-man: the analyst running queries on the data. As part of the due diligence of data analysis, the analyst needs to be able to drill down to see the underlying data for a particular analysis. As noted above, modern security practices hide the software details from the production tier, so if model-generated data only existed in procedural code, the analyst would never be able to see its contributions. With the model data materialized in tables in the data store, the analyst is free to query the model data specifically and study its contribution to the final analysis.
I quickly found another use for the materialized model data. I could explore the entire universe of relevant model data matching my observation data. One use I found was to generate dashboard summaries that provided quality-control opportunities by comparing the model-generated tables from one day to the next. One simple example: on typical days there was a certain distribution of computed values for the distinct X and Y values observed. When that distribution of computed values changed, I would start an investigation to explain the change. That investigation often led to discovering a problem with the data, or perhaps even with the model itself. Materializing the model data offered a powerful quality-control opportunity. Occasionally, the quality control of modeled data led to root-cause discovery of something new about reality.
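The day-over-day comparison could be sketched as follows. The binning scheme, threshold, and sample values are my own illustrative choices, not anything specified in the original workflow:

```python
from collections import Counter

def result_distribution(results, bin_width=5.0):
    """Bin a day's computed model results so two days can be compared."""
    return Counter(int(r // bin_width) for r in results)

def drifted(dist_a, dist_b, tolerance=0.2):
    """Flag an investigation when any bin's share of the total shifts
    by more than `tolerance` (an arbitrary threshold for this sketch)."""
    total_a, total_b = sum(dist_a.values()), sum(dist_b.values())
    for b in set(dist_a) | set(dist_b):
        share_a = dist_a.get(b, 0) / total_a
        share_b = dist_b.get(b, 0) / total_b
        if abs(share_a - share_b) > tolerance:
            return True
    return False

# Invented sample values: two similar days, then a day whose computed
# results have shifted into entirely different bins.
yesterday = result_distribution([7.0, 7.5, 12.0, 12.5, 7.2])
today_ok = result_distribution([7.1, 7.4, 12.2, 12.6, 7.3])
today_bad = result_distribution([22.0, 23.5, 24.0, 22.8, 23.1])
```

In practice the distributions would come from a query over the materialized model table rather than from literal lists, but the comparison logic is the same.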
The primary purpose, however, was to use the model data as intended for the assigned analysis task. The analysis required some query to combine both observed data and model-generated data to come up with some result. The final presentation to a decision-maker usually only required a final result of this combination. That presentation would be just as effective if the model-generated data were computed as needed in procedural code.
It was my experience that the audience of the presentation would frequently request an explanation for a particular result in the analysis. Typically, the result would be some kind of graph of many data points where one data point would catch the decision-maker’s attention. Often when that happened, the presenter would request an opportunity to get back with an answer later. I preferred being prepared to query the relevant data during the actual presentation. My presentation was prepared with extensive prebuilt drill-down capabilities to go as deep as needed to find a good explanation for any data point. Because the tables included all of the observed and model-generated values for that analysis, I was able to retrieve any data value that contributed to a specific result.
Above, I mentioned the waste of precomputing all possible values when few would ever be needed. The problem is that it is impossible to know in advance which values will catch the audience’s attention. Moreover, querying precomputed values would frequently be faster than computing the values on the fly, especially if the computation involved mapping and summarizing large data sets.
That quick unrestrained drill-down capability was not essential to the presentation, but it was a capability that the audience welcomed.
I am describing a common practice of presentations with extensive slice-and-dice or multidimensional navigation. The distinction is that some of the dimensions of the multidimensional database contained precomputed model data. I materialized the model data so that it populated dimensions peer to the observed data, making it all available for slicing and dicing. The same query process for exploring the data worked identically for model-generated data and observed data because both resided in tables. I could recognize the model-generated data by the names of the dimensions.
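To make the peer-dimension point concrete, here is a small sketch in which an invented model-derived column sits alongside an observed column in the same fact table; the column names and values are purely illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A fact table where a model-generated dimension is stored as a peer
# of an observed dimension. All names and values are invented.
cur.execute("""CREATE TABLE facts (
    region TEXT,           -- observed dimension
    model_risk_band TEXT,  -- model-generated dimension, materialized
    amount REAL)""")
cur.executemany("INSERT INTO facts VALUES (?, ?, ?)", [
    ("north", "high", 10.0),
    ("north", "low", 5.0),
    ("south", "high", 7.0),
    ("south", "low", 3.0),
])

# The same slice-and-dice query works for either kind of dimension;
# only the column name reveals which values came from a model.
by_band = cur.execute(
    "SELECT model_risk_band, SUM(amount) FROM facts "
    "GROUP BY model_risk_band ORDER BY model_risk_band").fetchall()
by_region = cur.execute(
    "SELECT region, SUM(amount) FROM facts "
    "GROUP BY region ORDER BY region").fetchall()
```

The query engine makes no distinction between the two dimensions; the naming convention is what keeps the model-generated values from masquerading as observations.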
As a data clerk, I am continually concerned about whether my data is an accurate representation of what occurred in the real world at the time of the observation. The most valuable contributions I provided my clients were those that I describe as discovered hypotheses: unexpected and new questions to investigate about what is happening right now. In the project I was working on, surprising discoveries were frequent and an important part of the value added from this project.
My preference as a data clerk would be that all of my data would be bright data. Bright data are well-documented, well-controlled, trusted observations. Most data have some problems, and I call that data dim. I would prefer not to have any model-generated data (dark data). Dark data is a substitution for ignorance. The model provides computed results for missing observations.
If something can be computed, it could have been observed if we had the right sensors for it. In an earlier post, I gave the example of coastal tide tables that provided reliable predictions of tide levels. With tide tables in well understood coastal areas, a tide gauge would be redundant. Despite that redundancy, I prefer to have the tide gauge over the tide tables.
In that example, the tide tables explicitly materialize the predicted tide levels and store the values in tables. This gives the data clerk an opportunity to recognize the deficiency that this data is not observed data. With this knowledge, the data clerk can ask whether there can be another source for this data. In the case of tide tables, there may be funds available to acquire and operate a tide gauge. In other examples, perhaps some new sensor technology becomes available that provides observations that previously required models to estimate. The analyst can then revise the analysis to use that new data instead of the model-generated data.
In contrast, if the models were implemented only in code for computing on the fly, that code would not be accessible in the production tier. If a new source of bright data became available, another design iteration would be required to replace the computations with queries of observed data. With the model-generated data occupying data tables, the switch to a different table of corresponding observations is possible on the production tier by simply running a different query.
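The tide example can illustrate the switch. Assuming both sources are materialized as tables with the same columns (the table names and values below are invented), replacing dark data with bright data is just a change of query target:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Model-generated tide predictions, materialized as a table.
cur.execute("CREATE TABLE tide_table (hour INTEGER, level REAL)")
cur.executemany("INSERT INTO tide_table VALUES (?, ?)",
                [(0, 1.2), (1, 1.5), (2, 1.9)])

# A newly available observed source with the same shape.
cur.execute("CREATE TABLE tide_gauge (hour INTEGER, level REAL)")
cur.executemany("INSERT INTO tide_gauge VALUES (?, ?)",
                [(0, 1.1), (1, 1.6), (2, 1.8)])

def tide_levels(source):
    """Because both sources occupy tables with identical columns,
    switching from dark data to bright data is a different query,
    not a code change."""
    assert source in ("tide_table", "tide_gauge")  # guard the identifier
    return cur.execute(
        f"SELECT hour, level FROM {source} ORDER BY hour").fetchall()

dark = tide_levels("tide_table")     # predictions from the model
bright = tide_levels("tide_gauge")   # observations from the gauge
```

No design iteration through the development tier is needed; the downstream analysis simply reads from the observed table once it exists.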
My latest post provides an allegory for how this works, using a recently published article. That post discusses an article that shows model data materialized in a data visualization. In this specific case, the model data was the computed location of dark matter. The visualization overlaid this depiction on observed data of galaxies. As I discussed in that post, this gives the impression that the model data and the observed data are peers: that the model-generated distribution of dark matter was actually observed as dark matter.
I don’t have access to the actual data. All I have to work with is the visual image presented as part of the popular article. I consider this image to be a data store where the data entries are the individual pixels. The model-generated dark-matter data is intermingled with bright-data observations. In this case, a transparent, ghostly overlay distinguishes the two types of data. The visual presentation of data in the form of an image provides an analogy of materializing the model: the materialized data takes the form of pixels in an image.
Materializing the model for query access (in this case, by visually inspecting the image) provided the opportunity to explore alternative sources for this data. The article supposed that dark matter is a diffuse cloud of invisible quantum-scale particles. I proposed an alternative view that the entire volume may represent a single particle of galactic proportions. I called this a dark-nothing hypothesis.
The dark-nothing hypothesis is my proposal for replacing the dark-matter hypothesis. Having the dark matter materialized in the image gave me the opportunity to see where the manufactured data resides and to propose an alternative data source. In this case, that data source is another theory: I’m merely replacing one source of dark data with another source of dark data.
As I emphasize in my discussions on dark data, I prefer to have real observations (bright data) to replace dark data. But in this case, I argue that such observations are impossible because humans will never be able to sample the empty space outside of our stellar neighborhood to see its properties such as how it restrains its embedded matter and its transiting light.
I think it is a useful realization that bright-data observations can never replace this dark data. The impossibility of relevant observations is not unusual in data science. Like its peer disciplines in the historical sciences, data science must accommodate the fact that we will never have access to all (or even most) of the data we would like to have. We have to build our theories and arguments around both the available evidence and an acknowledgement of the missing evidence.
While I am personally intrigued by the concept that there may be different natures of nothingness at different scales, I am not defending it as a scientific hypothesis. My point here is that multiple models (theories) can provide data to substitute for the absence of bright-data observations. Having the dark-matter data materialized in the image allows me to distinguish it from the observed data. This presents the opportunity to seek another source of data to replace the dark data. Although my proposal replaces one type of dark data with another, this other type contributes something important: it recognizes the inherent human inability to ever obtain bright-data observations. Humans will never be able to experiment with empty space at the scale of galaxies, nor even with any concept of empty space outside of our local stellar neighborhood.
While dark data such as dark matter is a substitute for missing observations, it is fundamentally different from a substitute for observations that are missing only because of rectifiable human shortcomings. Both dark matter and dark nothing are substitutes for something we will never have the competence to observe. I’m inclined to have the two theories cancel each other out and leave us with the fundamental truth of perpetually missing data. We know we are missing something. This missing something is something we will never find.