A frequent topic on this blog is what I call dark data, a term I coined by analogy to astronomy’s dark matter before I learned that it conflicts with the definition of the same term in common data science practice. My definition refers to using preconceptions or theories to generate a value where a real-world observation is missing or not possible. I compare this to astronomy’s dark matter or dark energy because of the similar assumption that something predicted by theory must exist despite the lack of an observation. We trust this assumption because it comes from trusted and well-tested theories.
As an aside and for clarification, the common data science term for dark data refers to additional data available in a data store that organizations are not exploiting in analysis. For example, a business that analyzes lots of data on sales, inventory, customers, etc., may ignore the data on the on/off times of motion-sensitive lighting in its conference rooms. Although such motion-detector data may be available, the business is not attempting to use it in its analysis and thus may be missing an opportunity for insight. This data is termed dark because it is unused. Unused data is also comparable to astronomy’s dark matter in the sense that there is a lot more of the dark stuff than the regular stuff. I earlier proposed the term “unlit” data for this same data, in keeping with my schema of describing data by the light it presents: unlit data is data that has not been used. Other names for the same concept could include unsolicited data, or perhaps spam data. Spam data is similar to spam email in the sense that it is readily provided information that I ignore despite the fact that I get far more of it than the informative emails. For this post, I’ll continue to use my definition of dark data as model-generated data that substitutes for missing observations, similar to how astronomy uses dark matter or dark energy.
The problem with model-generated data is that it biases our pool of observations with our preconceived notions, and this can make new discoveries of real-world phenomena more difficult. With enough model-generated data, our analytics can end up merely confirming our preconceptions instead of observing something important about reality. In many modern data systems, we attempt to tighten a feedback loop to apply predictive recommendations to operations as soon as possible. Feedback systems can amplify the impact of certain pieces of data, and if this feedback data includes model-generated data, the feedback can magnify our preconceptions and impose them onto the world. Instead of discovering the natural world, we are imposing our views about how the world should work.
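To make the bias concrete, here is a minimal sketch in Python. It does not come from any real system: the measurements, the assumed model value, and the function name `pooled_estimate` are all hypothetical. It only shows the arithmetic of what happens to a simple average when every gap is filled with the value the model expects.

```python
from statistics import mean

def pooled_estimate(observed, n_missing, model_value):
    """Average of the pool after every gap is filled with the model's value."""
    pool = list(observed) + [model_value] * n_missing
    return mean(pool)

# Hypothetical real measurements; their own average is 10.0.
observed = [9.0, 11.0, 10.0, 10.0]
# Hypothetical preconception: the model expects 5.0.
model_value = 5.0

# As the share of model-generated values grows, the estimate
# moves away from the observations and toward the preconception.
for n_missing in (0, 4, 12, 36):
    print(n_missing, pooled_estimate(observed, n_missing, model_value))
```

With no gaps the estimate is the observed average. As the filled-in share grows, the result says less about the measurements and more about the value we assumed in the first place.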
The presence of model-generated data in the data store presents the possibility of constructing new theories that accept the truth of that model-generated data. Initially, we introduced the model-generated data to fill in gaps in our knowledge to explain some detail. But once we accept it as part of the data, we assume it has the same validity as observation data, so that it becomes eligible for inclusion in future theories.
One of the goals of a data project is to discover new truths about the world so that we can make better decisions. Meeting this goal means revising or replacing some preconceived notion. Allowing the data to include model-generated data can prevent us from discovering something new about the world. Instead, our discovery may be something new about our preconceptions.
If there is something new to learn about the real world, then there must be something wrong with our preconceived theories. Where such errors are suspected, the model-generated data becomes similar to rumor. For example, when news reporting includes rumors in the coverage of some event, the news consumer can infer that the rumor may explain the event. We may accept the rumors because they come from trusted sources, but such sources may be fallible. Rumors are inferior to hard evidence. As a result, good journalistic practice is to avoid rumors, although reporters frequently include a rumor with a disclaimer that they were unable to independently confirm the information.
When there is a possibility of a model being wrong, the model-generated data is similar to a rumor: it needs independent confirmation with a real-world observation. When it comes to trying to discover something new about the way the world works, it is better to leave gaps in observations than to fill them in with assumed values. When the point of the project is to discover something new about the world, we have to suspect fallibility in our prior theories. Adding theory-generated data to the data store presupposes the opposite: that there is nothing new to learn about that particular theory. Unfortunately, many analytic algorithms cannot operate without filling these gaps, either explicitly by adding generated data or implicitly by the algorithms inferring the missing data.
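One way to live with algorithms that insist on complete data is to record where each value came from, so that a filled-in value is never mistaken for an observation. The sketch below is only an illustration under my own assumptions: the `Reading` type, the source labels, and the helper names are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Reading:
    value: Optional[float]
    source: str  # "observed", "model", or "missing"

def from_raw(raw):
    """Wrap raw values, keeping each gap (None) as an explicit missing reading."""
    return [
        Reading(v, "observed") if v is not None else Reading(None, "missing")
        for v in raw
    ]

def fill_with_model(readings, model_value):
    """Fill gaps for algorithms that need them, but label the fill as model output."""
    return [
        Reading(model_value, "model") if r.source == "missing" else r
        for r in readings
    ]

def observed_values(readings):
    """Return only the values that were actually measured."""
    return [r.value for r in readings if r.source == "observed"]

raw = [1.0, None, 3.0]                      # hypothetical series with one gap
filled = fill_with_model(from_raw(raw), 2.0)
print([r.source for r in filled])
print(observed_values(filled))
```

The filled series can still feed an algorithm that needs every slot populated, while any later analysis can ask for the observed values alone and see exactly where the gaps were.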
I make this argument in part due to an actual experience. I produced a visualization report that made a very compelling presentation of the observed data. Coincidentally, this presentation also made obvious a major gap in the data. Based on the visualization, it seemed obvious what the missing data should be. The users requested that the report simply fill in the missing part with what seemed obvious. I attempted to defend my choice of leaving the gap in the visualization because the visualization was strictly of observations and the gap was an honest statement of the lack of observations. I also attempted to argue that a real observation could contradict what we think is obvious. Although I resisted the revision to fill in the gap with model-generated data, the compelling nature of the visualization encouraged the users to draw in that missing piece themselves. The reason they asked for the change in the report was to save them the bother of modifying the image to complete the picture.
There is a good reason to leave gaps in observations as gaps in reports. Our initial justification for filling in the gap is to make a more complete picture. However, filling that gap has two serious consequences for future analysis. One consequence is that it eliminates the incentive to consider other possible observations for that gap or to address the reason why the data is missing. The other, potentially more serious, consequence is that it encourages analysts to construct new theories that treat the filled-in data as fact. These other theories may be completely unrelated to the purposes that justified introducing the model-generated data.
I think this latter consequence is illustrated by a recent report from astronomy about a discrepancy between observations and simulations when it comes to the amount of ultraviolet light in the universe. To be honest, I’m confused by the article about whether the discrepancy is too much or too little light, but that is not important for my point. My point is that this discrepancy has nothing to do with gravity but is instead about the abundance of neutral and ionized hydrogen. The discrepancy is so large that it needs an explanation. One of the offered explanations is some kind of contribution by dark matter.
Dark matter is a supposition to provide the missing mass needed to explain the motion of galaxies. Its introduction as evidence was for the purpose of completing a visualization of gravity in the universe. However, we use the term dark matter to acknowledge that we have no observation of the matter that accounts for this mass. With the ultraviolet light problem, we have a new set of observations that need an explanation. The proposed explanation of this new problem assumes that the existence of dark matter is valid enough to account for this observation.
The model-generated data of dark matter biased our subsequent hypothesis discovery for an unrelated phenomenon. This specific explanation supposes something about the rumored dark matter (that it might decay into ultraviolet light). This hypothesis is not possible based on observations alone. It needed the acceptance of model-generated dark matter as evidence to explain a different phenomenon.
This example illustrates a type of circular reasoning that can occur when we mix model-generated data with observation data. Accepting model-generated data as evidence allows us to propose an explanation for something that otherwise cannot be explained, and the proposed explanation in turn validates the existence of the invented data. Dark matter is the source of excess ultraviolet light, and the excess ultraviolet light confirms the existence of dark matter.
As I mentioned before, I am inclined to trust the astronomy community in their educated guesses about the universe. The study of astronomy involves a very large and diverse group of highly trained scientists all looking at the same data, which accumulates slowly enough to allow for collaboration and independent interpretation. There may be a lot of credibility to the claims of dark matter and at least some credibility to its possible role in explaining ultraviolet light measurements.
My concern is that most data projects do not enjoy similarly large and diverse teams of experts with the luxury to carefully scrutinize the data. Many data projects have just one or a few skilled data scientists working in isolation with very large amounts of data arriving very quickly. The industry trend is to fully automate this scrutiny so that we can enjoy the full benefits of new discoveries. For such projects, the introduction of model-generated data can be very dangerous because the recommendations will become biased by our potentially incorrect preconceptions instead of reflecting what is actually occurring in the real world.
Most data projects have more in common with modern journalism than with astronomy. News journalism values independent confirmation of evidence over rumor or hearsay. When journalists introduce rumor as part of the reporting for some news event, they invite news consumers to conclude that the rumor is relevant to the explanation. People will act on the rumors and so create more news events. In effect, reporting rumors transforms the news from being about the world to being about the news reports.
A similar thing happens when we introduce non-observation data into data sets. The conclusions can become more about our theories that generated the data than about the real world.
6 thoughts on “Model generated data contaminates big data”
This recent analysis examines the problem
This is the concern I have been addressing on this blog. The worst example of dark data (model generated data) is when the model data replaces actual observation data. I prefer to leave observation data as measured and then work with that using analysis that considers possible biases. Once we replace observation data with model-generated data, analysis can only learn about the model, not the current reality.