In an earlier post, I discussed the idea of at least thinking about this blog as a kind of business. This thinking was not in terms of generating revenue, but simply as motivation to investigate the business aspects of running one. In particular, I discussed my largely inept explorations into marketing. One thing I readily admit is that this blog is never going to be marketable, because the actual posts are not composed for mass consumption. Instead of writing for an audience, I write for the selfish purpose of exploring my thoughts in a way that is only possible when I see them written down. One consequence of this selfish approach is that I bury the lede: the most important point tends to be at the very bottom of the post.
My last post was a good example. The entire post is upside down. I didn’t realize what the conclusion was until I got to the end. Once I reached a conclusion, I ended the post. Although a conclusion by definition belongs at the end, the reader would have no idea it was coming. A work written for a reader, especially a busy reader who has never encountered my writing (or me) before, would need this point clearly stated at the start.
The last post should have led with the statement that modern attitudes toward data are a lot like the attitudes toward everything else that we rejected in older times. Despite the modern intolerance for disappointment in virtually everything else in life, even in marriages or in our fellowships in a community, we are unusually tolerant of disappointment from data. We let data get away with things we would never tolerate in any other context.
As I think back on that observation, I wonder whether we have it completely backwards. The tolerance for disappointment should instead be granted to everything but data. Our intolerance for disappointment should be directed exclusively at data.
My earlier posts presented the ideal of bright data. Bright data is well documented and well controlled, both accurate and precise. Bright data is very rare. I used the metaphor of light to describe the less ideal data we usually have available: dim data, dark data, unlit data, and so on. These terms describe different ways that data can become more ambiguous or open to interpretation. Real-world data is often very useful and valuable despite the fact that it is rarely bright data. My observation is that we often accept this possibility of disappointment when it comes to data even though that disappointment could mislead us.
I think that many of the success stories resulting from exploring data are a direct result of this tolerance for disappointment. We approach data projects with unbounded eagerness to see where data might lead us. We readily embrace new discoveries from data and promote them. Even when these discoveries come from sophisticated tools that include rigorous testing, such as statistical tests, we retain a willful suspension of doubt by assuming the underlying relevance of the actual data used.
My frequent target of suspicion is data generated by models. Model-generated data can be subjected to rigorous testing, but those tests will pass even when the data tells us nothing about the world that was not already built into the model. Model-generated (or model-modified) data passes tests, but it tells us more about the model than about anything new in the world.
I discussed model-generated data in many previous posts where I described it as dark data. My usage of dark data is not standard. The more commonly accepted definition of dark data is data that has been collected but ignored, and thus never used or verified. I prefer to call this type of data unlit: information that happens to be around but that no one has verified the meaning of. I prefer to use the word dark in the way cosmology uses it for dark matter and dark energy: something asserted to exist even though it has never been observed. Dark data is data that is generated by models instead of by observations.
A recurring theme in my posts is the concept of data science as a form of historical science. I described this relationship in two views: data science as a subset or as a superset of historical science. The subset view says that data scientists approach data the same way that researchers in history, archaeology, paleontology, or crime-scene investigation approach evidence. Data is nothing more than evidence of what happened in the past. The temperament of the data scientist is to approach data with a great deal of suspicion and skepticism. Data is evidence, and evidence can disappoint us. The superset view is that all of the historical sciences work with evidence in the same way, in terms of scrutiny and argumentation. In all of these disciplines, the evidence includes non-digital data.
In any case, the historical sciences and the data sciences are fields that expect disappointments. Disappointments are inevitable. The job of the historical sciences is to seek out that disappointment: the advocate attempts to defend against the possibility of disappointment, while the opposition seeks to expose it.
I contrast this with what I call the present-tense sciences, which I described as all of the human activities that involve interacting with the real world. Present-tense science has no tolerance for disappointment. The goal is to never make a mistake. Because we are actively participating in the world, disappointments will injure us.
Disappointments are to be expected from historical data when the data is not exactly relevant to what we want to know. An example is the discussion of evolution, where we want to relate all life into a single tree of branches from ancestor species. Often missing is the observation we want: the actual ancestor species at the node of a branch, the common ancestor. Over time, we keep finding our earlier suppositions disappointed by some new observation. When addressing historical data, disappointment is always a possibility. Because historical data is data about a long-past event, this disappointment cannot physically injure us, although it can injure our reputations.
I could distinguish my two sciences in terms of attitudes toward disappointment. Present-tense science has no tolerance for disappointment. Past-tense science accepts the possibility of disappointment in order to better manage it. Data science is past-tense science.
Returning to my original point of this post, our attitudes about disappointment are misplaced. Although those living in the present cannot tolerate disappointment because it can cause injury, they should accept that the possibility of disappointment is inescapable. On the other hand, while the past-tense sciences are right to expect disappointment, they should be more intolerant of it.
In many of my posts, I complain about how eager we are to act on what are essentially queries of historical data. To me, this eagerness demonstrates a deliberate acceptance of the consequences of a disappointment. If the query results in a disappointment, we will seek out a corrective action, but ultimately we pardon the project for using faulty data.
In contrast, we usually do not pardon humans when they disappoint us. We demand high qualifications for humans to contribute to an action. The possibility of disappointment is a disqualification for a human.
Thus there are two separate outcomes for evidence of disappointment: retire the disappointing human or pardon the disappointing data system.
I argue this is backwards. Humans are living beings that are exquisitely adapted to interact with the real world. Part of that faculty is the ability to learn from mistakes; in fact, humans have knowledge and skills that can only be earned through prior disappointments. We should expect disappointment from humans. This expectation is what makes it possible to give a person a chance to prove something new. What makes this grant chancy is the possibility of disappointment. While I agree that a disappointment may involve some form of compensatory retribution, I don’t think it should disqualify the person from future opportunities. For humans, there is a higher likelihood of success based on what they learned from past disappointments.
For humans, past poor performance can suggest the possibility of future successes. People learn from their mistakes.
This is not possible with data projects. Even with sophisticated algorithms like machine learning or predictive analytics, these projects cannot recognize and learn from disappointments.
When faced with disappointments, we should be quicker to retire a machine than to retire a human. The modern approach is the opposite: we select young talent and old algorithms.
When I work with data, I am very aware that I’m working with the leftovers of a past reality. I enjoy exploring data to find patterns that suggest new ways of looking at the world. However, I also have a naturalist’s perspective on the real world, the perspective that delights in observing actual reality with as few preconditions or expectations as possible.
Modern times have a fascination with the type of science grounded first in theory and only secondly in observations to support the theory. This fascination pervades all of life, from how we do our jobs to how we interact with the people around us. Everything has to fit some grand scheme. We are medievalists, but instead of insisting on a grand scheme based on scripture, we insist on one based on scientific theory. Observations have to fit this preconceived scheme.
In my mind, a naturalist seeks to observe without preconditions about how those observations are to be seen. The observations should be recorded as faithfully as possible, with documentation and controls to be precise and accurate, but also completely free of preconceptions. The ideal data that I call bright data is simply “this is what was observed.”
I recall one day seeing a falcon repeatedly swooping close to the ground where a tree squirrel was foraging. Despite multiple swoops near the squirrel, the squirrel made only the slightest movement to avoid the bird and then went about its business as if nothing was happening. I can report this as an observation. A squirrel in that same spot would run to the nearest tree if I showed up at the edge of the yard, even though I never bother them at all. I can report this as an observation too. I place a high value on an observation even if there may be an error in it. There is something to learn from the observation, either about the objects observed or about the subject making the observation. This is an unusual observation: the bird should have caught the squirrel, or the squirrel should have fled to a safer spot. To me it is sufficient to simply record the observation and move on.
However, when confronted with an unusual observation, we often embellish it with added information to give it more credibility or to defend it from criticism. For example, I could suggest that the falcon appeared to be not fully adult and thus not yet capable of hunting, or I could suggest that this was just like the case where a mockingbird would harass a squirrel for whatever reason. This is an example of making the observation fit a theory. I have no expertise to estimate the maturity of the bird or the meaning of its flight pattern. Adding this information would make the observation easier to comprehend by contaminating it with preconceived notions.
Most of the data available in our data stores has undergone some degree of transformation to make it fit our expectations. We filter out outlier data. We smooth other data to eliminate what we feel are measurement errors or noise. We associate data based on supposed relationships that may not always be correct. Large data stores hold large amounts of information that is not a reliable representation of what actually occurred in the real world.
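To make this concrete, here is a minimal sketch (the readings are invented for illustration) of how a routine outlier filter and a smoothing pass quietly erase part of what was actually observed:

```python
# Hypothetical raw sensor readings; the 95 looks like a spike.
raw = [10, 11, 10, 95, 11, 12, 10, 11]

# Step 1: filter "outliers" beyond a fixed threshold (a judgment call).
filtered = [x for x in raw if x < 50]

# Step 2: smooth with a 3-point moving average to suppress "noise".
smoothed = [
    sum(filtered[i:i + 3]) / 3
    for i in range(len(filtered) - 2)
]

print(raw)       # the spike is present in the observation
print(smoothed)  # the spike is gone; the store now matches our expectations
# Whether 95 was a measurement error or a real event, the stored
# data can no longer tell us.
```

Each step is defensible on its own, yet the result is a record of what we expected to see rather than of what happened.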
In my posts about predictive analytics, I tried to make the case that the information inside the data may be unreliable. Even with sophisticated analytic algorithms, the underlying information may mislead us into making poor decisions. Although I am a fan of analytic algorithms, I am concerned that being too permissive of disappointment by the data opens the door to negligence or fraud.
Our attitudes about tolerating disappointment from human actors and from data are backwards. We should be more tolerant of humans, who are capable of learning, and less tolerant of data, which can never learn.