In earlier posts, I described a taxonomy of human activities: the present-tense sciences, the past-tense sciences, and the persuasive arts.
Present tense science is the scientific methods (plural) of gathering new observations of the world where those observations are well documented and well controlled. The goal of present tense science is to produce excellent data to add to the store of knowledge. Present tense science is motivated by hypotheses that need testing. The experiments are documented and controlled using other hypotheses that are trusted enough to earn the title of theories or laws.
My definition of present tense science is more general than scholarly science. I place in this same category all human activities involved in addressing the immediate real world. Present tense science includes farmers, engineers, system operators, sales people, marketers, surgeons, and law enforcement officers. Every human activity facing the immediate present world is at least partly a practice of present tense science, even though there is a variety of skills and diligence in this endeavor.
I contrast present tense science with past tense science, which I equate to the study of history. I model this idea on the great disciplines of history, archaeology, paleontology, geography, etc., where generations of scientists examine the same evidence to discover new hypotheses or challenge old ones. Past tense scientists work with recorded data (past observations). They use that data to challenge or strengthen old hypotheses, or they use that data to discover new hypotheses. As with present sciences, my definition is more general than the scholarly sciences. I include in past-tense science criminal investigators, business planners, medical doctors who make diagnoses, etc.
A discovered hypothesis is something that provides some satisfactory explanation of the historical data. That basis of evidence gives the hypothesis some credibility. New hypotheses motivate present tense science to build new experiments where the new results can give support to or cast doubt on the hypothesis.
The two sciences work together as opposing teams: one trying to make observations of the present, and the other trying to make sense of the past. In reality, the same person is usually doing both. Many published scientific works include both the collection of new data and an analysis of that data either to test existing hypotheses or to propose new ones. These are separate activities, usually separated in time, with the historical science part probably taking up most of the time needed to write the paper. Even when there is one objective of publishing a paper, I suggest it is useful to distinguish the two sciences involved. In any case, there are also lots of published works that strictly work with historical data without adding new observations. Past observations offer never ending opportunities to challenge or discover hypotheses.
Mediating the two sciences is a third area of human endeavor that I call the persuasive arts. Persuasive arts are about the future. They are about making decisions: choosing which options to add to history and which to add to missed opportunity. The options added to history will constrain how the future will proceed. The future will always surprise us, but it is at least causally bound by our past choices. The persuasive arts include politicians, policy makers, religions, judges and juries, and the arts. I include the arts because they seek to influence our thinking even when they offer no particular plan about how we should decide. Persuasion is about convincing people and influencing their thinking and decision making, with the outcome that something will be added to history and the alternatives will be added to missed opportunities.
In this universe of present-tense science, past-tense science, and persuasive (future-focused) arts, where does data science fit?
I consider data science to be firmly in past-tense science. It is always working with historical data. I consider a measurement made even a few seconds earlier to be historical data. The opportunity for an observation is irretrievably in the past.
The actual creation of the observation is the domain of the present-tense science. Even if the measurement is automated, the technology for making that observation was a present-tense science activity. From the perspective of data science, the best data is observation data that is well documented and well controlled. A data scientist is interested in that documentation from the present-tense scientist: what exactly is being observed and what could possibly go wrong with that observation. The data scientist has an obligation to scrutinize the data in the historical record to be certain of what the data actually represents.
The persuasive arts place a different burden on the data scientist. The persuasive arts seek to leverage the data in their arguments to make decisions that will in some sense determine the future. These are decisions with real consequences. The persuasive arts want to make the best choice (at least from their perspective). To the extent that their arguments rely on hypotheses from data science, the data scientist has some responsibility to ensure that the hypotheses are supported by highly trustworthy data.
How does this play out in a big data system? Big data systems have automated collections of data that come from various sensors making observations. These various data sources have differing levels of quality in terms of documentation and control of what exactly is being measured. The big data system itself has various steps of handling this data to get it into the data store. Once the data is in the store, there are various analysis and reporting tools available to end analysts that will provide information to the decision makers.
As a unit, the big data system begins to appear to be the book with all the answers. Persuasive artists seek to make decisions or persuade others to make their desired decisions. They are highly motivated to seek out and promote some book as being the authoritative source of all answers. Big data appears to be that book. In other times or cultures, the book may have been religious texts. Either can serve the same purposes for persuasion. Thus, big data looks somewhat like a holy document. It offers all the answers.
How does a data scientist proceed to come up with a new hypothesis? My answer is that he doesn’t. The data scientist’s primary obligation is to make sure the data is right.
When I started working on my past project, I did not seek out new hypotheses. The project had a very narrow objective: to produce a defined product to be used for another project. My project used a lot of different types of data to produce this defined product. It was my job to make sure that data was acceptable for that product or at least identify where the data may be deficient.
To perform my job, I created a large number of reports to identify patterns and to trace the patterns to the raw data and ultimately to the data source. I needed these tools to do my job of verifying that today’s data is still appropriate for use in the desired product.
When I found a new pattern, my reaction was that something must be wrong with the data. My priority was to eliminate any possible fault with the data. My first task was to explore any possibility that my system may be mishandling the data. I cross checked with other data and with log data to confirm that things appeared to be working correctly. My next task was to explore any possibility that there may have been a problem with delivering data to my system: the data received may not be the same as what was meant to be transmitted. Then my task was to coordinate with the source of data to explore any possibility that the observations could be wrong or possibly measuring something other than what we thought they should be measuring.
The observation of anything unexpected was a trigger to suspect the data. Only after exhausting all known possible explanations of bad data would we begin to suggest this might be a new hypothesis. With big data, there are seemingly endless possibilities to suspect the data. As I mentioned previously, we have various levels of confidence in different types of data, and much data falls in the category of perpetual suspicion.
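The order of suspicion described above (my own system first, then delivery, then the source, and only then a possible hypothesis) can be sketched in a few lines of Python. This is only an illustrative sketch, not the tooling from my project: the `Check` class, the `triage` function, and the field names such as `rows_loaded` and `checksum_sent` are all hypothetical stand-ins for whatever cross checks a real system would have.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Check:
    """One stage of suspicion: a name and a test that returns True on a fault."""
    stage: str
    finds_fault: Callable[[Dict[str, object]], bool]


# Hypothetical checks, ordered from the closest stage (my system) to the farthest (the source).
CHECKS: List[Check] = [
    # Did my own system mishandle the data? Cross check the store against log data.
    Check("own system", lambda p: p["rows_loaded"] != p["rows_in_log"]),
    # Was the data received the same as what was meant to be transmitted?
    Check("delivery", lambda p: p["checksum_received"] != p["checksum_sent"]),
    # Does the source confirm it measured what we think it measured?
    Check("source", lambda p: not p["source_confirms_measurement"]),
]


def triage(pattern: Dict[str, object], checks: List[Check] = CHECKS) -> str:
    """Suspect the data first; call it a hypothesis only when every check passes."""
    for check in checks:
        if check.finds_fault(pattern):
            return f"suspect data: {check.stage}"
    return "candidate hypothesis"


# Example of an unexpected pattern whose delivery checksums do not match.
unexpected = {
    "rows_loaded": 1000,
    "rows_in_log": 1000,
    "checksum_received": "abc",
    "checksum_sent": "abd",
    "source_confirms_measurement": True,
}
verdict = triage(unexpected)
```

The point of the ordering is that a pattern is labeled a candidate hypothesis only by elimination. In practice the list of checks is never complete, which is why so much data stays under perpetual suspicion.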
In my experience, the big data reports were for the exclusive use of tracking down problems with the data. The reports identified unexpected patterns that triggered this investigation. These exact patterns sometimes became the basis of new hypotheses after exhausting all attempts to explain the pattern as a problem with the data. Thus, the same reports can be used directly by the analysts looking for new hypotheses.
These analysts are focused on satisfying the decision makers and are generally not aware of the nuances of what can go wrong with the data. Given the ease of use of the reports, the analysts are free to use the reports productively for their purposes. This change in use of the reports pushes the data science into the background. The data science activity now has the burden to find problems before the analysts can see them as potential new hypotheses.
Eventually this data science activity appears to be a costly overhead to the project and this cost should be minimized during the operational phase of the project. At this point we confront the question of whether it is possible to design a historical data store where all of the possible problems with data can be anticipated completely during the design and development phase of the project so that there is zero need for data science during the operational phase. There seems to be high confidence that this is possible. This confidence is strengthened by assigning responsibility to a designer for any failure of the data.
I argue this confidence is misplaced. Data science is no different than other historical science. Humans have been involved in the study of history for thousands of years where ancient evidence continues to be challenged for possible flaws. Even recent history with careful recording using video cameras and live reporting is open to debate about the reliability of that data.
The hypotheses supported or discovered by the data are what motivate the historical sciences to scrutinize the data. Flaws in the data can defeat a hypothesis, and there are lots of ways that flaws can be found when adequately motivated to hunt for them.
Designers of new big data systems do not have that motivation. The system is not yet operational so it is not possible to know what kind of hypotheses will be discovered once the system becomes operational. Designers lack this information. They base their designs on old hypotheses alone. Historical science wakes up when new hypotheses are discovered or old ones are used in new ways supported by recent data.
My previous post on police radars provides an example. Radar for checking vehicular speed is something that can be designed with great confidence. However, issuing a speeding ticket is like discovering a hypothesis (this car is speeding). The accused has the opportunity to challenge that finding by challenging the data. Frequently enough, the challenge is successful. The hypothesis motivates the data science. Data science is a necessary ongoing labor expense of any operational system that potentially generates new hypotheses.
Without ongoing data science labor to scrutinize data, there cannot be any confidence that the observed patterns have anything to do with the real world. There is too much opportunity for patterns to occur from just random combinations of deficiencies of the data.
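To make that last point concrete, here is a small simulation, offered only as a sketch under invented assumptions. Two hypothetical sensors measure completely unrelated quantities, but both pass through a shared collector that records a zero whenever it is down. The sensor values, the 10% outage rate, and the function names are all made up for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(n=2000, outage_rate=0.1, seed=0):
    """Return (correlation of true values, correlation of recorded values)."""
    rng = random.Random(seed)
    # Two unrelated real-world quantities (assumed values, independent by construction).
    a_true = [rng.gauss(50, 5) for _ in range(n)]
    b_true = [rng.gauss(50, 5) for _ in range(n)]
    # Assumed deficiency: a shared collector outage is stored as 0.0 for both sensors.
    outage = [rng.random() < outage_rate for _ in range(n)]
    a_recorded = [0.0 if down else a for a, down in zip(a_true, outage)]
    b_recorded = [0.0 if down else b for b, down in zip(b_true, outage)]
    return pearson(a_true, b_true), pearson(a_recorded, b_recorded)


true_corr, recorded_corr = simulate()
```

Under these assumptions the true values are essentially uncorrelated, while the recorded values show a strong positive correlation that comes entirely from the shared outages. An analyst who sees only the stored data would find a pattern that has nothing to do with the real world, and only someone who knows how the collector encodes outages could explain it away.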