In earlier posts, I distinguished between present-tense and past-tense science. The distinction is that past-tense science is restricted to recorded observations, while present-tense science has the opportunity to control the collection of new observations.
Present-tense science includes the full range of our interaction with the physical world, including the engineering and operation of technologies where observations are sometimes a mere by-product of their primary purpose. Present-tense science stays in the present in a way that allows it to change how processes are run and observed.
The past-tense sciences focus on the problem of getting the best information out of recorded observations. The observations are subject to scrutiny for accuracy, relevance, and lack of ambiguity. The goal of the past-tense science is to derive new hypotheses or test existing ones based on the available recorded observations.
In practice, individuals will specialize in some field of inquiry and engage in both types of science. Scientists in a particular field will design and control their own experiments to collect information that they will then use to test hypotheses or propose new ones.
My motivation for separating the two activities is to describe the specialty of data science as a purely historical science of making the most of recorded observations of events that can never be repeated. In this context, historical measurements may be only a few seconds old, but they have time-stamps that can never be reproduced by future observations. Even such recent observations become subject to the same kinds of scrutiny and doubt that we apply to much older data such as ancient written accounts or discovered artifacts. Although the observations can never be reproduced, they can be subject to repeated rounds of analysis for supporting or countering claims about what is being observed. Any analysis of a particular set of observations is available for future criticism or reinterpretation.
Lately, I have been thinking about the different concepts of time in my separation of present- and past-tense sciences. Past-tense science, or data science, treats time (a time-stamp) as an observation. Although the time-stamp comes from some clock, it could be any value. The discipline deals with the reality that old events can never be reproduced with the same time-stamps, yet the value of the time-stamp itself is arbitrary.
In contrast, the present-tense sciences recognize causality where events are constrained by the elapsed time from other events. This causal relationship may be so well understood that we can infer the intermediate observations without needing to make new ones.
The difference between the two sciences, as I define them, is a difference in the value of intermediate observations. In my definition, one goal of historical science is to discover new hypotheses. A new hypothesis is one that proposes a new form of causality or predictable relationship that was not previously known. To support that mission, historical science is interested in observations that would seem redundant to a present-tense science that accepts current causal explanations.
For an illustration, consider the problem of tracking a commercial airliner flying level at cruising altitude somewhere in the middle of its flight path. The plane may remain on this path for several hours. A present-tense science can be satisfied with a single observation of position, direction, and speed. With this single observation, the position of the aircraft 10 minutes later can be calculated. The only reason for a new observation is that small errors may be present in the initial observation. Suppose there is a new observation 10 minutes later. There are now two measurements 10 minutes apart that agree with each other in speed and direction, and the distance between them is exactly what is expected from the causal relationship of speed and time. Given that these two observations indicate very little error in the measurements, there is high confidence in the position of the aircraft at each instant between them. The present-tense science has little motivation to make a new measurement in the intermediate interval.
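The calculation the present-tense view relies on can be sketched in a few lines. This is only an illustration under simplifying assumptions that are mine, not part of the scenario: a flat x/y grid in kilometers instead of real latitude and longitude, constant speed and heading, and a made-up function name and numbers.

```python
import math

def predict_position(x_km, y_km, heading_deg, speed_kmh, elapsed_min):
    """Dead-reckon a later position from a single observation.

    Assumes straight, level flight at constant speed on a flat grid
    (a simplification; real tracking works on a curved Earth).
    """
    distance_km = speed_kmh * elapsed_min / 60.0
    heading_rad = math.radians(heading_deg)
    # Heading is measured clockwise from north, where north is the +y axis.
    dx = distance_km * math.sin(heading_rad)
    dy = distance_km * math.cos(heading_rad)
    return x_km + dx, y_km + dy

# Hypothetical observation: at the origin, heading due east at 900 km/h.
x10, y10 = predict_position(0.0, 0.0, 90.0, 900.0, 10.0)
print(round(x10, 6), round(y10, 6))
```

With these made-up numbers the aircraft is expected about 150 km east of the first observation after 10 minutes. A second observation that lands there adds confidence in the first one but no new information.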
In contrast, the past-tense science may be interested in the position of the aircraft at the intermediate 5-minute point. This observation may be needed in order to match other data that has that same time-stamp.
In earlier posts, I made a distinction between bright and dark data, defining bright data as a direct observation and dark data as a calculated value based on models. I made this distinction because model-based dark data can interfere with the objective of discovering new hypotheses. The model-generated data biases the project toward continuing to use the old model.
In the above scenario, the data science project would prefer to have a fresh observation (bright data) at the intermediate 5-minute point even though the present-tense science would be satisfied with the model-generated value (dark data) for the same time. In practice, the data science project will use the model-generated value because it is the only one available. However, the model-generated value lacks the quality of a fresh observation.
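One way a data store might keep that difference visible is to label every value with where it came from. The sketch below is my own illustration, not an established schema: the `Sample` record, the `source` labels, and the use of simple linear interpolation as the "model" are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    minute: float
    position_km: float  # distance along the track
    source: str         # "observed" (bright) or "modeled" (dark)

def value_at(samples, minute):
    """Return the observed sample at `minute` if one exists.

    Otherwise interpolate linearly between the neighboring samples
    and label the result as modeled.
    """
    ordered = sorted(samples, key=lambda s: s.minute)
    for s in ordered:
        if s.minute == minute:
            return s
    for before, after in zip(ordered, ordered[1:]):
        if before.minute < minute < after.minute:
            fraction = (minute - before.minute) / (after.minute - before.minute)
            position = before.position_km + fraction * (
                after.position_km - before.position_km
            )
            return Sample(minute, position, "modeled")
    raise ValueError("minute is outside the range of the samples")

# Hypothetical observations taken 10 minutes apart.
observations = [Sample(0, 0.0, "observed"), Sample(10, 150.0, "observed")]
midpoint = value_at(observations, 5)
print(midpoint)
```

The 5-minute value is usable for matching against other data, but its `source` field tells a future analyst that it came from the model and not from a measurement.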
For this example, there is little if any concern about the reliability and relevance of the model-generated value. However, consider a scenario where the technology suddenly improves to provide observations every minute. The present-tense science may be satisfied with its existing investment in technologies based on 10-minute samples, and thus may decide either to ignore the intermediate observations or to combine the values so that it continues to feed its processes with 10-minute updates. Because the present-tense system can continue to meet its requirements without using the more frequent observations, it may ignore these additional observations.
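The two options open to the present-tense system, ignoring the extra observations or combining them, might look like the following. The function names and the stand-in readings are hypothetical, chosen only to make the effect easy to see.

```python
def decimate(readings, step=10):
    """Keep every `step`-th reading and ignore the ones in between."""
    return readings[::step]

def block_average(readings, step=10):
    """Combine each block of `step` readings into a single mean value."""
    blocks = [readings[i:i + step] for i in range(0, len(readings), step)]
    return [sum(block) / len(block) for block in blocks]

# Hypothetical one-per-minute readings for 20 minutes (stand-in values).
one_minute_readings = [float(m) for m in range(20)]

kept = decimate(one_minute_readings)
averaged = block_average(one_minute_readings)
print(kept)
print(averaged)
```

Either way, 20 readings become 2. The operational process is still fed on schedule, but 18 of the original observations never reach the historical record unless they are stored separately.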
In contrast, the past-tense science would be eager to have the one-minute updates to replace the need for model-generated numbers. The reason for this difference of opinion is the difference in the meaning of time. The present-tense system is only concerned with time as a causal factor for its analysis of current events. Once an event becomes historic, it becomes irrelevant. In contrast, the past-tense science saves that observation indefinitely, and any analysis using that observation is perpetually subject to future scrutiny and criticism. Historical data may be revisited in the future by new generations of analysts with newer concepts. The use of model-generated values instead of fresh observations can at some future date raise doubts about past analysis.
I have experienced a case, using a completely different type of data, where the historical data analysis valued more frequent observations that the present-tense system ignored. A rough analogy in this aircraft scenario would be a pilot of another aircraft flying nearby who reports that the first aircraft was flying at a lower altitude than it should have been. Perhaps this was a temporary change that lasted just long enough to be noticed by the second pilot. This information is discovered much later, perhaps days later. Although the present-tense science of operating the flight was satisfied with 10-minute updates, the past-tense science would be eager to see 1-minute updates to determine whether there was in fact an altitude deviation, whether it occurred at the time reported, how large the deviation was, and whether this was a common event on this flight, for this airplane, this pilot, this route, or some combination.
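To make the analogy concrete, here is a small sketch of how a brief deviation can be present in a 1-minute record and absent from a 10-minute one. All of the numbers are invented for illustration, including the assigned altitude, the tolerance, and the dip itself.

```python
ASSIGNED_ALTITUDE_FT = 35000  # hypothetical cruising altitude
TOLERANCE_FT = 300            # hypothetical allowed deviation

# Hypothetical 1-minute log (minute -> altitude in feet) with a brief dip.
one_minute_log = {minute: 35000 for minute in range(0, 21)}
one_minute_log[13] = 34200
one_minute_log[14] = 34500

# What an operational system retaining only 10-minute updates would hold.
ten_minute_log = {m: alt for m, alt in one_minute_log.items() if m % 10 == 0}

def deviations(log, assigned, tolerance):
    """List (minute, altitude) pairs that are outside the tolerance."""
    return sorted(
        (minute, altitude)
        for minute, altitude in log.items()
        if abs(altitude - assigned) > tolerance
    )

print(deviations(one_minute_log, ASSIGNED_ALTITUDE_FT, TOLERANCE_FT))
print(deviations(ten_minute_log, ASSIGNED_ALTITUDE_FT, TOLERANCE_FT))
```

The 1-minute record can confirm the second pilot's report and give its time and size. The 10-minute record shows nothing unusual, so an analysis based on it could neither confirm nor refute the report.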
To support historical analysis, data science places more value on observations than would be absolutely necessary for the present-tense science. Even when the present-tense science involves an experiment to test a hypothesis, the experiment may minimize the observations to meet its present needs. Once those observations become part of the historical record, they become available for future analysis that may find fault with inadequate observations. For historical data, the additional observations are valuable even though they may seem redundant or unnecessary for the present experiment. An example is where the hypothesis tested by an earlier experiment is later refined to include additional factors that are more time sensitive. In such a case, the earlier experiment would have been more useful if it had made more frequent observations than were originally necessary.
For this post, I’m focusing on a distinction between the two sciences’ concepts of time. For the present-tense science of managing the physical world, time is analytic. Present events are causally dependent on immediate past events. In contrast, for the past-tense science of scrutinizing historical records, time is historic and is subject to scrutiny as new information or theories become available.
For present-tense science, time is an independent variable that other processes depend on. For past-tense science, the observation is the independent variable that includes a time-stamp. As a past-tense science, data science values observations instead of time.
Recent improvements in data technologies provide us with long-term stored data of increasing volume and scope of observations. The amount of data available is more than is required for operational needs. Although costs are decreasing, there is still a significant cost for managing this amount of data. The more data we have, the more we are willing to pay the cost to keep it and make it available for analysis. Historical analysis values observations more than operational present-tense systems do. The observation is the independent variable for historical analysis, just as time is the independent variable for operational systems.