I concluded my last post with a distinction between the present-tense science of dealing with the physical world and the past-tense science of interpreting recorded information. The distinction is that for present-tense science, time is the independent variable upon which various causal relationships depend. In contrast, for past-tense science the observation is the independent variable, and a time stamp is just one of the things observed.
There may be a way to reconcile the two. For example, in quantum mechanics there is a concept that certain phenomena are in a probabilistic combination of two contradictory states until there is an observation. This could suggest that observation is always the independent variable, with time becoming some regularly recurring observation. For this post, I prefer to leave the two concepts distinct. The sciences involving present-tense physical reality focus on time. The past-tense interpretation of recorded information focuses on observations, that is, on recorded data.
My background is in electrical engineering, in particular in communications and control systems, both of which involved a lot of differential equations based on derivatives with respect to time. These equations reliably project future outcomes from specific initial conditions. Once the initial conditions are known, any later observation is redundant with what the equations predict. Assuming we can measure the initial conditions perfectly and that the experiment presents an ideal environment, we know what happens at any time based on the equations.
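As a minimal sketch of this idea, assume a simple first-order system, dx/dt = -x / tau. This is only an illustrative choice, and the names `state_at`, `simulate`, `x0`, and `tau` are hypothetical. The closed-form solution gives the state at any time directly from the initial condition, and stepping the same equation forward through every intermediate moment adds nothing the equation did not already predict.

```python
import math

def state_at(t, x0, tau):
    """Closed-form solution of dx/dt = -x / tau with x(0) = x0."""
    return x0 * math.exp(-t / tau)

def simulate(t_end, x0, tau, steps=100_000):
    """Step the same equation forward with small Euler steps,
    as if we were watching the system at every moment."""
    dt = t_end / steps
    x = x0
    for _ in range(steps):
        x += dt * (-x / tau)
    return x

# Hypothetical initial condition and time constant.
predicted = state_at(3.0, x0=5.0, tau=2.0)
stepped = simulate(3.0, x0=5.0, tau=2.0)

# The two values agree to within the small numerical error of the stepping.
print(predicted, stepped)
```

Here time is the only input needed once `x0` and `tau` are fixed, which is the sense in which later observations are redundant.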
For this post, a better example is the more common experience of gardening. During planting season, a gardener may purchase a package of 20 seeds with a guaranteed 90% germination rate and instructions that say the plant will bear fruit 10 weeks after planting. He can plant the seeds in well-prepared soil according to the instructions. After planting, he could simply ignore the plot for 10 weeks, then come back and find at least 18 plants, each with a ready-to-pick fruit. Realistically, experience says that weeds may start to compete with the plants every 4 weeks, so he may need to go out every 4 weeks to pull them up. Even then, he only needs to pay attention during the brief activity of pulling weeds and can ignore the plot for all the time in between.
It is very useful to know that the plants will grow on their own at a predictable rate. The gardener doesn’t have to stand on top of the plot continuously from the point of planting the seeds to the point when the fruit is ready to pick. There are plenty of other chores that need attention. From the present-tense experience, time alone is sufficient to know the progress of the plant.
For data science, there is a completely different motivation for obtaining observations throughout the entire process. I could see value in collecting observations over the entire growth period. From the moment the seeds are planted, there could be hourly measurements of soil moisture, temperature, and sunlight exposure at each of the 20 locations. As the plants emerge, there could be periodic updates on each plant’s mass, height, and number of branches and leaves. Each of the leaves of each plant could be measured for size and color. These measurements can occur for every plant several times per day.
Note that this example is not a scientific experiment. As noted earlier, all of these observations are unnecessary for the project of harvesting the ripe fruit. Instead, data science would welcome these observations in the very same scenario of planting something to harvest 10 weeks later. The observations are valuable for the project of doing historical science. Even in this project, most of the observations may never be found to be useful. The millions or billions of recorded observations are available for querying and aggregation. It is likely that these queries will not show anything interesting at all. Even if these queries do produce a discovery of a new pattern that suggests a new theory, the pattern will only use a small fraction of all of the observations. Data science still values all of the observations. We would like to keep this data indefinitely so that we can append new data from future growing seasons.
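A minimal sketch of what such querying might look like, assuming a handful of made-up records (the field names `plant`, `hour`, `soil_moisture`, and `height_cm`, the values, and the helper `average_by` are all hypothetical). The point is only that each observation is the unit of data and the time stamp is one field among several, so a query can group by time or by anything else.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observations: each record is one measurement, and the
# time stamp ("hour") is just another field alongside the others.
observations = [
    {"plant": 1, "hour": 0, "soil_moisture": 0.30, "height_cm": 0.0},
    {"plant": 1, "hour": 24, "soil_moisture": 0.20, "height_cm": 0.5},
    {"plant": 2, "hour": 0, "soil_moisture": 0.40, "height_cm": 0.0},
    {"plant": 2, "hour": 24, "soil_moisture": 0.50, "height_cm": 1.0},
]

def average_by(records, key, value):
    """Group records by one field and average another field per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {k: mean(v) for k, v in groups.items()}

# The same aggregation works whether or not time is the grouping field.
moisture_by_plant = average_by(observations, "plant", "soil_moisture")
moisture_by_hour = average_by(observations, "hour", "soil_moisture")
print(moisture_by_plant)
print(moisture_by_hour)
```

Most such queries on real data would show nothing of interest. The sketch only illustrates that the records, not the clock, are what the analysis is built on.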
I meant this as a silly example, but it occurs to me that this is probably already happening to some extent. Affordable consumer technologies are available to perform this kind of dedicated data collection. Given the examples of various projects I’ve seen on the Internet, I have no doubt there is someone who has been doing something similar to this for several seasons by now.
I now recall reading a story about commercial farmers using inexpensive aerial drones to monitor their crops. The data is collected for present-tense farming as a way to optimize the current crop. But the data is cheap to store indefinitely, allowing for future historical analysis that compares the same period across different seasons and then relates that information to the final quality of the harvest. It is likely that most of the time the farmer’s efforts are no different from when he lacked this data. The progress of the crop is pretty much determined by the number of days and the progress of the seasonal changes; it is primarily dependent on time. Even if the observations are not helpful for the current crop, the data becomes valuable for its own purposes.
Although recent advances have made observation and measuring technologies more readily available and more affordable, we have always had some processes that depended on some form of measurement. What has changed recently are the data technologies that permit not only affordable long-term storage of recorded observations but also powerful tools for querying that data and organizing it into reports for new types of analysis.
With these data technologies, we are going beyond recording the data that is essential for operating a system to seeking out previously unnecessary observations. We are adding more sensors for more types of information and we are deploying these sensors more exhaustively through the process. In the farming scenario, there is a desire for recurring observations of each individual plant.
The gardening scenario is an analogy for what is happening to us. Like the farmer who desires recurring observations of each individual plant, corporations and governments desire (and are likely obtaining) recurring observations of each individual human, whether a customer, a potential customer, a citizen, or a person who happens to be in a particular location. Just as in the farmer's case, recent technological advances make this inevitable. Observation technology is cheap and readily available. The resulting digital data is easily stored indefinitely and easily queried for analysis.
Recently, there have been many discussions of the controversy about this data collection by private and public organizations, government and non-government. I suggest that it may be helpful to distinguish two objectives for this data, corresponding to the two sciences I described. The data may be used in the present tense, where the observations motivate intervention or proactive actions to control current events. Alternatively, the data may be used in the past tense, where the events are long past and we are instead interested in improving our basic knowledge about some topic.
The gardening example presents the historical case nicely because eventually the plants will have lived their life cycle but the observed data will continue to be useful indefinitely. In the gardening example, a query of observations over multiple seasons may discover a new opportunity to make better crops. In that case the individual plants observed are long gone.
Many of the big data projects involving observations about people are motivated by a similar desire for long-term trend information to help us improve policies or practices. Unfortunately, this historical data includes observations of still-living people. Data meant for historical analysis can be used in the present tense. This is happening today, where data collected for a historical-analysis purpose is used to criticize, embarrass, or disqualify a person later in life. When this happens, it may still be helpful to recognize that historical data is being reused for present-tense purposes, although I also recognize that maybe my distinction is one without a difference.
For this post I want to return to the original statement of distinguishing time-dependent science from observation-dependent science.
Many processes we use to interact with the physical world involve fairly well-known causal properties that allow us to predict future conditions based on initial conditions and the passage of time. Although most operational systems have some form of measurement to use in feedback loops to control the processes, I make the distinction that these measurements are needed only for a brief period of time and can then be discarded. The controlled physical process remains dependent on time, often through very well understood causal relationships.
In contrast, the historical sciences are dependent on observations, where time is one of many quantities that can be measured. The historical sciences are greedy for far more observations than are needed to interact with the physical world. These sciences are also greedy in terms of retaining this data long after it is useful for interacting with the physical world. Although I have not made the case in this post, this greediness is not new. Historical studies are characterized by careful preservation of historical evidence and by eagerness to obtain as many observations as possible. What is new today is the availability of abundant and affordable observation-gathering tools and of long-term storage for large volumes of data, with tools for quick retrieval.
The point I want to continue to explore is that this focus on observations instead of time suggests that data science is fundamentally different from the physical sciences of the present tense. As noted above, when I extended the gardening example to monitoring humans, there may not be an easy way to distinguish the two efforts. However, I think the two efforts do suggest that there are two ways to view the world: the present-tense science, where the world is understood in terms of time, and the past-tense science, where the world is understood in terms of observations.