In my last post, I described how sciences dealing with the immediate physical world involve models where time is an independent variable. For interacting with the world, we have theories that connect causal events by the elapsed time between them. In contrast, the science of studying the record of past events depends on observations rather than time. For data science (historical science), the independent variable is the observation, and time is just one part of the observation.
I think this distinction is useful. Observations, not time-based models, provide the primary value in historical analysis. For historical analysis, there is an unlimited demand for observations far in excess of what is required to interact with the physical world. I provided the example of the gardener who can ignore a planted seed for 10 weeks and then confidently go out and find something nearly ready for harvest. The data scientist would welcome exhaustive observations made several times a day for the entire period. Those observations are valuable even though they would provide no benefit to the progress of the plant.
In my earlier posts describing a taxonomy of different data types, this emphasis on observations was presented as a preference for bright data (well-documented and controlled observations) over dark data (model-generated data). The reason is that an abundance of observations free of any preconceived models gives us the opportunity to discover new hypotheses, and in particular hypotheses that can challenge the preconceived models. Accepting model-generated data instead of independent observations will only end up confirming the preconceived model.
In other posts, I defended the need for labor-intensive data science practices during the operational phase of a data project because observations can change unexpectedly over time. The sources of the observations may degrade, or they may be replaced with substitutes that approximate but don’t exactly match the data from the original source. Often the nature of the subject being observed may change so that the otherwise unchanged source no longer records the same type of information. Using the gardening example from the last post, a leaf-area sensor may become inaccurate when a new disease kills off part of the leaf. The inherent value of the observations justifies the investment of routine analysis to confirm that the observations continue to accurately measure what is required.
Historical analysis uses observations in a very different way than an operational system uses measurements to control a process. The operational system is very time-sensitive: it may need measurements to be updated at precise intervals, and it has no need for old, obsolete measurements. In contrast, historical analysis can use observations that occur more frequently or less frequently than the operational need. A farmer can quickly assess the status of his crop with a short observation, while the data-collection approach may continuously and extensively measure every plant.
The observation is the treasure of historical analysis. The recent parallel advancements of inexpensive sensors and of affordable storage and efficient retrieval of data have been a bonanza for historical analysis. We are still adapting to the reality of a luxury of voluminous observation data for analysis. The developments are so recent that we are still using old approaches that attempt to reduce the data into the statistical or mathematical models that were essential when observations were rare and expensive.
What if man had access to cheap sensors and big data technology before he had any other kind of science? Would we have bothered to invent theories built on simple mathematical or statistical models to understand nature? With big data, we would have access to extensive past observations. We could retrieve actual observations instead of trying to devise models to predict the missing information.
For example, we would not need to project a value for a population based on a statistical sample if we had access to inexpensive and immediate observations of that entire population. Even many engineering models could be replaced with queries of recorded observations of every possible condition that may be encountered. For instance, this article suggests that Google’s language translation uses algorithms based on recorded historical translations as an alternative to approaches that attempt to emulate human language processing.
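The contrast between projecting from a sample and querying the whole population can be sketched in a few lines. This is a hypothetical illustration with synthetic data, not an example from the post; all names and numbers are assumptions.

```python
# Sketch: estimating a population mean from a small statistical sample
# versus computing it directly over every recorded observation.
# The data here is synthetic and purely illustrative.
import random

random.seed(42)
# Stand-in for a big-data store of observations of the entire population.
population = [random.gauss(50, 10) for _ in range(100_000)]

# Traditional approach: project a value from a small sample.
sample = random.sample(population, 100)
sample_estimate = sum(sample) / len(sample)

# Big-data approach: query the full set of recorded observations.
exact_value = sum(population) / len(population)

print(f"sample estimate: {sample_estimate:.2f}")
print(f"exact value:     {exact_value:.2f}")
```

With cheap storage, the "exact" computation over all observations is no harder to express than the sample projection, and it carries no sampling error.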
Early in my career, I was taught to look up values such as logarithms or normal-distribution values from printed tables. Every possible value (at least to a few significant figures) was in a table that could be looked up. Even random numbers were printed in books. One possible approach to transferring these to computers would have been to merely copy the tables into computer memory and automate the look-up of the values. I recall some discussion that this would have been preferred: it avoids the long-term inefficiency of repeatedly recomputing the same values and reduces the possibility of a randomly occurring calculation error. At that time, however, early computing technology made processing much cheaper than memory storage, so we invested in algorithms that provided the benefit of more precision even if the values would be computed redundantly over time.
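The copy-the-printed-table approach could be sketched as below. This is a minimal illustration, assuming a table of base-10 logarithms at two decimal places of granularity; the function name and table range are my own choices, not anything from the post.

```python
# Sketch: the "printed table in memory" approach to logarithms.
# Precompute the values once, then look them up instead of recomputing.
import math

# Build the table once: log10 of 1.00 through 9.99 in steps of 0.01,
# mimicking the granularity of a printed log table.
log_table = {round(x / 100, 2): math.log10(x / 100) for x in range(100, 1000)}

def table_log10(value: float) -> float:
    """Look up log10 from the precomputed table (value given to 0.01)."""
    return log_table[round(value, 2)]

# Trades memory for repeated computation, which early computers could not
# afford; today the whole table is a trivial amount of storage.
print(table_log10(2.00))
```

The trade-off the paragraph describes is visible here: the table costs memory up front but makes each subsequent look-up a constant-time retrieval with no chance of a fresh calculation error.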
This was a much bigger issue when it came to data collections, in particular data collections for computer simulation or other analysis. Frequently, the amount of data collected would far exceed the capacity to store that data in the computer. To solve this problem, we developed statistical models of the voluminous data so that a few parameters could reasonably represent the observations. Simulations would then process the statistical model to reproduce representative values in place of what we couldn’t store in the first place. If memory had been cheaper, we would have preferred to use the original data without the statistical models. Indeed, in many cases the statistical models can hide details such as correlated observations or the failure of observations to be ergodic (the assumption that population averages at a particular time can be estimated from observations made at different times).
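The data-reduction practice described above can be sketched as follows. The data and parameter choices are synthetic assumptions for illustration; the point is only the shape of the workflow: reduce many observations to a few parameters, then regenerate values from those parameters.

```python
# Sketch: reducing voluminous observations to a two-parameter statistical
# model, then regenerating "representative" values from that model.
# Synthetic data; all numbers are illustrative.
import random
import statistics

random.seed(7)
observations = [random.gauss(20.0, 3.0) for _ in range(50_000)]

# Reduce the data to two parameters, as scarce memory once forced us to.
mu = statistics.fmean(observations)
sigma = statistics.stdev(observations)

# A simulation would later regenerate values from the reduced model...
regenerated = [random.gauss(mu, sigma) for _ in range(1_000)]

# ...but the reduction discards structure, such as correlations between
# successive observations, that the raw data would have preserved.
print(f"reduced model: mu={mu:.2f}, sigma={sigma:.2f}")
```

The two retained parameters fit in a few bytes, which is exactly why this practice was attractive; the loss of sequence structure is exactly why it can hide correlated or non-ergodic behavior.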
Now we have cheap memory and the ability to access actual data collections of far larger volumes. In some cases, we continue to reduce the available observations into approximate statistical models simply so that we can continue to use old techniques that were originally justified because memory was too expensive to hold all of the observations. We are only beginning to shed these old practices and embrace the power of working with data without imposing data-reduction practices previously required to avoid excessive storage requirements.
I am thinking more broadly about science in general. The recent posts have made the distinction that present-tense sciences are focused on time while past-tense sciences are focused on observations. I suggested that this is a good reason to separate the two as different intellectual pursuits.
However, I wonder whether a science based on observations rather than time can allow us to understand the physical world. Can the real world be understood in terms of an immense number of observations? Everything about the world could be explained as observations, as data entries in a data store. We would still derive causal relationships based on observations, but the relationships would not need to share some common underlying explanation. Observations of a falling object and observations of an object sent along a trajectory may suggest two separate relationships that don’t demand a common underlying explanation (such as gravity). Mathematical models based solely on the available observed data may end up with models that have nothing obviously in common where we would otherwise have conjectured something common (such as gravity). Alternatively, the mathematical models may suggest something in common where there is very little in common (such as treating electromagnetic waves as the same as earthquake waves).
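The idea of deriving separate empirical relationships purely from recorded observations can be sketched concretely. Here two hypothetical observation sets, heights of a dropped object and of a thrown object sampled at 0.1 s intervals, are each analyzed independently; the data is synthetic and the function name is my own.

```python
# Sketch: two independent empirical models derived from two observation
# sets, with no shared underlying theory assumed up front.
dt = 0.1  # sampling interval in seconds (assumed)
# Synthetic recorded observations of height over time.
drop = [100.0 - 4.9 * (i * dt) ** 2 for i in range(20)]
throw = [10.0 * (i * dt) - 4.9 * (i * dt) ** 2 for i in range(20)]

def second_difference_accel(series, dt):
    """Estimate a constant acceleration from second differences of positions."""
    second_diffs = [series[i + 1] - 2 * series[i] + series[i - 1]
                    for i in range(1, len(series) - 1)]
    return sum(second_diffs) / len(second_diffs) / dt ** 2

# Each data set yields its own empirical relationship; nothing in the
# data-driven procedure itself posits a single cause such as "gravity".
a_drop = second_difference_accel(drop, dt)
a_throw = second_difference_accel(throw, dt)
print(f"drop: a = {a_drop:.2f}, throw: a = {a_throw:.2f}")
```

Whether an analyst then unifies the two fitted relationships under one explanation is a separate, theoretical step; the observation-driven models stand on their own either way.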
In an earlier post (further explored here), I asserted that time travel is a nonsensical notion because physical reality is fixed just outside the edge of the measurable. Physical matter and energy exist distinct from time: where one exists, the other does not. In this formulation, time could be replaced with observations. Observations at a particular time automatically become historical data about what the physical quantities were just before the observation.
The gradual accumulation of observations gives the illusion of the passage of time, but time is always historical time. Time is historical observations of reality. The illusion of time caused by the succession of observations is why time can only go in one direction. Time doesn’t really exist at all. Only observations exist.
Also, our awareness of reality is limited by our ability to make observations. These observations are very helpful but the actual physical world is just beyond the reach of our observations. The physical world is always just one step ahead of us.
If we had begun our scientific inquiries with ready access to big-data technologies, this would not have been a problem. We would be comfortable working with observations as a subject in their own right. We could come up with completely different strategies for interacting with the world based on queries of the big data store. We would not need models that suppose we have access to the physical world.
The observation approach is a more accurate and realistic assessment of reality. We do not have direct access to the physical world; we have access only to our observations about that world. Every observation is a historical observation of what the world used to be like, even if only a nanosecond ago. We only need one science, not two, and that science is the science of the past tense.
This conclusion is one that I see us making as we continue to be seduced by the wonders of big data. We can do science, all science, on previously recorded data. Increasingly, we run experiments, do hypothesis testing, and obtain statistical inferences on already collected, historical data. Working with historical data is valuable, but only to the extent that it gives us the opportunity to discover new hypotheses. Testing hypotheses is the domain of present-tense science because it involves designing new experiments to collect new data with careful documentation and controls that focus on the issues raised by the discovered hypothesis. Using the same data to both discover and test a hypothesis is analogous to trying to discover a new hypothesis where all of the data is simulated from models of previous theories: it is self-confirming.
Abstractly, the physical world resides beyond our immediate grasp. We have access only to observations about the physical world. However, even the biggest of big data projects captures a trivial portion of the possible observations about the world that may relate to a particular hypothesis. Yet, increasingly we are excited about the prospect of concepts such as predictive analytics, which at its center involves the simultaneous discovery and testing of hypotheses based solely on preexisting historical data.