In this blog, I have been discussing a perspective gained from my specific experience working directly with a large amount of diverse data. When I started that project, I was thinking only of the technical challenges of getting data, preparing it for the data store, and doing something useful with it. These were mostly technology issues: software scripts, regular expressions, connections to remote sources, management mechanisms for continuous operations, queries, and report layouts for users. It kept me busy.
It was only when I started to write this blog that I began to think about data as a subject of inquiry in its own right. I started to talk about the nature of different types of data (which I characterized in varying degrees of brightness) and about a life cycle of data. My motivation for talking about data in this way was that I wanted to write about what I learned working with data, but I couldn’t effectively describe the details of what I did. I needed to elevate the discussion to the qualities of the abstract notion of data. I found I had a lot to write about.
Again, these thoughts are from my own experiences and not reflective of any general survey of all experiences with data. Despite this personal perspective, I started to use the term data science to describe what I was talking about. In this context, data science is a science of data. I’m thinking specifically in analogy to a field scientist making observations of his subject in its natural setting. I’m attempting to observe a specific family of creatures known as data.
One of the things I observed about data is that it seems to have a particular life cycle of utility, from observation to action. The life cycle works something like the following:
- Collection of observations, with some control over what is being observed and documentation of the observations. Data gets its start when an observation is documented. I described this stage as present-tense science, a discipline I defined as everything that directly confronts the reality of the physical world.
- Discovery of hypotheses by compiling and conforming data from multiple sources, then exploring that data for suggestions of possible new forms of knowledge.
- Testing of discovered hypotheses by selecting well-controlled and well-documented data specifically relevant to a hypothesis, in order to test the validity of the theory and eliminate other possible confounding explanations.
- Making decisions to apply these tested hypotheses to operations that involve the physical world, the arena of present-tense science. This may involve production of new technologies such as inventions, or it may involve changes in policies through better predictions or prescriptions.
From this perspective on what happens to data, I imagine dividing the sciences (human inquiry) into three groups:
- Present tense science involves interacting with the real world and being exposed to the risks of injury from this interaction. In the data life cycle, the initial collection of observations, the execution of tests of hypotheses, and the targets of decision making are all in the present-tense arena.
- Past tense science involves the handling of the data and includes diligent scrutiny of the data for its accuracy, relevance, and conformity among its various sources. The processes of discovering hypotheses, evaluating tests, and communicating to decision makers are all in the past tense arena.
- The persuasive arts involve the process of reaching a decision, especially within communities. I am also inclined to call this the science of the future tense. I consider it an art because it involves creating arguments that project past evidence and hypotheses into the future. It is also unlike the other tenses because there are multiple possible futures, largely unknown, of which only one will become the present and then the past.
This organization of modes of inquiry comes from my personal experience working with data at different points of its life cycle. Even as I confronted issues in each stage of the data life cycle within the three tenses, I imagined myself as a theatrical actor playing multiple parts in the same play. Each time I would appear different and present a different personality, depending on the demands of the character needed for that particular scene. It is very important to draw boundaries between these disciplines and steps of the data life cycle because each requires different considerations of opportunities and risks.
These are just personal reflections on my experiences, presented in a personal blog. They do present a philosophical world view that seems to hold together even in broader contexts than the mere handling of data. In this philosophy, the present-tense brain processes strictly historical data in order to engage with the present tense in hopes of influencing the future. To do this, there is a certain pattern of behavior that seems centered on data.
Data is the independent variable of rational intelligent behavior. Data is always about the past tense because part of the documentation of that data is a time stamp that will never be repeated. Even data that is a few picoseconds old is lost to the past because of that documented time stamp. Consequently, intelligence resides in a past tense that is somehow divorced from the present. Intelligence is prized for its ability to influence the future, but that influence is blocked by the uncompromising and determined present.
I feel comfortable reflecting on my own views this way in part because I’m comfortable in the world of fiction. I am inventing a philosophy built on the postulate of the centrality of data. This could be a seed for some fictional storytelling. When I describe what kind of stories I like to write, I usually describe them as social fiction: imagining a feasible world that behaves differently because of a different philosophy. Perhaps this blog is simply a scratch pad for a future work of social fiction.
I discussed this fictional philosophy in various ways in earlier blog posts, as reflections of my impressions while doing my work. The recent post on a lecture about the nature of evidence delighted me at first because I saw confirmation of some of what I’ve been talking about. Specifically, it confirmed the need for awareness of the various ways that data (evidence) can be challenged for validity or relevance. However, I am also very aware that most of the lecture stands in stark contrast to what I’ve been talking about. My philosophical views presented above are not consistent with accepted practice, especially in the sciences. I may as well be writing a work of fiction.
The particular example from the lecture concerns the lecturer’s first case study: an economics debate over the relative explanatory power of two approaches to validating a theory. One side promotes statistical inference from field-study data as sufficient for establishing validity. The other side insists that randomized controlled trials (RCTs) can overrule the conclusions of field-study inferences: that RCTs produce a superior standard of validity.
The lecturer then presents a way to treat both as complementary approaches. Field-study inferences establish external validity: that the theory is useful in the real world. RCTs establish internal validity: that the causal theory is complete and produces repeatable, predictable results when the experiment is properly controlled, despite random subjects.
In the lecture these approaches are presented as two parallel lines of inquiry. Each has independent practices for collecting and interpreting data. Each works with its internal data to present conclusions to decision makers. Decision makers can and will be presented with two competing views of reality.
It is not hard for me to imagine decision makers demanding that the scientists come up with a single version of reality. Perhaps that is the motivation for the debate and why it matters. There needs to be a hierarchy of validity. If there are two parallel approaches and they happen to present different theories for the same topic, then the decision maker has to make a choice. The decision maker has the responsibility of choosing the one with the most validity. He needs an answer as to which approach is more valid.
In my fictional world view this conflict does not occur, because I place the activities in a sequence of steps that I call the data life cycle. The observations feed a process of hypothesis discovery that includes statistical inference. A discovered hypothesis always needs testing. In my view, the rigor exemplified by RCTs represents the ideal form of hypothesis testing.
However, I allow for the inevitable possibility that the only practical way to test a discovered hypothesis is to decide to put it into practice immediately, without controlled, laboratory-style testing. This immediate application combines decision making with hypothesis testing, because the hypothesis is tested by whether it works when applied to the uncontrolled world, with potentially rewarding or injurious consequences.
It is worth noting that even with the highest-quality laboratory testing, such as RCTs, applying a theory to the broader real world will test that theory and expose it to the risk of failing when confronted with the unexpected. This is the external-validity argument for the primacy of statistical inference discussed in the lecture.
Instead of arranging the different inquiries as competing viewpoints, I arrange them in a series that builds up a single hypothesis from its initial raw discovery to its eventual improvement through testing. Because the discovered hypothesis comes from observations, it starts with some claim to external validity: the hypothesis is useful for describing the broad real world. Testing adds internal validity to the pre-existing external validity.
It seems to me that in my fictional philosophy there would never be a debate about whether statistical inference from observations can compete against findings from RCTs. The reason is that both work along a single path using the same data. In my philosophy, testing is exclusively of hypotheses discovered from real-world observations.
My philosophy is not consistent with the professional scientific views that the lecture’s debate exemplifies. Scientific laboratory studies can discover and test hypotheses based purely on laboratory data. On the other side, scientific field studies use statistical tests to validate their findings from observational data. The two approaches can be completely independent and internally complete, each presenting final recommendations to a decision maker. A decision maker is inevitably challenged to select the best of competing theories based on independent data paths.
In my philosophy of data, this would not happen, because the only hypotheses would be the ones discovered from observed data. Testing only strengthens the validation of discovered hypotheses.
I do recognize that many organizations are investing heavily in central data warehouses with the goal of simplifying decision making by having a single version of truth. Such data warehouses share with my philosophy a focus on data, and their goal of a single version of truth appears consistent with my philosophy. However, as I described in an earlier post, I don’t accept the notion of a single version of truth.
When presented with the entire universe of available observations, we are going to confront redundant but conflicting measurements. The ideal of a single version of truth requires discarding the redundant data and keeping only the measurement determined to be most reliable. In my experience, I retain redundant information and thus confront the reality of multiple versions of the truth. This is why I suggest the need for a separate, controlled test that uses new data collected specifically to test the hypothesis.
In contrast, the organization’s single-version-of-truth data warehouse is used for both discovery and testing. In essence, the organizational data warehouse behaves identically to field-study science, using statistical inference both to discover and to test hypotheses. I insist that testing needs a separate process with new data specific to the test.
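The contrast between the two treatments of redundant data can be sketched in a few lines. Everything here is hypothetical: the quantity, the sources, the values, and the assumed reliability ranking are invented solely to show how a single version of truth discards a conflict that a multiple-versions store keeps visible.

```python
from collections import defaultdict

# Two redundant measurements of the same quantity from different sources.
measurements = [
    {"quantity": "daily_sales", "source": "pos_system", "value": 1250},
    {"quantity": "daily_sales", "source": "finance_ledger", "value": 1238},
]

# Single version of truth: keep the source deemed most reliable, discard the rest.
preferred = {"daily_sales": "finance_ledger"}  # an assumed reliability ranking
single_truth = {
    m["quantity"]: m["value"]
    for m in measurements
    if m["source"] == preferred[m["quantity"]]
}

# Multiple versions of truth: retain every observation, conflicts and all.
all_truths = defaultdict(list)
for m in measurements:
    all_truths[m["quantity"]].append((m["source"], m["value"]))

print(single_truth)       # {'daily_sales': 1238} -- the conflict has vanished
print(dict(all_truths))   # both values survive, and the disagreement stays visible
```

In the first structure the disagreement between sources is resolved before anyone sees it; in the second, the disagreement itself is part of the record and can inform a later, separate test.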
I am blogging a point of view that conflicts with common practice. The point of view is based on my personal practical experiences. But in terms of relevance to common practice, my point of view may just as well be described as a fictitious philosophy underlying a work of social fiction waiting for an interesting plot. It entertains me to think about, but it probably has little applicability in modern scientific or data warehouse practices.