It was after my last job ended that I had to find a way to describe what I was doing. Although my activities touched many so-called specialties, I found the term data science to be most inclusive. I was doing everything that I was doing with the objective of understanding data. That was an appealing choice because the concept of data science is very popular with lots of articles saying how much these creatures are in demand. After catching up with the usage of the term, I became disappointed that I don’t really belong to that crowd. Data science may describe everything that I was doing, but it fails to describe the objective that drove what I was doing.
I didn’t think that the actual practice of applying technology was all that special. The technology doesn’t require any distinguishing skills. The popular term data science is nothing more than what we used to call computer science back in the 1970s/1980s when the focus was on algorithms instead of user interfaces. Data science is computer science. I really don’t see the point of making the distinction.
I probably mentioned in some earlier post a conversation I had long ago with a manager of geologists, who said that he considered computer science a skill similar to being able to drive to work. Yes, it takes some training and discipline, but he assumed any scientist in the field would know algorithm programming. He felt little need to probe about programming skills. Even today, with much easier-to-use programming environments, computer scientists don’t realize that scientists can program and invent algorithms.
A geologist is interested in the facts about earth sciences. An anthropologist or sociologist is interested in the facts about human nature. A naturalist may focus on a particular species or a particular ecosystem. Some of the scientists within their ranks take up the task of writing programs to create algorithms to better make sense of the data.
As I have been writing this blog, I noticed that I’ve been taking a deeper interest in the nature of data itself. Using the analogy of the scientists, I began to fancy myself as a naturalist of data.
But even that is easily confused with the modern hype around something called Big Data. Big data is presented as a science about data. However, there is a distinction in its focus on the word big and the implied plural of data. Big data is about the three V-words: volume, variety, and velocity. Big data is about dealing with lots of data quickly.
My professional experience is calloused with work on the three Vs. It’s hard work. I appreciate the attempt to distinguish this as somehow distinct from other types of work. But I was not motivated by the fact that there were three V-words. The three V-words needed to be managed with old-fashioned computer science.
What motivated me was the datum. The single observation of fact even when delivered in a pile of billions of similar observations of facts. The software I wrote and organized needed to handle the large volume of data but with a high degree of respect for the individual datum. Each observation of fact deserved the respect of having an identity.
I didn’t invent this respect for the datum. This respect underlies the basic objective of good databases, with concepts such as ACID (atomicity, consistency, isolation, and durability). Good database design involves an inherent respect for the datum as a singular observation of a fact. However, modern data science does half of its work outside of databases, as captured by the definition of NoSQL as standing for “not only SQL”. Even if the technology is different, the datum remains deserving of the full respect implied in the ideals of good database design.
To capture this focus, I made up the word dedomenology as the naturalist study of datum. In several earlier posts such as this one, I elaborated on how to recognize different information qualities of these observations. In particular, I focused on the different capacities of data to capture an authentic observation of a fact of the real world. I described as bright data an observation that was well controlled and documented so that its observation is completely unambiguous and authentic. Real data has some level of dimness to it. I described other data as dark data when it was invented (using models) to supply what a missing observation should be, or as forbidden data (again using models) to reject real observations that fall outside of what it should be.
The point of these earlier posts was to explain what constitutes the labor of what I then called a data scientist, but now wish to call a dedomenologist. The labor may involve many disciplines of computer science and databases, but that labor is specifically directed toward preserving the integrity of and respect for each individual datum, each of which purports to be an observation of fact.
In reading the current literature of the profession of data science, I find a frustrating lack of attention to the veracity of the data. There is a sense that veracity is something to be addressed once and for all very close to the source. We demand perfect sensors or we demand exacting processes of extraction, transformation, and load (ETL). Once data occupies a data store, there is a desire to employ algorithms that treat every individual datum as a peer in veracity. The primary challenge for algorithms is to fight the battles of volume and velocity.
I noted earlier that the popular definition of data science is the science of fighting the data challenge of very large or very fast. From my experience starting in the early 1980s, this fight of data versus resources is what defined computer science. It is a worthy, challenging, and rewarding discipline in which I’m proud to be a participant. However, I get the lingering impression that computer science is blind to the issues of veracity of data.
In several earlier posts, I described the problem of veracity as the problem of the individual datum standing up to intense and prolonged scrutiny. I focused on the recognition that data is identical to a historical artifact. Even data of an observation a fraction of a second earlier is a non-reproducible piece of evidence of a specific point in time. We don’t look to data scientists to address veracity. Computer scientists may be involved, but they are far from the only interested parties and usually far from the most respected.
Veracity is the domain of the historical sciences such as history, archaeology, and paleontology, or of legal roles such as auditors, crime scene investigators, discovery lawyers, and courtroom lawyers. The project of veracity of even a single datum invites potentially lengthy scrutiny by multiple parties. That scrutiny is debated through rhetoric and logic. The project of veracity may never be settled. For evidence of that, look at the history of cycles of settling and reopening debates about very ancient events based on the same recorded evidence. We are still debating how exactly we should understand the evidence of the Peloponnesian wars of 2500 years ago.
In earlier posts such as this one, I pointed out my respect for medieval church thinkers for their efforts in preserving and refining the rhetorical tools for examining the veracity of a datum. Their data may have been limited to religiously inspired texts, but their project was about diligent examination of that data before coming to conclusions about their immediate decision making. Many modern disciplines of scrutinizing veracity inherit from these earlier practices.
Until recently there was no problem of appreciating the role of the historical sciences in examining the veracity of evidence, because the evidence was rare and infrequent. In fact, this rarity of evidence motivated us to study each piece carefully to extract as much information from it as possible to use for or against an argument. We have a long history that demonstrates the value of this scrutiny when a renewed scrutiny finds something critical concerning the strength of a particular argument.
There are plenty of examples in law as well as in the sciences where old evidence is found to be more or less important than it was interpreted originally. In some cases, this refinement of the interpretation may have been enabled by computer science, but the interpretation itself came from expertise and debate within the relevant disciplines often far away from the computer scientists.
In modern practice, we appreciate that there is a need for determining the veracity of data. In data projects, we often identify the role of domain experts who understand the context of what the observations are attempting to capture as data. A data science project about some factory workflow will involve domain experts in manufacturing processes, plant operations, logistics, labor work rules, and so on. On top of expecting data scientists (once called computer scientists) to solve the problem of fighting the volume and velocity of data, we expect the domain experts to scrutinize the validity of the data throughout the life cycle of the data. We need domain experts to tell us that the input observations are valid. We need domain experts to tell us that the final reported conclusions make sense.
The ability to make sense of data is outside of the expected expertise of data scientists. Unfortunately, the renaming of computer science as data science encourages us to believe that these practitioners may have inherent domain expertise.
This occurs because virtually any domain of knowledge has some common understanding. Especially with modern education and easy access to search tools, it is pretty easy to get a basic introduction to virtually any area of specialized knowledge. A person assigned the task of being a data scientist can read up on the basics and at least be conversant with appropriate use of the key terms.
This is not necessarily a bad thing because the project of interpreting evidence depends on debate with diverse viewpoints. A casual understanding of a domain of knowledge can provide a useful counter argument to encourage the expert to explain key concepts in a way that is easier to understand, for instance. My concern is that we often assume that the elevation of data science to something more than computer science means these practitioners bring sufficient domain expertise.
I return to the term dedomenology that I earlier distinguished from data (or computer) science. I also want to distinguish dedomenology from the domain expertise responsible for assuring the veracity of data. I am interested in focusing on the nature of the datum itself. In addition to wondering what makes data possible in an earlier post, I also want to understand different kinds of datum in terms of what each can say about what actually happened in the real world. In particular, I tapped my earlier experience in simulation to raise suspicions of preconceived models manipulating the observations so that the recorded data confirms those concepts in preference to a credible observation of a real event that could contradict those concepts.
Dedomenology is about distinguishing data in terms of how well it captures an actual event in the real world with solid documentation and control so that we know exactly what is being reported. The idea is that data can range from being very bright (it really did happen) to very dim (we’re not sure what happened) to dark (we guess this might have happened) to forbidden (this could not have happened) to unlit (something found nearby) to accessory (irrelevant observations). All of this variety of data can occupy a spot in a data store. Once in a data store, it becomes a candidate dimension or measure to use for some machine learning algorithm.
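To make the range of categories concrete, here is a minimal sketch of how a data store might carry that label alongside each datum instead of treating every value as an equal peer. The names `Brightness`, `Datum`, `tally`, and `select` are my own inventions for illustration, not part of any existing library, and the short descriptions are just the parenthetical glosses from the paragraph above.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Brightness(Enum):
    """Hypothetical labels for the categories described above."""
    BRIGHT = "it really did happen"
    DIM = "we're not sure what happened"
    DARK = "we guess this might have happened"
    FORBIDDEN = "this could not have happened"
    UNLIT = "something found nearby"
    ACCESSORY = "irrelevant observations"


@dataclass(frozen=True)
class Datum:
    """One observation with its own identity and a brightness label."""
    datum_id: str
    value: float
    brightness: Brightness


def tally(data):
    """Count how many data fall into each brightness category."""
    return Counter(d.brightness for d in data)


def select(data, allowed):
    """Keep only the data whose brightness is in the allowed set."""
    return [d for d in data if d.brightness in allowed]


# Made-up sample values, purely for illustration.
sample = [
    Datum("a1", 98.6, Brightness.BRIGHT),
    Datum("a2", 98.9, Brightness.DIM),
    Datum("a3", 98.7, Brightness.DARK),   # filled in by a model
    Datum("a4", 31.0, Brightness.ACCESSORY),
]

counts = tally(sample)
bright_only = select(sample, {Brightness.BRIGHT})
```

The point of the sketch is only that the label travels with the datum. An algorithm that wants to treat every value as equal has to choose to ignore the label, rather than never knowing it existed.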
Dedomenology provides us a way to restrain ourselves when we attempt to discover new hypotheses (such as promised by predictive analytics), as I described in this post where I suggested someone may find a prediction of some medical procedure outcomes based on the color of clothing a patient wears during a routine doctor’s visit. I countered this deliberately ridiculous example with the somewhat more accepted data point of body mass index (BMI). Both are easy to measure and thus occupy some location in a data store. Both are inferior to more relevant but harder to obtain observations.
Dedomenology is my term for the project of focusing on how to compare the relative merits of different data in terms of how well each datum captures what really happened at a particular time. Dedomenology is also about setting our expectations for investing labor in scrutinizing or suspecting the information content of the datum.