Many recent political and social debates end up getting characterized as science versus denier (anti-science, pseudoscience, etc.). Often these debates are between different theories of how some aspect of the world works, where the theories have different levels of supporting evidence. Our debates seek the stronger theory based on which one has the best and most compelling data. In debate, it is legitimate to challenge the accuracy or relevance of the supporting data.
For example, in the global climate change debate, the group identified as deniers points out that recent measurements of global temperature have not risen as predicted by models despite rising levels of CO2. In the evolution debate, the almost certain confidence that bacteria will evolve to become antibiotic resistant implies that evolution progresses with the deliberate intention to continue to harass man, instead of through random variation that could lead the pathogen to a different environment.
My point here is not to take up these arguments but instead to point out the difference between arguing over theories and arguing over data. I am disappointed that prominent defenders of a theory will often denigrate those who challenge the data as being anti-science. The scientific aspect of the argument is about competing theories. The stronger theory is the one that has the strongest supporting data. The data support or undermine a theory. Data can be inaccurate or irrelevant to a theory. As a result, skepticism about the data’s accuracy or relevance is an essential part of the debate that makes science possible and distinguishes science from superstition.
In an earlier post, I attempted to identify where data fits in science. In that post, I divided science into three parts: past-tense science (studying old observations with time stamps that can never be reproduced), present-tense science (careful operation of systems that produce new observations), and future-tense persuasion (decision making, more art than science). In practice, a large number of fields fall into the past-tense and present-tense sciences. Often the same individual scientist will use present-tense science to conduct experiments that collect new data and then use past-tense science to evaluate those now-historic time-stamped observations.
In my thinking about data science, I found it useful to distinguish the sciences by the past- or present-tense focus of the professional’s activities. In particular, there is a synergistic relationship between present- and past-tense sciences. Present-tense science employs theories from past-tense science. These theories permit engineering new things that can be used to challenge the theories with new observations. Meanwhile, past-tense science needs new observations to progress debates between competing theories or to invent new ones.
In the modern debates that often degenerate into name-calling of one side or the other as anti-science or science-deniers, the subject of the debate is usually competing theories. Usually, these debates are within what I call past-tense science. There are multiple theories that attempt to explain nature, and the different adherents each try to argue that theirs is the stronger one. Advocacy for or opposition to any theory is essentially what past-tense science is about. The debate involves rhetorical tools, and such tools can include fallacies such as ad hominems. That’s all part of debating. I assume the participants are skilled enough in rhetoric to recognize and eliminate the fallacies.
Meanwhile, there is a different aspect of the debate involving the evidence itself. As I explained in my earlier posts, this is also a part of the customary practice of past-tense science. The perpetual problem with past-tense science is that it must make sense of very limited observations from the past. Those observations are always inadequate in some way. The evidence may have suspect authenticity. The evidence may have some error or inaccuracy. The evidence may not be relevant or may be ambiguous. There is never enough evidence to satisfactorily settle a debate. A generation of debaters may settle a debate through intellectual exhaustion only to have a succeeding generation reopen it in light of new evidence or new ways of looking at old evidence. Meanwhile, many very old theories persist over time despite major changes in the data that support them.
In this discussion of present-tense and past-tense sciences, I would like to place our modern concept of data in the present-tense science and the theories about nature in the past-tense science. The data scientist, in my view, fills a vital role between the two sciences: to scrutinize the appropriateness of data to a particular theory. One consequence of this view is my recurring disapproval of model-generated data. Model-generated data uses our past-tense biased theories to provide substitutions for missing observations. Our subsequent analysis of the latest observations can be misled into concluding new support for a theory when in fact the supporting evidence came from that same theory. My preference is to keep model-generated data out of new data. All new data should be as-measured data. This forces the past-tense scientist (analyst) to confront the omissions and anomalies in the context of the specific theory he is working on.
In an earlier post, I presented my idea that we should rename the commercial profession of data scientist to data clerk. This places the emphasis on scrutinizing the actual data by linking the profession to accountants and related professions. In our current usage of data scientist as a subset of computer science, we imply a legal immunity for data scientists. Computer scientists make products that delegate responsibility to those who use those products. Computer science still retains a notion of a “use at your own risk” disclaimer for its software. Computer scientists can assure us that the software meets some specification but cannot assure us that the use of the software will meet our needs. We need to expect more responsibility from the profession we call data science.
In contrast to the computer scientist, the data clerk would be held accountable for the actual data’s accuracy and relevance to the task. I imagine a data clerk having a professional liability similar to that of a certified public accountant: to attest that the data meets the standards for the task for which it is used. Following the analogy, the accountant asserts that a financial statement presents numbers following best practices. The accountant’s product, such as the financial statement, does not evaluate the viability or future prospects of the company. Instead it merely presents the current accounting of assets and liabilities. My point is that the accountant takes responsibility for the data, not for the fate of the company. Similarly, a data clerk should take responsibility for the data, not for the theory the data may support or refute.
I think data clerk best describes the essential data scientist. The clerk term describes the integrity and trust we most want from these professionals.
Most data science work is computer science of implementing algorithms to map and aggregate data. The computer science mentality is to hand the processed data to a decision maker to use at his own risk. Ultimately, the computer scientist will take responsibility only for the code, not for the data.
In addition, I argued that the science part of the term data science implies that the profession requires advanced college research degrees and the accompanying reputation of the researcher. Certainly there is a need for such research, but that demand is small in comparison with the industry’s more urgent need for accountability for the data itself. When some computer system presents extremely compelling analysis and visualization, we need someone to take personal responsibility to assure everyone that the presentation represents reality instead of representing clever programming. We need someone’s signature at the bottom of the analysis, where that person accepts full responsibility on pain of loss of reputation. This person will be responsible for the interpretations of the final analysis presentation and visualization. The concept of accountants or clerks better describes this responsibility than either computer scientists or research scientists.
For this post, I offer another reason to retire the term data scientist in favor of data clerk or dedomenologist. Data antagonizes science like anti-matter antagonizes matter.
Above I mentioned the interdependence yet competition between present-tense and past-tense sciences. New observations from present-tense science can contradict theories that come from past-tense science. Sometimes the present-tense science deliberately seeks out this conflict, and succeeds. Past-tense science does not like this, but eventually it must come up with theories that acknowledge this new information. In this scenario, the past-tense scientists, the custodians of theories, object that the present-tense science appears to be anti-science. They are correct in the sense that the new observations damaged a theory, and that the popular notion of “science” is about theories, not data. Theories are human stories that we assert are trustworthy characterizations of the natural world.
Any attack on theories is an attack on science. Data attacks theories.
Recent popular debates seem to place all of science in the past-tense science. This view is that the totality of science is the body of theories that we have accumulated. When we teach science in schools, we teach these theories. To the extent that we teach evidence, it is historical examples of evidence that provided key support for the theories. It is only in advanced research studies (PhD programs) where students will begin to learn about collecting new evidence that can challenge theories. It is not surprising that the population equates science to theories. What they learn under the name of science is mostly about theories from the past-tense sciences.
There is a sense that in recent times, theories are increasingly attacked, criticized, or even dismissed. Because the popular definitions equate theories to science, we interpret this as an increasing hostility to science. Using this popular definition, we immediately identify the usual suspects for enemies of science: religion, superstition, pseudoscience, trickery, etc. These are historically perpetual enemies of science. They have always been around, appealing to roughly the same portion of the population. There must be another explanation for the recent increase in questioning theories other than increasing ignorance or gullibility.
Coincidentally, we are living in a technology revolution in collecting, retrieving, and analyzing data in high volume, velocity, and variety (the three Vs of Big Data). This revolution is occurring at the same time the previous Internet revolution has matured to permit rapid communications across complex networks of inter-relationships.
Today, our access to data is unprecedented, even incomparable to what was available in the recent past. Individuals with meager Internet connections have access to vast amounts of high-quality data recording observations from nearly everywhere. Recently, we have begun to use the phrase Internet of Things (IoT) to describe the variety of possible sensors that are being deployed in great numbers. In the extreme futuristic vision of IoT, every thing will be a sensor or have a sensor devoted to it.
Humans have a natural capacity to figure things out from their own observations. Very accurate archers and sling-throwers existed millennia before Isaac Newton came up with a science to describe projectile motion. These projectile-throwers of the past understood the essential science of the motion well enough to obtain lethal accuracy. They lacked a human-language explanation of this science. What we today value as science is not the inherent understanding of nature, but instead the elegant expression of a theory in a human (or mathematical) language.
For a long time, human language descriptions of theories were far cheaper than obtaining observations to figure things out individually. The accurate projectile throwers above needed long and extensive training not only to develop the right physical capabilities but also to observe the variety of target scenarios they may confront. A theory permits engineering and ultimately computers to replicate this knowledge without requiring each individual to invest in expensive data collection in the form of training.
The ability to figure out how nature operates is a natural talent for most humans. What made science hard and scientists rare was the difficulty of translating natural laws into human or mathematical language. Until recently, scientists also faced a tedious and time-consuming task of obtaining observational data for their theories. In contrast to the trained thrower who can immediately adjust his form from what he observes with his eyes, the scientist needs measurements that can be recorded, and this often required specialized (and often expensive and delicate) instruments. Science was hard and expensive. The ultimate goal of science, to understand how the world operates, is not hard. Creating a narrative to communicate the theory to others is hard.
Nearly everyone can figure out how nature works and can make predictions once he has access to relevant and accurate data. Until recently, only people trained and credentialed as scientists had access to relevant and accurate data. That exclusiveness gave science its value. The market for theories existed because theories provided the population with an explanation of how the world works when they personally did not have the resources to observe the relevant data for themselves. The high value of theories in this market came from the fact that so few people had access to the data needed to make theories.
Theories are not essential to understanding the world. We can learn for ourselves an understanding of the world without the ability to explain it to others. The ones who win in competitions (whether it is warfare or in running businesses) are likely the ones who have a better understanding of how the world works.
Despite the achievements of science to explain much of the natural world, we are still mystified as to why some people are better leaders and succeed more consistently than others. For such leaders, we have no theory that explains their success in a way that assures success to others following the same theory. We attribute their success to their talents. We mourn their deaths in large part due to our loss of those talents because the leader was unable to transfer that understanding to someone else. Somehow, these individuals were able to understand the world through their own observations but in a way that they were unable to express in human language. They understood but that understanding is not translatable into a story others can read.
The recent revolution of data, Big Data and IoT, on top of the matured Internet communications revolution, has made possible something very new to human experience. Today everyone has cheap access to abundant data of every conceivable variety. We can now observe for ourselves virtually everything with very little marginal cost over the cost of living. For example, I can at this moment see what weather is occurring right now at some location on the opposite side of the globe: not just the forecast, not just the current radar images, but also live cameras for a sample location. I can do this at virtually no cost to myself, and may do it out of mere curiosity or even boredom.
My point is that access to observations is cheap and widely available. People can see data for themselves and come up with their own conclusions. Being human, their conclusions are likely to be effective in terms of understanding something about the world even if they cannot explain it to others in human language.
In an earlier post, I suggested that if even the first humans had had access to modern data technologies, they probably never would have bothered to come up with scientific theories. The data can tell them all they need to know about how the world operates. Simple statistical tools can create trends that can consistently predict the future, at least in the short term that matters for most human activities. Even spurious trends that we find ridiculous today can provide some predictive power that over time may provide more benefits than costs.
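To illustrate the kind of simple statistical tool I have in mind, here is a minimal sketch: it fits a trend line to a handful of recent observations and extrapolates one step ahead, with no theory of the underlying process. The readings and the helper function are invented for illustration.

```python
# Sketch: theory-free short-term prediction from raw observations alone.
# The observation series below is hypothetical.

def linear_trend(xs, ys):
    """Ordinary least-squares slope and intercept for a simple trend line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    return slope, mean_y - slope * mean_x

# Hypothetical daily readings of some quantity; no model of why it changes.
days = [0, 1, 2, 3, 4, 5]
readings = [10.0, 10.8, 12.1, 12.9, 14.2, 15.0]

slope, intercept = linear_trend(days, readings)
tomorrow = slope * 6 + intercept  # extrapolate one day ahead
print(round(tomorrow, 1))  # → 16.1
```

The prediction carries no explanation of why the readings rise; it is useful only over the short horizon where the trend holds, which is exactly the limited claim made above.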
My concept of a dedomenocracy, a government purely by data, is based on this notion of making all decisions on data alone, absent any human theories or concepts of Truth. I suggested that this government may work better than human governments (including modern democratic ones) if we permit the government by data to make rules rapidly based only on the available data, but also where the rules are short-lived and have no constraint to be consistent or to honor precedents. The government will decide what needs to be done right now based on data that identifies the highest priorities and the most promising predictions. I explained that such a government would be unique in human history because it will demand a strict form of authoritarianism (with similarities to a theocracy) while restraining its focus to very few rules for short periods and thus rarely affecting any particular sub-population (with similarities to a libertarian government). My point in mentioning it here is that this concept of government dispenses with all human theories entirely.
Dedomenocracy is the ultimate antagonist of science because it uses data technologies to create enforceable rules with no input from scientific theories. Instead of relying on the theory products of past-tense science, dedomenocracy employs the repository of all observations from present-tense science. A definition of science as the body of accepted theories would accuse dedomenocracy of being anti-science. However, this is not fair because dedomenocracy uses evidence in the form of data. It simply discounts any value from human cognitive theories. Because it avoids human foibles entirely, I would argue that dedomenocracy is more scientific than strict obedience to accepted human theories.
When talking about dedomenocracy, I am extrapolating from recent trends to a future point where we would have access to far more extensive data than we have today. This would be a data store with a history of observations of virtually everything that is significant to humans. I’m also extrapolating continued improvements in data query capabilities to permit rapid retrieval of relevant data sets to feed statistics-based analytic algorithms that make predictions or prescriptions. It is with this future capability that I suggest we would no longer need theories about how the world works, because we would be able to retrieve observations of similar cases to estimate what would happen.
For example, with sufficient measurements, we can replace Newton’s physical laws of motion by querying observations of moving objects in similar circumstances. If we want to predict how an object would fall to the ground when released, we can query for observations of falling objects with time-stamps for different times of the fall. In the future, there would be a huge number of observations to work from and even if the data for particular falling objects may be incomplete, we can combine all of the observations to derive a trend of how speed changes at different positions of the fall. There would be a statistical trend line that can tell us what to expect. In addition, the algorithms can perform sensitivity analysis to show that the motion is independent of mass. We can use this statistical result to predict what will happen for a future instance of a falling object.
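A toy sketch of this idea, with entirely simulated observations standing in for the future data store: the records below are generated by me for illustration (the hypothetical analyst never sees the 9.81 constant that produces them). The analyst only queries speed-versus-time records for falls of different masses and checks whether the fitted trend depends on mass.

```python
import random

random.seed(0)
G = 9.81  # generating process, hidden from the hypothetical analyst

# Simulated observation store: (mass_kg, time_s, speed_m_per_s) records
# from many independently measured falls, with measurement noise.
observations = [
    (mass, t, G * t + random.gauss(0, 0.2))
    for mass in (0.5, 2.0, 10.0)
    for t in [i * 0.1 for i in range(1, 31)]
]

def fitted_slope(records):
    """Least-squares slope of speed against time for a set of records."""
    xs = [t for _, t, _ in records]
    ys = [v for _, _, v in records]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
           sum((x - mx) ** 2 for x in xs)

# "Sensitivity analysis": query each mass separately and compare trends.
# The slopes agree to within noise, so the motion is independent of mass --
# a conclusion reached without ever stating a law of motion.
for mass in (0.5, 2.0, 10.0):
    subset = [r for r in observations if r[0] == mass]
    print(mass, round(fitted_slope(subset), 2))
```

No theory is ever articulated: the engine only reports that the speed trend per unit time is the same number regardless of mass, which is all a decision-maker would need for a prediction.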
I realize this is a ridiculous example, and the laws of motion are simple enough to compute now that we already know them. I use it as an analogy for approaching any problem by querying relevant data and performing analytics to predict what will happen. In the motion example, the data engine is in effect repeating the past-tense science of coming up with a theory. The difference is that the theory is never explicitly stated in a human narrative. Even the computed trends may be hidden. All the human decision-maker needs is the final prediction with corresponding data about confidence and predictive skill.
Although we may never use data this way to replace our classical theories of motion, we will likely use this approach for more complex problems where the theories are more contentious. An example is the theory of global climate change due to CO2 concentrations. Given the historic data we observed until recently, we could estimate continued growth in global temperature as CO2 levels increased. Because my concept of dedomenocracy is authoritarian, it does not need a human narrative to name and explain the theory. It can conclude from this data an immediate need to control CO2 to stop the temperature rise. In recent years, the relationship has changed: temperatures are now stable despite the continued rise of CO2. My concept of dedomenocracy is that it will make frequent new rules to replace old rules because rules are only temporary. Also, dedomenocracy is free to make rules that contradict older rules. In this example, the dedomenocracy may conclude from recent observations that this is no longer an urgent issue requiring regulation. Because the initial rule never had a human cognitive theoretic explanation, we would not object to this change in rule. The change in rule is justified by the change in evidence. Also, we take comfort in the fact that because dedomenocracy makes new rules frequently, it can quickly re-introduce a regulation as soon as observations confirm the original trend of rising temperature with CO2 levels.
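The rule-making behavior described here can be sketched as a small hypothetical engine. The temperature anomaly series, the window size, and the threshold are all invented for illustration; the point is only that the engine re-derives its single rule from the latest window of observations, keeps no theory and no precedent, and so lets the rule lapse or return as the trend changes.

```python
def trend_slope(series):
    """Least-squares slope of an equally spaced series of observations."""
    n = len(series)
    mx = (n - 1) / 2
    my = sum(series) / n
    denom = sum((i - mx) ** 2 for i in range(n))
    return sum((i - mx) * (y - my) for i, y in enumerate(series)) / denom

def current_rule(temps, window=10, threshold=0.05):
    """Re-derive the rule from only the latest window; no precedent is kept."""
    recent = temps[-window:]
    return "regulate CO2" if trend_slope(recent) > threshold else "no rule"

# Invented temperature anomaly series: a rising phase, then a plateau.
rising = [0.1 * i for i in range(15)]   # clear upward trend
plateau = rising + [1.4] * 10           # trend flattens out

print(current_rule(rising))   # → regulate CO2
print(current_rule(plateau))  # → no rule (same engine, newer data)
```

Because the rule is recomputed from data alone, the reversal needs no retraction of a theory; the engine would just as readily re-issue the regulation if a later window showed the upward trend again.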
Making decisions based on data alone follows the above pattern of working only with observations and not with theories. The dismissal of human theories appears to be anti-science, but the process of working from recent (and historic) observations to derive new trends is a very objective approach. This is evidence-based decision making instead of theory-based decision making. I suspect that the purely evidence-based approach will outperform the theory-based approach, especially if we allow evidence-based approaches the agility to respond to the latest observations. The problem with the theory-based approach is that it takes too long for human consensus to change to reject an old theory and replace it with a new one. It is easier if we never bother to make theories in the first place.
In this sense I see data as antagonistic to the concept of science being a body of theories. In addition to the classic challenge of new data potentially disproving an old theory, the modern reality of practical data technologies makes possible decision making based on data alone without any need for human cognitive theory to justify the decisions.