In my last post, I presented a positive case for hiring social science graduates based on their critical theory training. In that post, I described how critical theory provides a benefit in being able to identify the model (the assumptions) embedded within observation data. This can be useful for separating the models from the firm observations about the real world. When we employ statistical or other analytic models to make predictions or prescriptions, we benefit most from basing those conclusions on objective observations that are not biased by our assumptions.
I use the term dark data to describe that biased part of the data that comes from models. I chose the word dark deliberately to invoke the impression that such data deserves scrutiny when our goal is to discover something new about the real world. By the way, in earlier posts, I described a complementary concept that I called forbidden data where we reject actual observations because they depart too far from our theories. Forbidden data is another way for models to bias our data.
Critical theory can provide a beneficial service to data science by supplying the skills to recognize biases in data that come from our preconceptions. I argued that while this is an attractive quality of social science training, it is not as attractive as the critical thinking skills to use evidence effectively to build a defensible argument. It is true that we need to be alert to the biased assumptions underlying our data, but the more important project is to advance our understanding of the world through careful construction of an argument using the best available evidence. I feel that the distinct skills from the classical rhetorical training of the social sciences (before critical theory was popular) are more useful for data science, and currently hard to find.
In this post, I want to point out how critical theory can be unattractive to data science. Critical theory focuses on models. While it is good to learn the ways models may mislead us in our observations, it is not helpful to have new model-generated data introduced into the observations.
Critical theory has a tendency to invent new data. In my last post, I discussed another blog that illustrated critical theory with a short literary passage describing someone’s first-person experience of waking up in the morning. The information in the passage was simply that a person reluctantly got out of bed. I argue that this alone could be useful data if there were some reason to use it. The basic information was that on a particular sunny morning, this narrator got out of bed with some reluctance.
The critical theory illustration goes on to demand more data about the race and sex of the narrator. It points out that when we read the passage, we subconsciously supply this additional information based on our own biases and experience. We do this to visualize the scene. I went on and suggested other biases: perhaps the bed and slippers are very unfamiliar to us, or perhaps the sunlight-determined day is very different from ours because it occurs on some alien planet.
None of this information is in the literal text of the passage, our actual observation data. Because there is no data about these aspects of the scene, our imaginations fill in the details to complete the scene. A similar project occurs when someone adapts a book into a movie: the camera must capture more details than will ever be described in the book. Those details are infinitely variable, but only one version will end up on film. That detail in that scene is modeled data, because the book provides no evidence that the chosen detail is better than any other. This is how, for example, we can cast a Shakespeare play in a modern setting. The play can still work even though the details we supply were never in the imagination of the author. The adaptation introduces invented data.
In data science, I want to isolate the observation from the model in the data that is provided to me. As a data scientist, it is not my job to invent data. My job is to make the best use of the data that is delivered to me. Data is second-hand by the time it gets to me. I described data science earlier in terms of a supply chain that delivers a product representing some refinement of the material supplied to it. The project works with what it receives from its suppliers.
I want to refine the data to extract the parts that are useful for my project (such as hypothesis discovery, testing, prediction, or prescription) and set aside the data that would be counter-productive. For hypothesis discovery, model-generated data is counter-productive because it will reinforce previous theories.
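This separation can be sketched in a few lines of code. The example below is a minimal illustration, not a real pipeline; the records, the `provenance` field, and its values are all hypothetical assumptions standing in for whatever tagging a real data supply chain would provide.

```python
# Hypothetical example: each record carries a provenance tag telling us
# whether it is a firm observation or something a model filled in.
records = [
    {"temperature": 21.3, "provenance": "observed"},
    {"temperature": 22.1, "provenance": "observed"},
    {"temperature": 21.8, "provenance": "modeled"},   # dark data
    {"temperature": 23.0, "provenance": "observed"},
]

# Partition before hypothesis discovery, so model-generated data
# cannot reinforce the very theories that produced it.
observed_only = [r for r in records if r["provenance"] == "observed"]
dark_data = [r for r in records if r["provenance"] == "modeled"]

print(len(observed_only))  # firm observations kept for discovery
print(len(dark_data))      # model-generated records set aside for scrutiny
```

The point of the sketch is only that the partition happens up front: the dark data is not deleted, but it is quarantined from the hypothesis-discovery step.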
The above example of critical theory demonstrates an invitation to invent new data that fills in the missing details. The goal of critical theory is to expose the biases in invented data, but to make that point it must invent the biased data by going outside the literal text. As I mentioned about the literal passage, there is no specific requirement that I visualize the narrator as having a particular race, sex, or occupation. The passage stands on its own as a description of a first-person experience of getting out of bed. However, to make its point, critical theory imposes a requirement to visualize the narrator, and it provides a default prototype for meeting this invented requirement. Once we expose this prototype, we have the opportunity to explore our disagreements with it.
I find this aptitude of critical theory repulsive to data science. We work with the data provided to us. Our job is to make the best use of that data to meet our goal. Part of that job is to identify the limitations of the data. If I were given the example literary passage and someone inquired about the narrator’s body-mass index, the only correct answer is that I have no data on that. Instead, I fear a person trained in critical theory would answer the question with “it depends, but 25 is a good guess”. The question cannot be answered at all, and the answer of “it depends…” is not helpful. The data is completely silent on this information. The correct answer for data science is “you need to ask someone who has that data”.
This is not a trivial distinction. The process of critical theory analysis asserts that despite the absence of that information in the data, you might know the answer, although that answer might be wrong. This assertion is an unnecessary introduction of dark data: model-generated data that substitutes for a missing observation. For data science, there is no justification for this assertion. The suggestion that we might know something impedes progress by eliminating the motivation to seek out someone who might actually know. The correct response of finding someone who does know the answer provides the opportunity to incorporate this as a new data source. Each new unanswerable question is an opportunity to seek out a new data source that provides a definitive answer, so that we can answer a similar question in the future.
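The attitude I am arguing for can be made concrete. Here is a minimal sketch, with all names hypothetical: a lookup that answers only from the delivered observations, and when asked about something the data is silent on (like the narrator’s body-mass index), refuses to guess and instead points the questioner at a new data source.

```python
class MissingDataError(LookupError):
    """Raised when a question falls outside the available observations."""

def answer(question, observations):
    """Answer only from delivered data; never substitute a guess."""
    if question not in observations:
        # The data-science answer: we have no data on that.
        raise MissingDataError(
            f"no data on {question!r}; ask someone who has that data"
        )
    return observations[question]

# The literal content of the literary passage, as a tiny data set.
passage_data = {
    "time_of_day": "morning",
    "weather": "sunny",
    "action": "got out of bed reluctantly",
}

print(answer("weather", passage_data))       # answerable from the text
try:
    answer("body_mass_index", passage_data)  # not observed: refuse to guess
except MissingDataError as err:
    print(err)
```

The design choice is that the absence of data raises an explicit error rather than returning a default like 25; the error message itself carries the productive response of seeking out a new source.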
All college-level education involves training at different levels. The first level is to know the subject. A deeper level of training is to develop an attitude about how to approach the data. Critical theory, for example, is a valuable subject that teaches us that we bring our biases to our observations. However, the training for critical theory (emphasized by repeated application through multiple courses) is to fill in missing data with assumptions that in turn can be challenged as a bias.
The training predisposes the individual to answer a question unsupported by evidence with “I might know, but I might be wrong”. Because the critical-theory-based curriculum repeats this lesson through many courses, this attitude becomes deeply ingrained. Critical theory can be a disqualification for a data science job, because it will be too difficult to unlearn this attitude of accepting debatable presumptions as a substitute for missing data.
Most of critical theory focuses on interpreting written texts. The text may be a work of fiction, in which case critical theory provides new ways to enjoy old texts, much as cinematographers adapt old works to different settings. Other forms of text include historical accounts such as chronicles, biographies, legislation, or judicial decisions. These can also be subjects of critical-theory interpretation.
Data science includes text processing. For text processing, I think we can benefit from critical theory to inform us of algorithmic biases, such as might occur in sentiment analysis. However, much of data science involves numeric or descriptive observations provided by automated machine sensors. These observations come from non-human sources. Even though the sensors were created by humans, successive or distributed observations from the same model of sensor will not vary because of some bias that changes with the situation.
Nonetheless, I see evidence of the critical-theory attitude appearing in observational data. This is the dark data I complain about: the deliberate insertion of an assumption that we might know what occurred even though we have no observation. This is the kind of data I would like to isolate from what we actually observed.
An example of this attitude of might-know data occurs in astronomy’s subject of dark matter. Dark matter itself is dark data because we have no direct observation of what supplies this mass. Despite the lack of direct observations, we have good evidence in galactic motions and gravitational lensing effects to know it must exist. The attitude of might-know comes in when we try to describe what this dark matter looks like. It could be a new particle (or family of particles), it might be something clumpier, it might be hot or cold, it might decay or it might not. These conjectures appear to me to be similar to the literary example of arguing about the race and sex of a first-person narrator. We obviously do not know, but we suggest it is useful to make something up with the understanding that it is debatable. Data science benefits more from the modest acceptance of ignorance without any conjecture.
Data science projects are best advanced by accepting the fact that we have no data at all on a question. The faster we can come to this conclusion, the faster we can seek out an appropriate source for the data. Also, a quick admission of our ignorance allows us to protect the reputation of the data products (such as predictions) by recognizing the irrelevance of the data to certain kinds of questions. We risk embarrassment when we justify a quick response that we might know, based on conditioning that teaches us that debatable guesses are useful.
For data science, it is counter-productive to insert debatable assumptions as substitutes for missing observations. At the very least, introducing debatable data interferes with our goals for velocity. There is no time for unnecessary debate.
In the current labor market, we debate the intrinsic value of various types of college degrees and disciplines. In recent years, we have elevated STEM disciplines as being more valuable in the labor market than the liberal arts and humanities. In my earlier posts, I made a case for my respect for the liberal arts and humanities by focusing specifically on their training in the rhetorical arts of preparing, presenting, and defending arguments based on available evidence. I think this is a valuable skill for data science, even though the evidence is data and the interpretations come as algorithms and visualizations. Although data science is largely automated with software, there remains a skill that STEM practitioners call story-telling. Rhetoric is a more appropriate term, but one to which students in STEM fields get only the barest introduction. This essential skill is the ability to persuade a decision maker to act on what is presented in the analytics and visualizations. Without that final persuasion, the data scientist offers nothing of value to the organization.
Data science projects need people who can build persuasive arguments. I believe such people are available from liberal arts training (including the social sciences and humanities). These disciplines have a distinct advantage for training rhetorical skills because their pool of evidence is so limited. Typical studies in these fields have a very small number of observations from which to construct arguments that persuade others. In contrast, because the STEM side of data science involves very large amounts of data, the data is overwhelming for rhetorical training in a classroom setting. The easier way to learn rhetoric is to focus on arguments with intrinsically smaller data sets, such as those found in the social sciences and humanities.
I describe the liberal arts in their classical sense, as liberal arts education was taught throughout most of history. In recent decades, the liberal arts have increasingly focused on critical theory at the expense of classical argumentation. While classical training focuses on building persuasive arguments based on evidence, critical theory directs its attention to the invented data surrounding the actual evidence. Critical theory tends to find reasons to dissuade arguments and to encourage arguments that can never reach agreement, except to conclude that the answer is relative to the individual, culture, or circumstances.
As I mentioned, I find some value in critical theory for extracting models from real-world observations. But overall my attitude toward the topic is negative. I think critical theory distracts from the productive work of building persuasive arguments, and that it distracts college students from productive training in the classical rhetorical skills of building and defending arguments that have a chance to persuade.