Modern data science focuses on the technologies for obtaining, cleaning, managing, aggregating, verifying, securing, and analytically processing data. The challenge for data scientists concerns capacity: pushing technologies to grow along the 3Vs of big data: volume, velocity, and variety. There are many technology specialties within data science. These technologies are rapidly maturing in terms of being supported by commercial vendors and of having readily available training for certification of various skills.
Recently, I have been sketching a concept of government by data (a dedomenocracy) that applies data science to the problem of government. In such a government, automated analytics would be the sole source of the number and content of new rules. Despite this automation, there remains a human role in data science that is open to the population. I imagined that the future citizen will be skilled in data science instead of in democratic debate. A dedomenocracy changes the nature of the topics for popular debate. Instead of debating the policies, the population will debate the data.
In my last post, I objected to what I called spark data: data deliberately introduced to distract the algorithms toward some irrelevant topic in order to avoid painful choices on the more relevant (and urgent) topics. In a dedomenocracy, the high frequency of short-lived rules inherently limits the number of rules that the population can reasonably be expected to follow. If spark data succeeds in distracting the dedomenocracy into producing an irrelevant rule, that rule can displace a more relevant one. As a result, I recommended that the population needs skills to recognize spark data and keep it from entering the store available for analytic rule-making.
In earlier posts, I wondered about the relevance of classical education to the modern data-driven world. I proposed that data science is part of the historical sciences of interpreting evidence. The classic historical sciences (history, archaeology, forensics, etc.) are specialties within a broader science I called dedomenology, the study of the datum. All of the historical studies share a common problem of interpreting the evidence available today that applies to past events. Although evidence takes many forms and is often physical, the evidence is data. Historical studies are about arguing over the data: its relevance, its reliability, its degree of separation from the topic, its accuracy. They are also about arguing over missing data and how it is filled in with suppositions, such as assuming that ancient civilizations were populated with humans who shared some basic traits of humanity (emotions, motivations, etc.) with ourselves.
At the core of this commonality of the historical sciences is the discipline of proper argumentation that became the focus of scholastic education in the liberal arts, such as the trivium. Proper argumentation involves identifying and avoiding fallacies. In trivium terms, fallacies can be grammatical, logical, or rhetorical.
Modern education, especially in the STEM fields that educate most data scientists, has dismissed the trivium as irrelevant to practical skills. Practical skills are analogous to the quadrivium: arithmetic, geometry, music (or ratios and relationships), and astronomy (observations). In data science, the practical skills may be a quad consisting of technologies (computer languages and practices), data processing (ETL, query, visualization), data modeling, and analytics (statistics or machine learning).
Data science represents a mastery of the modern quadrivium. However, in classical education, the quadrivium came after the trivium. In modern times, our quadrivium comes first. Perhaps this is a necessity because our technologies are vastly richer than the previously taught concepts of arithmetic, geometry, music, and astronomy. Proficiency in the modern quadrivium demands extensive study that we’ve pushed all the way back to preschool age, such as giving preschoolers access to age-appropriate computers so that they can begin practicing programming concepts. International economic competitiveness depends on the ability to produce globally successful technological innovations. These innovations require the skills that are the focus of modern education.
These technological skills (STEM skills) are vitally important for the nation’s continued economic success. To stay successful, we need to keep educating for the best of these skills.
However, I wonder if we are missing the historical lesson from our ancestors, who felt that the trivium was more important. The trivium was taught first. Also, the trivium was considered sufficient for a liberated man. The quadrivium was a specialization that provided additional benefits, but it was optional. The critical skill for liberation was the ability to think critically in order to produce valid arguments. On a personal level, a person’s reputation depends on his ability to make a valid argument. On a social level, society benefits from valid arguments and suffers from invalid ones. An invalid argument contains a fallacy. A simplistic definition of the trivium is that it teaches how to identify fallacies in grammar, logic, and rhetoric.
Modern data science is not interested in trivium fallacies. If the term fallacy comes up in discussions, it is in the more practical sense of data validity, such as in the data-cleaning stage of ETL or in distinguishing causation from correlation in analytics. I place these as quadrivium-level skills for practical applications of data.
Trivium fallacies have little relevance to data science. For example, there is nothing comparable to grammar in data: as long as data is verified, it is interchangeable with any other data. We do not categorize data by how it may be used, the way grammar has different words for nouns and verbs or different word-forms for subjects and predicates. Although data science includes computer science based on computer logic, this logic does not include higher logical principles such as valid syllogisms. Finally, the trivium’s concept of rhetoric appears completely irrelevant to modern data science.
In modern data science, once a datum enters a data store, it is a peer with any other datum in that store. There are no grammatical, logical, or rhetorical constraints on usage of the datum in analytics.
This morning, I read an article that describes a government health department form for registering a live birth where the form allows the mother giving birth to identify as a male:
“To be clear, it is possible for a person who has given birth to a child to identify as male,” Susan Sommer, a lawyer for Lambda Legal, an advocacy group for lesbians, gay men, bisexuals and transgender people, told the paper.
She said that given various transgender stages, there is room for the person who gives birth to check the male box.
This is a medical health form that provides data for the medical health records of both the mother and the child. The form allows for the independence of two pieces of information: “mother giving birth” and “sex of this person”. This could be explained as sloppy construction of a simple Cartesian product of the two options for parent role and the two options for sex. However, the modern recognition of same-sex marriage presents the possibility that both parents are of the same sex and that the mother giving birth self-identifies as a male. There is no constraint on this data, and ultimately it will become a permanent part of the public health record that will persist even until the child’s death from old age. Modern data science does not object to this data on the grounds of any fallacy. The data is valid as long as it can be traced to an official form that is properly witnessed and signed. In terms of modern data practice, this data will be as valid as any other birth registration record. In the modern practice of fluid gender identity and same-sex marriage, any combination of responses will meet the business rules for this form.
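The form’s acceptance of every combination can be pictured as the Cartesian product described above: the business rules validate each field against its own domain but never relate the two fields to each other. A minimal sketch of the idea in Python (the field names and domains are hypothetical, not taken from any actual form):

```python
from itertools import product

# Hypothetical field domains for a birth-registration form.
PARENT_ROLES = ["mother giving birth", "other parent"]
SEXES = ["male", "female"]

def passes_business_rules(role: str, sex: str) -> bool:
    """Business rules check only that each field lies in its own domain;
    no rule relates the two fields to each other."""
    return role in PARENT_ROLES and sex in SEXES

# Every cell of the 2x2 Cartesian product passes, including
# ("mother giving birth", "male").
for role, sex in product(PARENT_ROLES, SEXES):
    assert passes_business_rules(role, sex)
```

Because validation is strictly per-field, the combination (“mother giving birth”, “male”) passes as readily as any other cell of the product.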
Eventually, there will be an analysis of medical outcomes that will consider the medical data about a person’s parents and that analysis will encounter a person with parents of the same biological sex and where the mother giving birth is a male. I don’t doubt that some machine-learning technique will find a way to use this information but it will isolate this subgroup from children who have traditionally defined parentage.
For example, an algorithm that attempts to quantify a health risk based on the health history of the biological father may recognize that the self-identification on the form is not the biological truth, but it will still separate this subgroup from the traditional group of persons with an explicit designation of an unknown father. This new subgroup carries an assertion that the father happens to be the other partner, who is female and identifies as female. The subgroup is a proper subset of people who have unknown biological fathers, but isolating it reduces the algorithm’s effectiveness: it shrinks the population in the set of explicitly unknown biological fathers, and the algorithm must treat the smaller subgroup separately as a set of people with known fathers who cannot be biological fathers.
In modern data science, the analytic algorithm must work on a collection of data that has no grammatical, logical, or rhetorical rules.
The medical record example is relevant to my earlier discussion about dedomenocracy. The government is taking a larger role in managing the health care options of individuals in the population. Increasingly, this management uses data science for evidence-based medical care. For a particular patient’s care, the rules must follow the evidence based on the peers of that patient. (In the above example, the unfortunate child belongs to a group with two female parents and a birth-mother identifying as a male. The rule for this child’s access to care will be constrained or informed by evidence from children with similar parental data.) We are currently building a dedomenocracy for the specific issues related to access to health care.
An example is the government defining the medical necessity that qualifies for insurance coverage. Although these definitions are for government insurance programs such as Medicare or Medicaid, the broader insurance market will follow comparable definitions. For the specific case in the above-linked article for Oklahoma:
2. Documentation submitted in order to request services or substantiate previously provided services must demonstrate through adequate objective medical records, evidence sufficient to justify the client’s need for the service;
Evidence is data. Although the current intent of language like this may focus on published scientific findings, this specific example identifies objective medical records, which I read as the specific medical record of the specific patient. The evidence must show not only a scientific finding of appropriateness but also that this specific patient belongs to a group that will benefit. Data will determine whether a procedure is covered and consequently whether that procedure will be available to the patient. With the 3Vs of big data, this objective data eventually will include the birth registration data in the determination of the patient’s need for service.
Dedomenocracy needs more data scrutiny than is provided by data governance and other data standards that accept data uncritically so long as it meets the programmed business rules. Data that is good, clean, and acceptable in the context of modern data science practices may still be the wrong data to have in the data set. I described this additional scrutiny as a practice analogous to classical rhetoric but applied to data instead of arguments. Data should meet tests against fallacies just as arguments are tested against fallacies of grammar, logic, or rhetoric. The above example of a medical health record of a birth with same-sex parents and the mother identifying as a male is analogous to a grammatical error even though the data itself meets the business rules for the form. We should be able to object to using this data for some purposes, such as determining eligibility or medical necessity for health services, just as we would reject a grammatically incorrect sentence in a formal argument.
As we move toward automated decision making based on data, evidence-based decision making, or dedomenocracy, we need the opportunity to reject fallacious data, comparable to rejecting grammatical, logical, or rhetorical errors. For this post, I want to focus on what I consider comparable to grammatical errors of data. A grammatical error for data is using a particular datum in a way that makes no sense even though the datum meets the latest governance and business rules.
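A grammar-level check of this kind would sit on top of the business rules: a record can pass every field-level rule yet be flagged because a cross-field combination contradicts the established usage of the terms. A hedged sketch in Python (the record fields and the specific rule are illustrative inventions, not any real standard):

```python
def passes_business_rules(record: dict) -> bool:
    """Field-level validation only: each value lies in its allowed domain."""
    return (record.get("parent_role") in {"mother giving birth", "other parent"}
            and record.get("sex") in {"male", "female"})

def passes_data_grammar(record: dict) -> bool:
    """Cross-field 'grammar' check: flag combinations that contradict
    the established meaning of the terms, even when each field is valid."""
    if record["parent_role"] == "mother giving birth" and record["sex"] == "male":
        return False  # contradicts the historical usage of "mother giving birth"
    return True

record = {"parent_role": "mother giving birth", "sex": "male"}
print(passes_business_rules(record))  # True: the form's business rules are met
print(passes_data_grammar(record))    # False: fails the grammar-level test
```

Such a check would not delete the record; it would mark the datum as unusable for particular purposes, the way a grammatically broken sentence is excluded from a formal argument.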
Another example of a grammatical error of data is in the identification of weather forecasters. In a recent post, I described how the state funds weather forecasting with the justification that forecasters will provide accurate predictions so that city and emergency planners can prepare for bad weather, or avoid such preparations when they are unnecessary.
Seeing that there has been no accountability for recent bad forecasts (downplaying an event that turned out to be a big problem, and overplaying an event that turned out to be minor), I am inclined to describe the above definition of weather forecasters as a grammatical error. People employed to provide weather forecasts are not accountable for making bad forecasts. City and emergency planners do not rely on weather forecasting to prepare for weather events. The state still funds weather forecasting, but the justification must not be to aid in planning, because there is no accountability when the forecasts are wrong. In the mentioned examples, the city and emergency planners were held accountable for their actions or inaction, but the weather forecasters were protected from criticism or accountability.
There is another grammatical error lurking in the job category labeled weather forecaster. Weather forecasting has a long history in which, until recently, humans made their own predictions of future weather based on the raw data available to them. The historic weather forecaster would use measures such as temperature, season, cloud forms, and barometric pressure to make his own prediction of what would happen. In contrast, the modern weather forecaster relies heavily on simulation models to determine the forecast. A modern weather forecaster selects the model he likes best, or blends the results of multiple models, and presents that as the forecast with some expression of uncertainty because the models disagree. The grammatical error here is to equate the modern weather forecaster with the historic weather forecaster. The problem is that a data entry (specifically, a job position) persists over time to define a weather forecaster, but the modern job is completely unlike the old job. An old-time forecaster would make his own independent assessment and would accept personal accountability for his prediction by offering a defense for why it went wrong. A modern forecaster makes no independent assessment but instead picks a simulation run and then defers any criticism to imperfect computer algorithms working with incomplete data. It is a grammatical fallacy to equate these two professions by using the same label. It would be better to retire the job category of weather forecaster and replace it with a new term such as gamer of weather models.
There are many professions today using the historic names even though the modern job has fundamentally transformed. From a dedomenocracy perspective, it is dangerous to allow these labels to persist when they mean drastically different things at different times.
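One data-modeling remedy is to version the label instead of overloading it, in the style of a slowly changing dimension: each use of the label carries an effective-date range so analytics can tell the old meaning from the new. A minimal Python sketch (the dates and definitions are hypothetical illustrations, not actual records):

```python
from datetime import date

# Hypothetical versioned job-label dimension: the same surface label
# "weather forecaster" maps to different definitions over time.
label_versions = [
    {"label": "weather forecaster",
     "definition": "makes an independent prediction from raw observations",
     "valid_from": date(1900, 1, 1), "valid_to": date(1990, 1, 1)},
    {"label": "weather forecaster",
     "definition": "selects or blends simulation-model outputs",
     "valid_from": date(1990, 1, 1), "valid_to": date(9999, 12, 31)},
]

def definition_on(label: str, when: date) -> str:
    """Resolve what a label meant on a given date."""
    for v in label_versions:
        if v["label"] == label and v["valid_from"] <= when < v["valid_to"]:
            return v["definition"]
    raise KeyError(f"no definition of {label!r} on {when}")

print(definition_on("weather forecaster", date(1950, 6, 1)))
print(definition_on("weather forecaster", date(2015, 1, 1)))
```

With the meaning pinned to a date range, an analysis that mixes records from different eras can at least see that the label refers to two different jobs rather than silently treating them as one.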
Another example is airline pilots. There is over a century of usage of the word pilot to mean a person who has direct control over the aircraft he is flying. The aircraft itself is designed to be controllable by manual controls. In recent decades, aircraft designs have become more complicated to meet market needs such as better fuel economy. To make these designs easier to fly, the planes have on-board flight-augmentation computers that select the best configuration for the current conditions in order to maintain control of the aircraft while optimizing fuel usage and flight times. We still call the operators of these aircraft pilots because we are told that the human pilot is fully trained and capable of flying the aircraft manually without losing control. This is what was required for flying older aircraft. With newer aircraft, the pilot’s job is easier because a computer is controlling most of the flight. We continue to call these operators pilots because we are assured that they could turn off the computer at any time and fly the aircraft without losing control.
The recent tragedy concerning Air Asia flight QZ8501 crashing into the Java Sea gives evidence that this job description is incorrect. Recently released information states that the operators of the flight disconnected the flight computer and attempted to fly the aircraft manually but soon lost control of the aircraft. From the article:
Disconnecting the FAC removed a host of features including the critical cockpit speed warnings and protections. It also makes the A320 “harder to fly” according to an A320 Check and Training Captain.
At least for that crew and in those weather conditions, the plane proved not to be flyable manually. I suspect that no human pilot would have been able to maintain control in that condition. The plane needs the computer to make fast computations over a large number of parameters to retain control of the aircraft’s motion through the air. My guess is that the airplane was unflyable without the flight computer. If that is the case, then the operators of the plane are not the same kind of pilots as their predecessors. They may in fact be as skilled and as well trained as their predecessors, but they are in a craft that cannot be flown manually, especially in severe weather conditions at high altitudes.
The point of this example is to complain about the datum that uses the same label “pilot” for the cockpit occupants of a modern jet like the A320 that we use for the cockpit occupants of much older jets that are manually flyable though less efficient. The job changed because the airframe changed. In a modern jet, the occupants of the cockpit are operators of a flight computer that is necessary to keep the plane flying. These operators may know what to do if they had to fly manually, but they are unlikely to be successful, or at least they were unsuccessful in this instance.
I’m not complaining about the pilots, the airline management, or the plane’s manufacturer. As far as I’m concerned, there is no problem in having a craft that must be flown by computer. I am complaining about the datum of the label “pilot” for a craft that cannot be controlled manually without the flight computer. The data-grammar for pilot, based on historical usage, is that he will be able to control the plane manually without any computer assistance. We need a different label to describe the cockpit operators of a craft where the computer is essential for flight.
In the above discussions, I am describing a grammar of consistent usage of terms in the data. The inconsistent use of a term represents a kind of fallacy at the grammatical level for data. The identified parents of a child on a birth registration should consist of a biological male father and a biological female mother who gave birth to the child. Revising the business rules to permit more flexibility, allowing female fathers and males giving birth, is a grammatical error. Weather forecasters who merely select predictions from multiple simulation models are not the same as the weather forecasters of the past who made their own independent forecasts. Pilots of modern jets that cannot be flown without a flight computer are not the same as pilots of older jets where flight computers were clearly optional or not present at all. The modern usage of old terms in data is grammatically wrong when it contradicts the older usage.
Postscript: as I write this, the long-predicted precipitation has arrived locally, causing bad road conditions more consistent with the forecasts made five days ago than with the forecasts made just a few hours ago. The models got it wrong again. True to their modern profession, the forecasters are making excuses for the models instead of taking personal accountability for the forecast.
4 thoughts on “Grammatic fallacies in data”
I realized the moment I published this that the title has a spelling error (a form of grammatical error): using grammatic instead of grammatical. I resigned myself to keeping the error in place because it became part of the URL slug. This is not the first example of sloppiness on this site and it won’t be the last. I did attempt to proofread this post, but I tend to skip proofing the most important line of all: the title.
I’m trying to think of a way to rationalize what “grammatic” may mean in this context. I’m not aware of any practice of calling a grammatical error a fallacy so the phrase grammatical fallacy is itself an invention anyway. Calling it a grammatic fallacy draws attention to the invention. But I admit it was just sloppiness.