Lately there have been many articles enthusiastically promoting the positive possibilities of large investments in big data systems with associated analytics, predictions, and visualizations. Many of these articles discuss the labor shortage of data scientists. In some of the articles about what makes data scientists different from practitioners of other disciplines, I frequently encounter the analogy of the data scientist as a story teller. Data scientists base their stories on data, good data practices, and sound statistical methods, but ultimately the product is a story that entertains by surprising us with something we have never thought of before.
Big data builds a very compelling foundation for some conclusion. That conclusion may come from some statistical model, or a machine-learned decision, or a detailed visualization. However, the conclusion itself is presented by a human who will arrange the evidence into some form of narrative that presents a persuasive argument for a particular conclusion. This narrative follows professional standards, but the richness of presentation options and of the big data itself introduces opportunities for being creative in selecting the stronger images and downplaying the weaker ones. This opportunity for creativity, I think, is what is being acknowledged with the analogy of story telling.
Big data presents many opportunities for creative presentation. This is very apparent in data visualizations. Visualizations catch our eyes and compel us to spend time looking at their fine and often very beautiful and arresting details. The visualizations are even presented as works of art, at least in terms of becoming background patterns for promotional material.
For those who are more interested in numbers, the presentation of statistical results provides a similar opportunity to be aesthetically appealing.
The aesthetic component is the story telling. We base those tales on data that have passed our challenges for appropriateness. The conclusions have a foundation of professional approaches. The final presentation leverages the aesthetic qualities to tell a tale.
Story telling can be fun. Story-telling backed by data and statistical conclusions can be dangerous.
I tie this danger to earlier posts where I described the potential downsides of model-generated “dark data” and the danger of spurious conclusions. In the latter post, I proposed a story of my own that could explain the inverse relationship between the number of honey-producing bee colonies and juvenile arrests for marijuana possession. Others could probably come up with even better stories.
The only thing preventing a spurious correlation from becoming a hypothesis is a reasonable narrative that appeals to our understanding of the world. This is the dark side of the business. We appeal to the authority of big data to defend a narrative that ultimately rests on our prior understanding about how the world might work. A hypothesis discovered from data gets its value by being surprisingly new, but not excessively surprising. To gain acceptance of a hypothesis we rely heavily on the narrative that refers back to what we already believe about the world.
Story-telling confirms our misconceptions with more data.
An extreme and intentionally ridiculous analogy to this evidence-based story telling is famously illustrated by Rudyard Kipling’s Just So Stories. In these stories, there is some attempt at a plausible (to a child’s mind) explanation of certain facts. These stories are presented in a way that even a child will recognize that the story-teller is being intentionally ridiculous. This makes the stories harmlessly entertaining.
In contrast, the professional data scientist’s story telling has the motivation of persuading decision makers to follow the data-supported recommendations. The presentation leverages real data, industry-accepted analytic approaches, and vivid visualizations to present a persuasive argument in the form of a narrative that includes some appeal to the audience’s understanding of what is realistic.
At least some professional data scientists are describing this final presentation as a creative effort analogous to story-telling. In general there is abundant creativity in the final presentation.
The visual presentations are often highly stylized, with multimedia and artistically appealing rendered graphics. The style part of the presentation includes non-data, non-scientific elements such as animations, background music or professional narrators, and careful selection of complementary colors and high-resolution background images.
An example of a non-data stylistic addition is the high-resolution satellite imagery as a background for a map that presents some geographic data (such as income) that has little to do with the terrain and the dominant tree life in that same area. The satellite imagery supplies the creative narrative that turns otherwise simple income data into a more interesting story. Even though the speaker doesn’t mention it, we can come to our own conclusions about the fact that the proximity of certain features may influence income opportunities. Some may be valid (such as proximity to unpleasant areas) and some may not be valid (such as proximity to pine instead of deciduous forests). The imagery transforms the data into a story.
Even more subtle is the choice of what data to use for the presentation. One of the defining characteristics of big data is its variety: there is an abundant number of different types of data. An analysis can consider dozens of variables to show a relationship, or to defend or test that relationship, and yet still leave hundreds of variables untouched. With predictive analytics, we hear of attempts to find patterns with hundreds or even thousands of dimensions that promise highly individualized conclusions. An example is the promise of individualized medical care based on a particular person’s genome: a particular treatment may be adjusted based on a large number of genes.
I raised this problem in the discussion about spurious correlations. With enough variables, there is near certainty that there will be many that can show strong correlations or clusters. Some of those results can have a story-telling angle that appeals to our preconceptions of how the world works. As I discussed in my post on accessory data, we can easily obtain solid observations of a person’s choice of fashion to wear to a routine doctor’s visit, and this data may correlate with some medical outcome. That fact may become one of hundreds of variables in a determination of an individualized treatment plan. We may never know that this fact is used in the predictive analytics, or know whether the prediction would change if we removed it from consideration. A more realistic example of accessory data is the body mass index (BMI), which is easily measured but controversial in terms of relevance despite its appearance in strong correlations.
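The claim that enough variables nearly guarantee strong correlations can be illustrated with a small simulation. This is only a sketch under assumptions of my own choosing: every variable is independent random noise, there are 20 observations per variable, and "strong" means an absolute Pearson correlation above 0.5. The names `pearson` and `count_spurious` are hypothetical helpers made up for this illustration, not part of any analysis described in this post.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def count_spurious(n_vars, n_obs=20, threshold=0.5, seed=0):
    """Count pure-noise variables that correlate strongly with a noise outcome.

    Assumption: the outcome and every candidate variable are independent
    Gaussian noise, so any correlation that is found is spurious by design.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_vars):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(pearson(candidate, outcome)) > threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    # More candidate variables means more "discoveries" to build a story on.
    for n_vars in (10, 100, 1000):
        hits = count_spurious(n_vars)
        print(n_vars, "variables ->", hits, "strong correlations")
```

With only a handful of variables the count may well be zero, but as the number of candidates grows, some noise variables will clear the threshold by chance alone. Each of those is a candidate for a plausible-sounding story.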
The BMI controversies are a very good example of story telling. Initially, the concept was easily recognized as a good measure of body fat because we can imagine the bulk required for a person of a certain height to have a certain weight. It was only later that it became common knowledge that muscle is denser than fat, so that athletes can have overweight BMI values. Even with that caveat, a glance can determine whether a person is athletic, and for a non-athlete the high BMI value is most likely from fat. The second story telling aspect is that we readily accept that fatness can lead to certain kinds of adverse health effects, especially those related to the heart. We are eager to use BMI because it is easy to measure and it is measured consistently by everyone. We are also eager to accept its consequences for health because of our preconceptions of the potential downsides of fat. Lately, there has been a more subtle understanding of fat, where sometimes additional fat can improve certain outcomes or where the impact of fat depends on where the fat is stored. However, the general consensus is that in terms of all possible forms of ailments, a low BMI is better than a higher one. That consensus view has the backing of both statistics and story telling. The statistics alone are debatable, depending on how benefits are weighted against hazards. The story telling of the intuitive notion of the harms of fat provides the needed extra push to give the benefit of the doubt to the conclusion that high BMI is in general unhealthy.
For this post, I want to point out the human tendency to enjoy the practice of creative story telling and to find the well-told story entertaining. Many qualities once said to distinguish humans from other animals (such as tool use, domesticating animals, etc.) have been discredited by non-human examples. The one uniquely human quality that seems so far to be unchallenged is our ability to tell stories. Humans are the story-telling animal. We love to tell stories and we love to listen to them.
In the past, story tellers found inspiration from natural evidence. For example, Homer’s epics may have made some references to certain names because those were prominent family names at the time of the telling rather than the time of the events. We have a rich tradition of describing animal actors with human-like speech and behaviors. Often these stories are very believable. In Aesop’s fables, for example, we can at least briefly imagine insects like the ant or the grasshopper behaving like different kinds of people.
Today, big data systems provide a wealth of new material to inspire new and very exciting stories. Until recently, big data was obscure and led by professionals who were focused on their reputations in their field and were not necessarily good story tellers. Now big data is in the mainstream consciousness where many people are drawn to the topic because of its popularity. Lots of people claim to be not only specialists in big data but leading evangelists of the concept.
Big data is clearly an inspiration for many branding stories that set apart certain individuals as well as companies. Many of the individuals promoting big data are invited to publish or to speak to present their vision of the future of big data. That vision clearly is story telling. The fact that they are invited to return to speak or publish again is evidence that their stories are entertaining.
The very nature of personal or corporate branding involves story telling. The story provides the attraction to recognize a particular brand. Using story telling for branding is harmless even if it involves big data concepts. The concern is that this branding carries an implication of authority. Popular brands based on successful story telling are enjoying a reputation of being an authority. When they present professional results based on actual practice of using data, they will benefit from this brand recognition of being an authority. Increasingly, the person making the presentation of big data analytics is a successful story teller.
The danger is that we may not be able to distinguish the objective facts from the embellished story telling when those embellishments are drawn artfully from data. I described earlier the presentation of a high-resolution full color summer satellite image as a background to mapped data. Satellite image data is authentic data, but its inclusion in the report was for artful story telling purposes. The satellite imagery has no relevance to the mapped data, or at least the mapped data conclusions may not have considered any information apparent in the satellite images. The image completes the story and makes it entertaining. Like any good story, it invites the audience to exercise their imagination. There really could be an influence of the type of nearby forests on the income of the residents. The audience is free to take the story details as part of the facts even though the basic professional data did not include any such consideration.
As we push data closer to the individual level, I worry more about story-telling data scientists. We generally recognize that there is a wide diversity of healthy human behaviors. However, many of these are rare and few of them are very widely visible.
We have always dealt with this problem when it comes to politics, where a very capable policy thinker and negotiator may be eliminated from consideration because his demeanor is not like that of most people. An excellent example in my mind is the “have a beer with him” test. A politician who presents himself as someone many people would like to drink with will be more successful than one who does not seem as fun to be with.
Certainly politics for elective offices depends heavily on branding and that branding depends heavily on story telling. For better or for worse, we decide elections based on the best stories rather than the best capabilities. That is the way democracies work, and this method appears to work by keeping people happy with their elected officials.
But with the increasing intrusion of big data into personal lives, we are entering a world where everyone will become a kind of politician. Already there have been stories of medical outcomes that were decided largely on the popularity of the story told about the patient. The challenge of government in general and health care in particular is how best to allocate limited resources. The promise of big data is to provide objectivity to allocate resources with optimal effectiveness. The downside of big data is the opportunity for story tellers to pick and choose the data that supports a good story. Big data celebrities will build their brand on their ability to artfully present data that produces a popular story.
The risk is that big data winners will be the story tellers rather than the purposes we seek to improve with big data.