My taxonomy of types of data (as summarized in this post) leaves out any mention of dirty data. The different types of data have different levels of trustworthiness: bright data is highly trusted, while model-generated data is a less trusted witness of reality. My descriptions of dim data or model-forbidden data are not descriptions of dirty data. Those are simply less reliable forms of data.
In a later post, I described an analogy of dealing with dirty data. In particular, I described my preference to accept dirty data into the data store.
I welcome dirty data even though I recognize that it raises more challenges for those who analyze the data. Most of the easy analytics, visualization, and machine-learning techniques are easy because they assume the data is mostly clean. These technologies will not be so successful with dirty data.
Dirty data is data that we don’t like, yet it is often fairly bright data. I welcome dirty data because in many cases the dirt is where we can discover new truths about reality, in particular concerning questions we haven’t yet asked.
In my last post, I described the rape details in the recent Rolling Stone article as being fake, at least in terms of not being thoroughly investigated before publication. These details may qualify as dirty data, and some have advocated that the story be retracted because it was not properly vetted. I do not have a problem with this data persisting in the data store. It is bright data in the sense of being one person’s account of some event (real or imagined). Even if the story is proven false, the story itself can be useful to illustrate the kinds of stories that are fueling the sense that modern campuses are unsafe, particularly for women students. The data offers some potential to learn something new.
My taxonomy of data did not include dirty data. Dirty data is data we don’t want in our data store. To me, it is not justified to exclude data we don’t like. On the contrary, we may learn something new from it even though its presence makes the project of learning far more difficult.
Another description for dirty data is data that is unfit for publication. This allows us to possess the data in private data stores so that it is available for analysis. We draw the line at releasing this data. This concept of dirty data kept private becomes more problematic as we enter the dedodemocratic age that will permit citizens to have direct access to all data. People will be able to observe this data just as they would if we had published it. My preference is to learn to recognize and tolerate dirty data, and to accept that it will exist in the data store, perhaps even in great abundance.
We can flag the data as dirty either explicitly with an “is dirty” column or implicitly with analytics that demonstrate the unacceptability of this data point for certain topics. I prefer the latter approach. Designating data as dirty depends on the context. The data may be quite clean in some circumstances and dirty in others. Again, I return to the Rolling Stone example of the depiction of the rape incident. This information is dirty in the sense that it should not be used to accuse individuals or collectives of crimes. This same information may be quite clean and acceptable to use to illustrate the kinds of stories being communicated that spread the fear that the campus is an unsafe environment for women.
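The implicit approach can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `Story` record, its `verified` field, and the two purpose names (`"accuse"` and `"illustrate_stories"`) are hypothetical names invented for the example, not part of any real data store. The point is that no “is dirty” column is stored; dirtiness is computed from the record and the purpose of the analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Story:
    """A stored account. All field names are hypothetical, for illustration."""
    source: str      # who told or published the story
    text: str        # the account exactly as it was told
    verified: bool   # whether the details were independently confirmed


def is_dirty(story: Story, purpose: str) -> bool:
    """Decide dirtiness from context instead of reading a stored flag."""
    if purpose == "accuse":
        # Unverified accounts must not be used to accuse anyone of a crime.
        return not story.verified
    if purpose == "illustrate_stories":
        # Any faithfully recorded account shows what stories are being told.
        return False
    raise ValueError(f"unknown purpose: {purpose}")


def usable(stories: list[Story], purpose: str) -> list[Story]:
    """Keep every record in the store; filter only at analysis time."""
    return [s for s in stories if not is_dirty(s, purpose)]


stories = [
    Story("magazine article", "an unverified account", verified=False),
    Story("court record", "a confirmed account", verified=True),
]

for_accusation = usable(stories, "accuse")
for_culture_study = usable(stories, "illustrate_stories")
```

The same unverified account is excluded from `for_accusation` but included in `for_culture_study`, and nothing is ever deleted from `stories`. An explicit “is dirty” column could not express this, because it would have to pick one answer for every context.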
Dirty data (data we don’t like) is useful because it offers an opportunity to learn something new about reality. The challenge for big data projects is to obtain sufficient quantities and diversity of dirty data so that we can begin to learn something unexpected from it. The difficulty is that current established data practices place an emphasis on data governance that results in extensive data cleansing. The result is that dirty data is rare if it is present at all. These practices are as valued in journalism as they are in formal database or data warehouse practices. Despite these noble goals, we admit that occasionally some dirty data will be included.
Cleansing is never perfect. We can never be sure that our data is completely free of data we don’t like. For this reason, it would be better to accept the fact that dirt will be present in data. In particular, we need processes and practices that tolerate dirty data. The best way to develop these dirt-tolerant practices is to have an abundance of dirt present in the data. We need more dirt in the data.
As I mentioned above, dirt is data that is not suitable for publication. If we cannot defend some observation, then we should not publish it in a work of non-fiction. We may retain that observation but only if we confine it to our private data stores.
I assert there is a benefit to having dirt in our private data store. Abundant dirt permits development of processes and practices that tolerate dirty data and this will make it less likely that these processes and practices will embarrass us.
More importantly dirt offers the opportunity to learn something new and unexpected about reality.
I am reminded of an analogy that involves dirt literally. In the beginning of archaeology there was an eagerness to collect ancient artifacts. Often this involved aggressive digging until something man-made was encountered, and then that object was hastily removed and immediately cleaned so that it could be presented in some collection. Later we learned how much damage this did: it destroyed abundant information about the relative locations of the buried artifacts and in particular the microscopic evidence embedded in the surrounding dirt. For example, the dirt may yield evidence such as plant pollen to indicate the season when the article was discarded or even the time period when that happened. Today, the dirt is part of the recording in archaeological digs. Frequently the dirt is uninteresting but it remains available for interpretation. Occasionally, we learn something unexpected. Modern archaeology is far more demanding than the earlier practices because modern archaeology respects dirt.
The analogy isn’t perfect with data except that we use the same word. Dirty data is data that somehow offends our sensibility. We need to respect this data even though we find it offensive. It can become valuable in the future.
Return to the Rolling Stone article describing the rape incident. Even if the story were completely fabricated, as long as we have confirmation that it was told by an authentic campus student, I argue that the story is valuable as a sample of the stories that support the concept of a rape culture on campus. This account can present valuable information about the campus culture.
I would assign the term gossip to stories like this. Generally, gossip does not qualify as serious journalism and this is part of the reason for the controversy surrounding Rolling Stone’s decision to publish the article. However, gossip can provide value in terms of informing us about a culture or sub-culture. One way to understand a culture is to understand the stories its members tell each other. We should encourage the collection of these stories, including gossip, in order to populate data stores to support a broader analysis of the culture.
Currently, many data stores rely heavily on bots that scrape information off of published articles. These bots can also capture gossipy information from social media, but even today only a small number of people participate in sharing gossipy information. There are even fewer of these sharers who are willing to identify themselves.
As I have mentioned in the past, journalism has skill in collecting stories from reluctant sources. Journalism skills would be helpful to obtain the gossipy information from those who do not wish to gossip. Presently, the journalist is discouraged from collecting gossip for its own sake because this gossip is not suitable for publication.
Gossip is suitable for data stores that include gossip from other social media sites, but there is no economic incentive to have journalists collect stories from reluctant story-tellers. We need journalist skills to obtain this information, but we need a different market for them to provide this information without the need for professional publication.
The Rolling Stone account of rape is an example of bright dirty data. Despite many criticisms of the magazine for publishing the story, the story itself is useful for a data store of stories told in campus cultures. In particular, this data is especially helpful for understanding the perception of the problem of unpunished rape on campus. Stories like this one can help answer whether the problem of rape is exaggerated, rumored, or supported by firm evidence of unpunished rape. Even for unpunished rape, the stories can tell us what actions are perceived as occurring without punishment. A more exhaustive collection of similar gossip stories can help answer these kinds of questions. A large number of unverified gossip stories about rape can suggest patterns either of exaggeration or of negligent prosecution.
Similarly, the recent news of excessive and undisciplined use of force by police can benefit from a collection of anecdotes from reluctant gossipers. For example, the recent police shooting of an unarmed man in Ferguson, Mo. has generated a number of rumored stories that lack conclusive proof. We learn of certain stories such as the “hands up don’t shoot” story through popular repetition in social media. There may be other similarly controversial stories that we are not learning. As with rape, tellers of these alternative rumors may be reluctant to publish their stories. Perhaps this reluctance comes from their own involvement in the story, or from their involvement being ambiguous. Alternatively, the reluctant story tellers are simply reluctant to tell stories to anyone other than trusted friends. Journalism skills, in particular the skills to build up trust as a friendly ear in an interview, can be very valuable to obtain these closely held stories of gossip. We can use this data to learn more about a particular culture.
I am advocating for a form of journalism that collects gossip. I describe this as first-person story journalism. When repeated often enough, we can obtain real value through analysis of the multiple stories.
I distinguish my point from others such as this article by Victor Davis Hanson that criticizes the publication of lies as journalism. Faithfully chronicling fables is useful, especially if it encourages more documenting of the fables shared within a culture. We can learn a lot from the patterns in the gossip, and the consistencies may tell us something new about what is going on in the culture. Perhaps he is correct that such unchecked stories should not be distributed through respected journalistic publications.
This expectation of thoroughly checked journalism may be rapidly becoming obsolete with the modern technologies of cheap publication available through social media. Similar stories are being distributed in social media, but these are biased by the minority of story tellers who are eager to share these stories. The Rolling Stone article demonstrates a value of journalism to capture otherwise unpublished stories by reluctant story tellers.
There are many rumors and pieces of gossip that may be shared among small groups of people where no one in the group is active in publishing these stories on social media. I assert these stories are valuable to augment the stories more readily shared in social media. The mechanism we have to obtain these stories is published journalism through subscriber- or advertiser-financed periodicals.
Accepting the recording of these fables or gossips does not preclude investigations to verify the details within the stories. For example, this article defends from criticism the journalistic efforts to confirm the details of a reported story.
An error or an unacknowledged falsification doesn’t categorically, automatically invalidate everything else a person is saying. But it does shed some light on the degree of trust we should place in that person.
I agree that this follow-up journalism is valuable and welcome. The additional fact checking can clarify what parts of the story may be based on facts and what parts may be based on exaggerations or fabrications. I want to see both. There is value in faithfully recording the telling of a story exactly as it is told within a culture. There is value in following up on those stories to determine what facts can be verified or invalidated. I think there is even more value if we see these published separately instead of having only one clean well-researched account.
A goal of collecting stories for big data analytics is to attempt to capture the full range of stories shared within a culture. This will enable us to discover patterns that can be tested for whether, as a whole, the truth is on the side of a failed justice system or on the side of human imagination to develop a group identity.
This is a higher level of truth to evaluate the veracity of broad claims such as there being a pernicious rape culture on campus or an out-of-control culture of police malpractice. Of course, these same data stores should welcome follow-up stories that verify or invalidate details of individual stories. Such verification or invalidation can help to determine whether the broader claims are justified. Even if all of the claims are false, there remains much to learn about the culture by paying attention to the fables and gossips they share within their groups. This information can help us understand why the culture is different from other cultures. It can help us measure the size and influence of these cultures. We can make positive discoveries about cultures based on even the lies they tell.
We need a new market for journalism to collect data for data stores in such a way as to bypass the cleansing mechanism of publication. We can learn from a large repository of dirty data involving unsubstantiated rumors or gossip that is not suitable for publication.