In big data, the discoveries are little things that catch our attention. This is much like scanning the daily news and spotting a story that catches your attention. It is a little extreme to compare big data analysis with reading the morning newspaper, but there are parallels.
The amount of fresh content available daily as news or commentary is staggering. Talking about checking the Internet for news at least suggests an analogy with the daily newspaper of a half-century ago, where it was possible to scan every page and pick the one or two articles worth the investment to read (for those who didn’t read the entire publication from front to back). It is impractical to approach the Internet that way. Instead I use news aggregator sites that over time have built a history of spotting stories that interest me. Then I pick a story, follow the links within it, or look for related stories. I navigate around from the entry point to gather up my own story, made possible by the availability of a huge amount of data in the form of written narratives or talking-head videos (which regrettably are replacing searchable narratives).
This vision of navigation is similar to how I explore regular data sets, searching for information that validates the top-level reports. Often the top-level reports are presented so compellingly that they invite you to just stop there and admire the result. This is not unlike a news aggregator site that suggests that just seeing the front page could be a fulfilling experience.
An example of this is a recent story about several smart cars being flipped over in their parking spots. That caught my attention because I’m a smart car owner, I would be very upset if that were to happen to my car, and I’ve wondered whether the car would be easy to flip over because most of the weight of the engine sits over the rear wheels, which are set far to the end of the car. Sure enough, one of the photos showed the car standing on its rear edge with the resulting damage obvious.
Lesson one of this little example is the question of why I was drawn to this particular story rather than any other one. I was biased. This story affects me directly. Often when we decide to investigate some pattern it is because we recognize it in our personal lives. The data is personal in some way.
A more subtle form of this bias is having some personal experience with the topic. This is far more common. Perhaps the person has only seen the cars or was aware of some discussions about the cars. The familiarity alone is a bias to focus on one particular pattern rather than any others.
The familiarity bias is not necessarily a bad thing. The familiarity can guide where to investigate next. But it can be a bad thing when we allow ourselves to be blind to the investigation of patterns that don’t suggest any familiarity.
The opposite of the familiarity bias is the exploration bias: seeking out only the least explicable patterns. These efforts may appear handicapped by the lack of familiarity. However, a good explorer is one who identifies clues that can be searched for prior knowledge and in effect learns as he proceeds. The explorer is not limited to exploring a data set; he also uses available knowledge outside of that data set. Compared with the familiarity-biased, the explorer-biased is more likely to be a discoverer of new hypotheses.
Returning to my familiarity-biased investigation of the tipped smart cars, I became interested in finding out more about what explains this interest in tipping over cars. From what I read, this is completely missing data. Most of the time the cars are just discovered turned over.
One report mentions a sighting of a group of young people wandering in the vicinity at the time, but this report does not include an eyewitness of the actual event. It seems most likely that more than one person would be required to flip even a car this small. The conclusion: the group of young people might have done this in at least one case. In addition, we live in an era that cherishes the creation of a new “viral” trend that produces highly shared photos and copycat acts. It is not hard to conclude that all of the tipping is part of a new viral trend.
This is what I call dark data. We used a model to fill in the obvious blank of not knowing who the culprits are and what their motives are. We supplied the culprits and motives with an explanation that is easy to believe.
Recall that I use the term dark data to distinguish it from bright data. Bright data is a well-documented, unambiguous observation. A video of the actual tipping in each event would be bright data. Dark data is not necessarily bad; it just needs to be treated with more caution.
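As a rough sketch of this distinction, the claims in the tipping story can be tagged by whether they were actually observed or filled in by a model. The `Datum` record and the `split_by_brightness` helper are hypothetical names I am making up for illustration, and the entries simply restate the claims discussed above; this is one possible way to keep the two kinds of data apart, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Datum:
    """A single claim, tagged by how we came to hold it (hypothetical record)."""
    claim: str
    observed: bool  # True = bright (documented observation), False = dark (supplied by a model)


def split_by_brightness(data):
    """Separate documented observations from model-supplied fill-ins."""
    bright = [d for d in data if d.observed]
    dark = [d for d in data if not d.observed]
    return bright, dark


# Claims restated from the tipping story; the tagging is my own reading of the reports.
reports = [
    Datum("cars were found overturned", observed=True),
    Datum("a group of young people was seen nearby in one case", observed=True),
    Datum("that group flipped the car", observed=False),
    Datum("all of the tippings are part of a viral trend", observed=False),
]

bright, dark = split_by_brightness(reports)

for d in dark:
    # Dark data is not discarded, only flagged as needing more caution.
    print("treat with caution:", d.claim)
```

Keeping the tag on each claim means a later step can still use the dark entries while knowing they were never directly observed.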
Dark data can be a hazard if one takes action based on that data. Someone may misconstrue a local argument as a global trend and so start a viral trend rather than extend one, since none had really started yet. Alternatively, someone like me may start to worry that my car may be next. Both reactions are projected not from actual observations but from an invented piece of information that seems plausible. In both cases, the observed data does not justify the actions.
Another curious detail was one report that mentioned a liberal political sticker on one of the tipped cars. This detail is missing for the other cars, but it reinforces the image of the car appealing primarily to liberals. It is easy to propose that the culprits may be driven by political disagreements or some such. The association of the car with an ideology is also dark data, but it is something that could be researched.
When the car first came out, I was surprised that most of the buyers were like me: middle-aged or older. It should not have amazed me that my own age group would share my preferences, but I assumed it would appeal more to the youngest group (who at least initially did not find the car interesting). The other thing is that there is no real pattern in terms of the politics of the owners. The real pattern I saw was that people were downsizing. Some chose the car as a statement that they’ve moved on from being the taxi/hauling service for their children. Others just found the car to make a great second car for individual errands or commutes. I rarely heard of anyone equating it with a political statement. For my part, I described my motivation for getting a car that fits my current needs as being the same as my motivation to get a suit that fits my current body. I also saw the car as making a very conservative statement: inexpensive and meeting precisely my needs and no more. I don’t think I’m alone in that view.
Anyhow, just as there is an eagerness to see the start of a new viral craze that will eventually be reported in every city, there is an equal eagerness to see one side or the other of the political divide escalate to physical attacks such as vandalism against their rivals.
Reading comments on the reports reveals that, at least among those who like commenting, there are strong feelings about the political implications of smart cars and their tippings. Luckily I have never encountered such people in person. Reading the comments, it is easy to conclude that the phenomenon of the cars or their tippings is the harbinger of a coming civil war.
The Internet has commercialized into a model that compensates individual content based on page views. Coincidentally, adding comments to articles drives up the page view and unique visitor counts far more than the original article. The page is sticky for an individual commentator who refreshes to see responses to his comment. The page attracts others just to watch the comments unfold. Like magic, adding a comments section greatly increases page views and unique visitors.
In terms of content, the comments are data just like the article is data. Both contain a mix of bright and dark data. Both have self-selection bias. The original reporter or opinion writer may answer to an editor or to his reputation for research but for the most part has a lot of latitude about what to report. In contrast, the comments are highly autobiographical data points from people who self-select to contribute to discussions. Often the comments present a very distorted view of the world. The opinions expressed are either fantasies or are very well suppressed in social settings. All that said, reading online comments about smart cars does make me more alert to the driving intentions of neighboring cars on the road.
I think the comments have a useful purpose to remind us that the narrative is far from complete. Because the comments are more incomplete pictures than the original article, we are reminded that the article itself may also be incomplete.
The final lesson is that data is incomplete. We don’t have all the observations we want and certainly not enough to solely justify a conclusion. This should be very obvious with news stories and especially commentary. Often it seems some people are satisfied that the news is sufficient to reinforce their opinions or to encourage them to take some action.
This lesson is far less obvious with big data projects. The amount of data in big data projects encourages us to conclude the data is complete, or at least that any missing information wouldn’t matter. The analyst front-end of such projects presents compelling, publication-quality reports. The entire promise of high-productivity analysis invites us to just use the results as they are presented to us. Most often, we accept that invitation.
Big data is a lot like the Internet itself. Both present a wealth of opportunities to explore and challenge data. Both have finished results that eagerly encourage us to “share” them with others. But we take a healthier, more cautious approach to data found on the Internet than to data from big data solutions.
We know the Internet includes sites that deliberately try to deceive or at least manipulate us. The mechanical process of collecting big data seems to rule out this deceptive intent. But the potential for deception is still there in the form of the assumptions and models that select what data to include in the data set.