As people consider their continued investment in big data and its associated technologies, they are probably interested in knowing what the downsides might be. Can the project fail? How bad can a failed project be?
Promoters of big data and associated technologies frequently publicize news of an organization realizing a major benefit from its investment in a big data solution. The release describes the challenges the organization faced with its data and the outcomes that resulted.
It is not surprising that we don’t see any reports of a big data project failing to achieve anything beneficial. It is unlikely that someone will want to draw attention to a failure. It is also likely that participants in a struggling project are merely disappointed that success is taking longer than they had hoped, while still believing that success is inevitable in the long run. It is easier to recognize and promote an early success than it is to determine that a project is never going to succeed. Perhaps the field is young enough that there haven’t been any abandoned projects.
As I mentioned in an earlier post, a failed project may have to be especially disastrous to become newsworthy. By the time that happens, there will be many others in a similar situation, so the lesson will come too late. The newsworthy failure may involve a project that used well accepted predictive analytics on well accepted data. The failure occurred despite doing everything right, although by then we will probably find some flaw in hindsight and accuse the victim of fraud or negligence.
So far there have been no comparable reports of failures that resulted from big data projects to balance the many reports of successes. This gives the impression that big data, and more generally modern data science, cannot fail.
However, there are frequent articles about the importance of investing in data quality and adhering to good data governance. If a failure does occur, it is likely to be blamed on bad data or bad practices. The above hypothesized failure is one that has good data and follows good practices.
The question is whether big data, analytics, and visualization can fail. If the project can fail, how bad will the consequences be?
So far it seems that big data cannot fail. Every success is linked to its use of data. Every failure has some other explanation unrelated to data. Data science begins to look like a faith that cannot be falsified.
A decision to continue investing in big data involves a decision on the validity of the claim that the investment will be more likely to result in benefits than failures. This decision appears easy because good data will never let us down, especially when we use good algorithms and visualizations.
What experiment could show that data science can fail even with good practices and good data?
The faith in the inevitable benefits of big data seems a lot like a religion or at least an appeal to a higher power. Perhaps the claim is similar to basic axioms like those in geometry. The benefits of big data are self evident and beyond challenge. Big data will not let us down. If something goes wrong, it is because we failed to obey the data or its rules. In hindsight, we will find some clue that human fallibility is the cause of a failure, not data science fallibility.
I think this is unfair because the data often is about humans and the questions have some degree of sociological or psychological theory. Humans have to interpret data of other humans with human explanations. We draw the line and say that the good book of big data will never be wrong but humans will be fallible.
The benefits of big data are taken to be self evidently true. This makes the decision to invest in big data easy because there is no alternative. The question for the decision maker is how to invest in humans to assure they will never make mistakes. Ultimately, any failure of a big data system will be a failure of humans, not the consequence of the concepts of big data itself.
Much of the big data we are facing is in some way traceable to data about people and their activities. A better term for big data might be crowd data. Because humans are fallible, the data about humans is subject to manipulation by some people who don’t play nice.
In the previous post, I discussed observations about an article that presents numerous examples of using geographic metadata from people’s use of applications to build maps of where people go or how they get there. I described a hypothetical scenario where the participants deliberately change their behavior after seeing the visualization, with some possibility of mischievous plans.
I want to revisit the discussion on the mapping of people’s jogging routes in cities through the use of data from apps specifically marketed for tracking jogging or other exercise activities. In this case, the analyst may assume that anyone using this app is using it for its intended purpose. The geographic metadata represents a deliberately planned activity of some form of exercise. There may be a need for data cleaning to remove the data from someone who forgets to shut down the app after the exercise is completed. But we at least assume that the deliberate use of the application represents the intention to do some exercise.
The visualization of exercise routes for Washington DC clearly outlines the more popular routes where the streets are safe and convenient for running or where there are extensive trails through parks. These show up as bright and thick lines due to the overlapping observations.
However, how should we interpret the very thin lines that seem to go off into unusual spaces, possibly along routes that are less suited for exercise? The displayed chart may be of raw data. As I mentioned above, there is a possibility that someone forgot to turn off the app and we’re instead tracking his non-exercising commute.
Data cleaning is part of the good practice of data science. We can attempt to clean this exercise data by applying models. In earlier posts I described the cleaning process as producing forbidden data: data to be rejected because it does not fit our expectations. We do not want to eliminate the possibility that there may be some unusual secret exercise routes, so mere uniqueness may not be a good criterion. We will instead seek other explanations that will perform the same task of removing these outliers.
A possible cleaning algorithm may reject apparent speeds that exceed what is expected from running, jogging, or bicycling (or at least separate these into different categories of typical speeds). Another may be to eliminate a subsequent unusual route if the same individual followed a more common route earlier (indicating he merely forgot to shut down the app).
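To make these two rules concrete, here is a minimal sketch of what such a cleaning step might look like. Everything in it is an assumption for illustration: the `Trace` record, the speed bands in `SPEED_BANDS`, the route names, and the idea that a list of "common" routes is already known. It is not how any particular app or visualization actually cleans its data.

```python
from dataclasses import dataclass

# Hypothetical speed bands in km/h; real thresholds would need tuning.
SPEED_BANDS = [("jog", 0.0, 9.0), ("run", 9.0, 20.0), ("bike", 20.0, 35.0)]


@dataclass
class Trace:
    user: str
    route: str
    distance_km: float
    duration_h: float  # assumed to be greater than zero
    start_h: float     # start time, hours since midnight


def speed_category(trace):
    """Return the activity band for a trace, or None if it is too fast."""
    speed = trace.distance_km / trace.duration_h
    for name, low, high in SPEED_BANDS:
        if low <= speed < high:
            return name
    return None


def clean(traces, common_routes):
    """Apply both rules and return the traces that survive."""
    kept = []
    seen_common = set()  # users who already logged a common route
    for t in sorted(traces, key=lambda t: t.start_h):
        if speed_category(t) is None:
            continue  # rule 1: faster than any expected exercise
        if t.route in common_routes:
            seen_common.add(t.user)
        elif t.user in seen_common:
            continue  # rule 2: unusual route after a common one
        kept.append(t)
    return kept


traces = [
    Trace("a", "mall_loop", 10.0, 1.0, 7.0),     # ordinary run
    Trace("a", "side_street", 5.0, 0.5, 8.5),    # same user, later, unusual
    Trace("b", "busy_avenue", 20.0, 0.5, 8.0),   # 40 km/h, above every band
    Trace("c", "hidden_trail", 8.0, 1.0, 9.0),   # unusual but plausible
]
kept_routes = [t.route for t in clean(traces, {"mall_loop"})]
```

Note that the sketch keeps the unusual but plausible `hidden_trail` while silently dropping anything faster than the top band, whatever the reason for that speed.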
Clean big data is good data practice. Something like the above algorithms could be used to clean up the data visualization to more clearly identify the real exercise routes. With the extraneous dirty data removed, we can turn the data over for our trusted analytics solutions.
Perhaps one of these analytics projects is to allocate policing resources to check that exercisers are obeying the laws. For example, we can use this data to isolate bikers and then seek out violators of some law that requires the use of helmets. We would police the known biking routes based on the clean data. The city may have a goal of 100% compliance with this law, and eventually the reports come in that this goal was met for a particular month. That same month records a medical report of a biking accident that occurred when the biker was not wearing a helmet.
The biker was following one of the routes that were consistently eliminated because he managed to be an unusually fast biker who could keep up with car traffic on a busy street that few people would ever consider for biking.
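The gap between the reported and the actual compliance can be shown with a few lines of arithmetic. This is only a sketch of the imaginary scenario: the rider records, the route names, and the `compliance_rate` helper are all made up for illustration.

```python
def compliance_rate(riders, routes=None):
    """Share of riders wearing helmets, optionally limited to given routes."""
    if routes is not None:
        riders = [r for r in riders if r["route"] in routes]
    if not riders:
        return None  # nothing observed, so no rate can be reported
    return sum(r["helmet"] for r in riders) / len(riders)


# Hypothetical month of riders; the last one uses the route that
# the cleaning step consistently removed from the visualization.
riders = [
    {"route": "canal_path", "helmet": True},
    {"route": "canal_path", "helmet": True},
    {"route": "park_loop", "helmet": True},
    {"route": "busy_avenue", "helmet": False},
]
monitored = {"canal_path", "park_loop"}  # routes that survived cleaning

reported = compliance_rate(riders, monitored)  # what the city sees
actual = compliance_rate(riders)               # what really happened
```

Under these made-up numbers the city reports full compliance on the monitored routes while one rider in four is actually riding without a helmet.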
This example is silly and unrealistic. No one is expecting that kind of policing and compliance. I’m trying to make the point that good data and good data practices can produce a bad recommendation. In this case, the analysis failed to recommend policing one route for helmet usage compliance. I argue that the fault lies in big data, and in particular in our conscious decision to base a policy around some automated visualization of big data. In my imaginary scenario, I assume we would leverage the more efficient bike-helmet enforcement based on big data to save money by reducing the job duties of other departments such as traffic patrols on busy streets. The win for big data adoption is to save or make money. We will rely on data to replace earlier methods and this will result in benefits. In this case, the cost savings resulted in a less effective approach.
It is unlikely that we will attribute this failure to the big data project itself. We will seek a human actor to place the blame for this failure.
We may trace this fault to a human error: the data scientist was too aggressive in cleaning the data and removed this particular biker and his unusual route. The failure is that the analyst was not skilled enough. This is a big reason for the high demand for advanced analytic skills. Because the big data concept is infallible, any failure must be due to human error. We need analysts who will never make mistakes.
Alternatively, we will blame the biker who produced the traced data for deliberately avoiding routes that were monitored. He may have observed the publicized visualization of common biking routes and then selected the route that no one else uses. He may even have noticed that his use of that route did not register in later visualizations, so he knew he was invisible. He wasn’t playing nice.
But the concept of using big data, analytics, and visualization is untouchable in its perfection. We just need to perfect humans or weed out the imperfect.