For this post, I will discuss some observations about the numerous visualizations in this article on people movement. The article surveys multiple ways of visualizing location information for various ways of defining populations. The representative visualizations are interesting, and the article also links to other articles and to various live applications that are worth exploring.
It is likely that anyone following the above link will find so much material to explore that they’ll never return to this post. I still encourage exploring that data because it helps me make the point of this blog post.
In systems engineering, a dynamic system may have feedback loops. Feedback occurs when the results or products of the system influence the inputs to the system. Feedback may be positive (the system reinforces the fed-back information) or negative (the system works to limit or diminish it). Feedback may be intentional as part of the design, or unintentional and unavoidable. Feedback systems may be stable or unstable, but they often sit somewhere in between, at a critical point that is almost stable and unstable at the same time.
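To make the distinction concrete, here is a minimal sketch, using a made-up first-order difference equation rather than any particular physical system: the output is fed back into the next input with a constant gain, where a positive gain reinforces, a negative gain damps, and a gain near zero hovers near the critical boundary.

```python
# Sketch: a first-order system x[t+1] = x[t] + g * x[t].
# The gain g stands in for the fed-back influence; the values are illustrative.

def simulate(gain, x0=1.0, steps=10):
    """Return the trajectory of x under a constant feedback gain."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x = x + gain * x   # the output feeds back into the next input
        trajectory.append(x)
    return trajectory

print(simulate(gain=0.5))    # positive feedback: values grow without bound
print(simulate(gain=-0.5))   # negative feedback: values shrink toward zero
print(simulate(gain=0.001))  # near the critical point: almost neither
```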
In any case, we make a distinction between systems with feedback and those without. Systems without feedback are described as open or open-loop systems, with behavior that is indifferent to the effects the system is producing.
A cannon ball in mid-flight is an open-loop system: it will continue to follow a trajectory determined by the speed and direction it had when it left the cannon. In contrast, a guided missile makes course corrections based on external tracking updates or internal measurements of inertial changes. The guided missile uses feedback to adjust its path; the cannon ball does not.
An example of unintentional positive feedback is an earthen dam breached by flood waters. The water flowing over the top of the dam erodes the dam, allowing more water to flow over it at a faster rate. The combination of more water flowing faster hastens the erosion that ultimately eliminates the dam.
The last example seems analogous to what is happening in many parts of the world today. One event causes a breach that allows more and stronger events to erode and ultimately destroy a civilization. The analogy is apt because, while it takes many years to build a dam, the successive stages of destruction can occur in a single day. We are witnessing something similar with human events, and with a similar helplessness to do anything about it.
The analogy of the dam breach feedback is also relevant to this discussion because the dam is merely part of a boundary of a very large reservoir of water. Frequently we describe big data in terms of large bodies of water. With large amounts of data it is easier to think of the body of data instead of the individual data items or even data types.
In a recent post, I discussed a term that was new to me, data lakes, where the word lake takes on a derogatory meaning (lakes tend to be less appealing than oceans). Lakes are often formed by natural or artificial dams. The analogy can be stretched to compare the lake’s dam with the security systems we put in place to limit access to data. If we allow that security to be breached, then we run the risk of draining the body of data, or at least diminishing its value.
We often attribute the rise of big data, or large-scale data science, to technology that allows vast data storage in systems that enable fast retrieval and analysis. I think we overlook a major contributing factor: the way we have changed our attitude toward the sensitivity of data.
The real enabler of big data is our readiness to share data. In particular, the owners of the data close to the source allow their data to be shared or delivered to some central location. In earlier posts here and here, I presented as a new idea reversing the trend of combining data into a single source and instead building a supply chain that allows data owners to retain control over the integrity of their data. The motivation came from concerns about data quality and trust that I discussed in earlier posts such as here and here. Although I presented holding data close to the source as an innovation on the currently popularized big data model, these are actually old ideas. The reason we see opportunities for big data today is that we have abandoned our earlier preference for data sources to restrict access to their data.
We once considered operational data to be sensitive information, and we still consider it sensitive while it is relevant to the current operation. What changed is our earlier attitude that it was best to destroy (not preserve) operational data once it was no longer relevant: we considered operational data to remain sensitive long after it was directly relevant. Today, our attitude is that a lack of relevance to the immediate operation frees the data of any sensitivity.
Today, we take for granted that no-longer-relevant operational data becomes the property of another party such as a data warehouse or some other big data project. In many cases, there is no discussion at all about whether to share the data. For example, in the SIEM (security information and event management) field, which builds big data lakes by collecting log files from various systems, it is taken for granted that any and all logging data belongs to the data lake. The bulk of our investment is in how to get that data from the source to the lake. Once the data is in the data lake, it is no longer the responsibility of the data source. No one is concerned the way we once would have been.
I wonder whether we conned ourselves into changing our minds about the sensitivity of data at the source. We once were very insistent on protecting far smaller amounts of operational data, to the extent of preferring to destroy it after it was relevant rather than preserve it where it could be compromised. We considered any system that came in contact with that data to become as sensitive as the data itself. We restricted access to the entire systems holding sensitive data to a few trusted individuals. Once the data was irrelevant, we destroyed it.
The hypothetical con occurred during recent decades, as the focus of security shifted from securing data to securing systems. This shift occurred in response to real threats against systems in the form of unauthorized access and maliciously designed software. Given the publicity of system compromises, we focused on securing the systems from compromise. Somewhere along the line we decided that protecting the data is not as important.
I recognize that there is still a high investment in protecting data in the form of encryption and restricted-access networks. However, we did accept the notion of the source of data releasing its control over the fate of that data by handing it off in bulk to some other system. The original source of the data agreed to relinquish control over managing access and the eventual use of its data.
In the initially linked article we see many examples of people using services to save information that has some geographic tag attached. The location information is part of the information they deliberately wish to save on these systems. However, their participation in these systems implicitly grants others the right to access the metadata that includes the location information. The article illustrates the potential value of combining the metadata of a huge population of participants.
We no longer mind that we are releasing information about ourselves to be used in ways we never thought possible. The illustrations in the article may surprise some. A few may object that one of the data points is actually a record of their own presence at a particular location at a particular time. Many will overlook this fact because there are too many points to distinguish and the overall pattern is far more interesting.
A big reason for the recent popularity of big data is the availability of rich visualization technologies that present data in a way that is very compelling, intuitive, and even beautiful. The data are easy to query. The graphics algorithms are easy to invoke. The combination of the two produces very interesting visualizations about the past.
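As a rough illustration of how little code now separates a pile of location records from a compelling picture, here is a minimal sketch; the file name and the column names (user_id, ts, lat, lon) are assumptions for illustration, not any particular service’s schema.

```python
# Sketch: query a hypothetical table of timestamped check-ins and render a
# density map of the whole population. File and column names are made up.
import pandas as pd
import matplotlib.pyplot as plt

points = pd.read_csv("checkins.csv")          # assumed columns: user_id, ts, lat, lon
points["ts"] = pd.to_datetime(points["ts"])

# The "query" step: one line to restrict the picture to evening activity.
evening = points[points["ts"].dt.hour.between(18, 23)]

# The "graphics" step: one call to draw a log-scaled density map.
plt.hexbin(evening["lon"], evening["lat"], gridsize=80, bins="log")
plt.title("Evening check-ins (log-scaled density)")
plt.show()
```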
Many of the examples of visualization were drawn from individuals who were unaware that their movements would be available for aggregation into such large-scale visualizations. As a result, the initial visualization is successful at capturing where people are going and how they are getting there.
The problem is that this benefit is fleeting. The people did not realize that their data was going to be used this way. Even if they become aware of it through some boring textual narrative such as this post, they will probably not think much about it and will continue to allow themselves to be tracked.
Visualization changes everything: the visualizations are so beautifully compelling that they become a popular attraction. Publishing visualizations and making interactive tools for general use attracts huge audiences.
Visualization not only promotes big data, it also promotes big audiences. The feedback is complete when the audience includes those producing the data points.
What happens when big audiences feed back into the big data? I can imagine a scenario based on the article’s section “Many points from a lot of people”:
Enter the tracking of individuals today. Many apps for movement still focus on sports-like activity: Strava, RunKeeper, MapMyRun, Endomondo, and plenty more. The data is useful for people with the apps because they can keep track of how far and how fast they walked, ran, or cycled. People can use this information to set goals and to improve their performance. Some just want to be healthier.
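The “how far and how fast” bookkeeping these apps do reduces to simple arithmetic over GPS fixes. A minimal sketch, assuming a run is just a list of (timestamp in seconds, latitude, longitude) tuples; the sample track is made up.

```python
# Sketch: distance and pace from a list of (timestamp_seconds, lat, lon) fixes.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# A made-up three-point track: five minutes between fixes.
track = [(0, 38.8895, -77.0353), (300, 38.8921, -77.0277), (600, 38.8977, -77.0365)]

distance = sum(haversine_km(a[1], a[2], b[1], b[2]) for a, b in zip(track, track[1:]))
elapsed_min = (track[-1][0] - track[0][0]) / 60.0
print(f"{distance:.2f} km in {elapsed_min:.0f} min ({elapsed_min / distance:.1f} min/km)")
```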
Here, people are using the apps for their own private goals, but the data is available to make very interesting visualizations of the entire city. A site that offers real-time visualization of this information could draw a huge audience, and that audience would use the site often if the visualizations were interactive, allowing selection of different time periods or runner starting points.
The feedback scenario begins when runners discover there are other, lesser-used running paths that may provide variety to their routine. They will change their behavior and then come back to see if that made a change in the visualization for that time. They may coordinate with their friends to change an entire distribution of running routes. They may even coordinate deliberately to run on an undesirable route in order to show up in the graph, suggesting it is a desirable one, and then come back the next day to see who took the bait.
The initial success of the data visualization as a sociological tool to measure a population’s behavior is ruined by the popular success of the visualization among its audience. The visualizations will attract increasingly large audiences, and web sites will exploit this attraction to draw more visitors. It is inevitable that big data will meet big audiences to produce big feedback.
What can go wrong?
In the above scenario, the audience is the population being measured. The visualization attracts a larger audience than the locals. The attraction is to see behaviors of cities that one may never intend to visit. The audience of big data visualization is likely to be worldwide.
Again, to increase the attraction of the visualization tools, the tools will add more interaction to allow custom queries: show where groups of four or more are running together, where runners start and end at the same point, or where that point matches a specific address, such as a government agency. Everyone can see this data. Not everyone is intent on playing nice.
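To show how trivially such queries can be expressed once per-run start and end points sit in a table, here is a sketch covering two of them; the column names, the coordinates, and the 100-metre threshold are all assumptions for illustration.

```python
# Sketch: privacy-sensitive queries over a hypothetical table of runs with
# columns run_id, user_id, start_lat, start_lon, end_lat, end_lon.
import numpy as np
import pandas as pd

runs = pd.read_csv("runs.csv")  # hypothetical export of per-run summaries

def close_to(lat1, lon1, lat2, lon2, metres=100):
    """Rough proximity test using an equirectangular approximation."""
    dlat = np.radians(lat2 - lat1)
    dlon = np.radians(lon2 - lon1) * np.cos(np.radians((lat1 + lat2) / 2))
    return 6371000 * np.hypot(dlat, dlon) <= metres

# Runs that start and end at (roughly) the same point -- likely home addresses.
loops = runs[close_to(runs.start_lat, runs.start_lon, runs.end_lat, runs.end_lon)]

# Runs that start near one specific address (coordinates are made up).
agency_lat, agency_lon = 38.8977, -77.0365
nearby = runs[close_to(runs.start_lat, runs.start_lon, agency_lat, agency_lon)]

print(len(loops), "round-trip runs;", len(nearby), "runs starting near the address")
```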
Big data is made possible by two different changes. We celebrate the achievements of technology that make vast stores of data possible and easily accessible. We overlook the loss of our prior concern about the sensitivity of the data. Although the term opsec (operational security) is of recent vintage, at least in popular usage, we once had a strong paranoia that operational data could be abused. I think this was heightened by the popularity of operational research that followed World War II, where we learned many lessons of the war from historical observations and attempted to apply them for strategic benefit in both hot and cold wars. Whether deliberately or not, in the past even city planning data was closely held and not publicly available.
Today, operational research and its siblings, such as industrial engineering, are much less appreciated than they were in the mid-20th century. Taking their place in terms of popularity and aspirations is big data. But in many ways, big data is the opposite of operational research. This is most obvious in its appreciation of the sensitivity of operational data.
Operational researchers are very aware of how operational data can be exploited for strategic advantage. We don’t listen to them any more. I fear some of the dreadful current events are partly the consequence of this lack of paranoia about data sensitivity. A pessimist could conclude that the flood waters have breached the dam, and that the dam doesn’t offer much resistance to rapid erosion.
Visualization promotes big data and attracts big audiences to produce big feedback. It might be stable, but it might not. It might be controlled but we may not know by whom.