This post is in response to the reciprocity expectation introduced in this article suggesting privacy reform acts. That article conceives of a retaliatory option allowing individuals to use similar techniques against their government or accusers. When I saw the heading of reciprocity, I had something else in mind.
In an earlier post on this blog, I explored some of my thoughts on the privacy implications of big data. In that post I suggested that when we are familiar with the underlying private information, we can readily recognize that the information can be reconstructed from meta data. I also suggested that this bias of knowing the secret blinds us to the fact that the revelation may not be obvious to an analyst who is not aware of the secret. Most of the argument about bulk meta data collection centers on the revealing of private information. Unfortunately, this argument misses a more frightening danger. The lesson above is that bulk data can (and will) find results matching a preconceived notion. The real danger is that the preconceived notion could be wrong or that the result could be a random coincidence.
With enough data, there is bound to be some data that matches a hypothesis. A query for a set of suspicious traits is likely to find several records. When this happens, the claim is inverted to say that such records have these traits that are very suspicious.
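To make this concrete, here is a minimal Python sketch. It is not drawn from any real system: it assumes a hypothetical population in which three independent traits each occur in 10% of people purely by chance. The function name, the trait rates, and the population size are all illustrative assumptions.

```python
import math
import random


def count_chance_matches(population_size, trait_probs, seed=0):
    """Count simulated people who show every trait purely by chance.

    Each trait is assumed to occur independently with the given probability.
    """
    rng = random.Random(seed)
    matches = 0
    for _ in range(population_size):
        # A person "matches the query" only if every trait is present.
        if all(rng.random() < p for p in trait_probs):
            matches += 1
    return matches


# Hypothetical inputs: three traits, each seen in 10% of people.
trait_probs = [0.1, 0.1, 0.1]
population_size = 200_000

matches = count_chance_matches(population_size, trait_probs)
expected = population_size * math.prod(trait_probs)

print(f"records matching all traits: {matches}")
print(f"expected by chance alone:    {expected:.0f}")
```

Under these assumptions about one person in a thousand shows all three traits by coincidence, so a query over 200,000 records should return roughly 200 people. None of them was selected for any reason other than the query itself.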
I consider it a logical fallacy when we invert our presentation this way. One individual is found in the results of a query of all people who have a set of traits. Then, there is an announcement that this particular individual has this trait. The latter presentation implies that that individual was specifically investigated and had cause for investigation. Normally this would involve some kind of justification for investigating this particular individual in the first place. In fact, there was no prior justification for investigation of this particular individual. Outside of that particular query, there was no reason to suspect this individual.
The fundamental goal of bulk data collection and data mining is to find suspects based on patterns of traits we presume appear suspicious when attached to an individual.
It is hard to recognize this problem working up inductively from our everyday experience. We have a certain degree of familiarity with everyday encounters. That familiarity guides us into recognizing what may or may not be suspicious.
As individual citizens we try to comprehend the issues of big data through induction, starting with examples from our own lives. If neighbors can have some general awareness of their neighbors, then we assume the same thing can be done at a larger scale. This inductive reasoning prevents us from realizing that something new emerges when the collection becomes so broad that there is no longer any familiarity with the individual entries.
A query against big data for suspicious characters could reveal results that would never be flagged as suspicious locally. Even if it were locally suspicious, it would be investigated locally involving people who are more likely to be familiar with the local circumstances. In contrast, big data queries are performed very remotely and trigger remote investigations.
Another way to describe this is to say that local accusations are increasingly debated at the level of national public opinion. Having no familiarity with the specific context, we readily accept the implications of the combination of suspicious traits. Suddenly, the entire nation or world becomes neighbors, but only for the instant when the accusation is discussed. This is silly.
As suggested in the first paragraph, reciprocity can be a counter-balance to big-data mining abuse. My suggestion for reciprocity is that we declare as open data all of the bulk data that the government collects. Based on court rulings that say that individuals have no expectation of privacy for meta data, that data should be in the public domain.
Recently there have been many initiatives by large organizations to share their data. This is broadly described as open data and modeled after the ideals of open source software. The data should be available for all to use. The initiative goes further by committing to standards and tools to make that data easy to access and to query.
My suggestion is to apply this same concept to government bulk data collection. If we have no expectation of privacy for this data, then everyone should have the same access to this data as government analysts do. We should have full access to the data and to the tools used by analysts for work that is not specifically protected by a warrant. The warrantless search capability of bulk data should be available to everyone.
One immediate consequence, if this were law, is that there would be more political opposition to this collection. It is one thing when non-private data is available to a select few. It is quite another when that same so-called non-private data is available to everyone. That consequence may be temporary, however. We can already find plenty of commercial services that allow anyone to pry into the non-private information of others. We are becoming accustomed to the fact that we can’t control this data.
The data is non-private; it is public data. The government should open its tools for everyone to access this non-private data.
My motivation for proposing reciprocal access to bulk data collections and tools is mainly the benefit of crowd-sourced scrutiny of the initial claims.
Take, for example, a dragnet-like query that results in some findings. With access to the same data and tools:
- We can show that very similarly troubling but different queries result in other findings. We can question why we should focus on the first when there are so many other equally troubling results.
- We can show that the data set is incomplete in terms of traits that could excuse the pattern as normal and not alarming.
- We can show that the data set has gaps in time that can raise questions about whether a one-time event is really routine for these cases.
- We can show that the data set has gaps in comprehensiveness. This can suggest that there would be a vastly larger number of findings for the same query. Either the combination of traits is very common and thus likely not to be a concern, or there are too many results to practically investigate them all.
- We can show the data include contradictory or mutually exclusive properties that can allow us to question the validity of the results.
- We can show that the query inappropriately excluded some important conditions, such as ensuring that all of the traits occur at the same moment in time.
- We can show that the information presented is inconsistent with the results we get with the same queries.
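As a minimal sketch of one of these checks, the time-coincidence point, the Python below compares a query that ignores when traits were observed with one that requires them on the same day. The records, the names, the trait labels, and both function names are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical observations: (person, trait, day observed).
observations = [
    ("alice", "trait_a", 1),
    ("alice", "trait_b", 1),
    ("bob", "trait_a", 3),
    ("bob", "trait_b", 200),
    ("carol", "trait_b", 5),
]


def people_with_all_traits(observations, traits):
    """People who showed every trait at any time (timing ignored)."""
    seen = defaultdict(set)
    for person, trait, _day in observations:
        seen[person].add(trait)
    return {person for person, found in seen.items() if traits <= found}


def people_with_all_traits_same_day(observations, traits):
    """People who showed every trait on one and the same day."""
    seen = defaultdict(set)
    for person, trait, day in observations:
        seen[(person, day)].add(trait)
    return {person for (person, _day), found in seen.items() if traits <= found}


query = {"trait_a", "trait_b"}
loose = people_with_all_traits(observations, query)
strict = people_with_all_traits_same_day(observations, query)

print("timing ignored:", sorted(loose))
print("same day only: ", sorted(strict))
```

In this made-up data the looser query flags two people while the stricter one flags only one. Anyone with access to the same data and tools could rerun the query both ways and ask which version produced the announced finding.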
In other words, give everyone the opportunity to participate in the practice of data science. Data science is the historical science of scrutinizing the available data and challenging conclusions. It is very labor intensive. For big data, almost certainly this scrutiny of data is grossly under-budgeted if it occurs at all.
Making the data and tools reciprocally available to everyone will allow us to more thoroughly scrutinize the data and the tools for correctness.
There is no legitimate reason not to make this data available to all. The government is allowed to collect it precisely because it has been determined to be non-private. All of this data should be available to the public at large to scrutinize as we wish.