Big Data and Privacy

This article discusses the privacy concerns raised by the inevitable collection of data from increasingly intrusive devices. In this post I want to describe countertrends that may increasingly protect privacy. In particular, the sheer bigness of the data can cripple its usefulness for targeting individuals.

I agree that the trend is to extend data capabilities to devices closer to the individual. These devices will offer benefits to the individual while at the same time producing data about the person's behavior at an intrusive level. The trend is also for such devices to be connected and to share data with other systems or with services in the cloud. These external services will also provide benefits to the consumer, but they will move the data outside of his control.

As we are learning with seemingly benign metadata such as cell phone call records, there is a lot that can be reconstructed about a person’s private life by correlating data over time and matching it with other data (enrichment data).  

Things as innocent as a self-reporting refrigerator that monitors inventory and door openings and closings can be very informative about a person’s habits with respect to consuming and handling food. While the individual consumer may benefit from being able to remotely check whether he needs to pick something up at the store, the same information may be used in ways that compromise his privacy.

Once the data leaves a device, and especially once it enters the cloud, the person loses any control over that data. As we have seen, the courts side with the idea that such data is fair game for other uses, both by the owner of the service and by government. It will only get more intrusive in the future.

One trend that I’ve been discussing in earlier posts is a growing appreciation for the unreliability of data used outside the context of its original purpose.

Imagine a data collection of all bar tabs in all establishments. This data identifies the drink served, the time it was served, who it was served to, and the number in his party. This data is a sales record for the purpose of settling a bill at the end of the stay. It can be very attractive to use this as evidence that a person may have drunk too much that evening, especially when interpreted by some remote analyst tasked with collecting data on a particular person. However, the data is a record of a commercial transaction. It is not a record of the drinks actually being drunk. Perhaps the drinks were only partially drunk, or not drunk at all.

In the formative period of the initial rollout of big data concepts, part of the marketing was that unrelated data can be exploited for purposes it was not initially meant for. This is still a big attraction, but we are becoming more aware of the weakness of data used in this way.

Many times we look at privacy starting with a premise. Individually, we can imagine something we would like to keep private. We are concerned when we find out that this private information can be pieced together from the data artifacts we leave behind.

This is an unfair way to judge the privacy issue because you know what the private information is and you can map that onto the data.  The actual data is ambiguous and may be ignored entirely unless the investigator already suspects something.

Our data may become a false witness to something we are not responsible for.

Consider the example data again, but this time someone else starts with a different premise and pieces the information together to confirm his suspicions. It is not hard to see how our data can confirm a false premise. After all, big data is all about being creative in interpreting data in new ways.

Big data shares a lot in common with fortune telling: the same data can support multiple alternative realities.

As we gain more experience with big data we will come to appreciate this weakness. Reuse of unrelated data is not a substitute for direct data.

One scenario I heard was that of a person who finds out he has cancer and wants to keep it secret as he calls his insurance company, some specialists, and hospitals. The call records can compromise his secret. But that same data may also suggest that he is practicing medicine without a license, or that he is engaging in insurance fraud. The information doesn’t come out of the data. Instead, the suspicion comes first and the data can be found to match the suspicion.
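
To make the point concrete, here is a minimal sketch of how one set of call records can "confirm" several unrelated suspicions. Everything in it is invented for illustration: the call categories, the counts, and the rules an analyst might use to decide that the data is consistent with a suspicion.

```python
# Hypothetical call log for one person: category of the party called -> number of calls.
call_log = {"insurer": 6, "oncologist": 4, "hospital": 5}

# Each suspicion is something the analyst brings to the data beforehand.
# The set lists the call categories the analyst would accept as "consistent" with it.
# These rules are assumptions made up for this sketch.
suspicions = {
    "has a serious illness": {"insurer", "oncologist", "hospital"},
    "practices medicine without a license": {"oncologist", "hospital"},
    "commits insurance fraud": {"insurer", "hospital"},
}


def consistent_suspicions(log, rules):
    """Return every suspicion whose required call categories all appear in the log."""
    return [name for name, needed in rules.items() if needed.issubset(log)]


matches = consistent_suspicions(call_log, suspicions)
for name in matches:
    print("data is consistent with:", name)
```

All three suspicions pass the check against the same records. The data does not choose among them; whichever one the analyst started with is the one that gets "confirmed".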

A recent trend is toward more scrutiny of the use of big data. Big data's primary benefit is in hypothesis discovery: giving one a clue to start an investigation. It is almost always misused when it serves as supporting data for hypothesis testing, as a substitute for an investigation or a well-designed experiment with proper controls and statistical tests.

Unfortunately, most of the recent utilization of big data has been in its abuse as a substitute for direct investigation or experiment design and execution.   Hopefully, we will become more aware of this kind of misuse of big data.

In the meantime, the data is being misused, and the greater risk is being confronted with accusations that are backed up with misused data.

The conclusion of the big-data analysis may be something completely false, yet the data is credibly consistent with the accusation. The accused is at an unfair disadvantage because he only has access to the analyst’s query results.

Big data also presents opportunities for counter queries.   The nature of big data is its comprehensiveness.   Unlike a targeted investigation, the available data extends far beyond the specific case.

If they have this data on us, then they must also have a lot of other data. This large pool of data can provide evidence that the incriminating pattern is not unique to this one instance or unique to us. This large pool of data also inevitably has large gaps where entire populations or time periods are missing, which increases doubt about the association of this particular pattern with the accusation.
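
A counter query of this kind could be as simple as the sketch below. It assumes a hypothetical data set of call categories per person and an assumed population size. It asks two questions: how common is the "incriminating" pattern among everyone in the data, and how much of the population does the data cover at all?

```python
# Hypothetical records: person id -> set of call categories seen for that person.
records = {
    "accused": {"insurer", "oncologist", "hospital"},
    "p2": {"insurer", "hospital", "oncologist"},
    "p3": {"hospital", "insurer"},
    "p4": {"oncologist", "insurer", "hospital", "pharmacy"},
    "p5": {"pharmacy"},
}

# Assumed number of people who would be in the data if it had no gaps.
POPULATION_SIZE = 8


def pattern_base_rate(data, pattern):
    """Fraction of recorded people whose calls include every category in the pattern."""
    hits = sum(1 for categories in data.values() if pattern.issubset(categories))
    return hits / len(data)


def coverage(data, population_size):
    """Fraction of the assumed population that appears in the data at all."""
    return len(data) / population_size


pattern = {"insurer", "oncologist", "hospital"}
rate = pattern_base_rate(records, pattern)
cov = coverage(records, POPULATION_SIZE)
print(f"share of recorded people matching the pattern: {rate:.0%}")
print(f"share of the population present in the data: {cov:.1%}")
```

In this made-up data the pattern matches most of the people recorded, and a sizable part of the population is missing entirely. Either result weakens the claim that the pattern singles out the accused.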

Big data is frightening because it seems inevitable in its growth and its availability for abuse.

The hope for privacy is not necessarily lost because of this inevitability.  

We still have options: we can demand access to the same big data to defend our privacy. We should demand from big data a capability similar to the Freedom of Information Act (FOIA), where we can request results from queries of our own construction that collect evidence to show that observed patterns are unreliable or ambiguous.

There are a lot of inherent weaknesses in big data. It is incomplete. The interpreted information is ambiguous.

It is too big to support a heavy load of widespread FOIA-type requests made in defense against earlier big data findings.

