Privacy and the Primary Key

In earlier posts, I described the labor consequences of different categories of data and how this cost is generally underestimated because there is a sense that all data is alike once it enters a data store. In those posts I described different levels of trustworthiness of data, where less trusted data requires more frequent and more highly skilled labor to scrutinize it for continued validity for its assigned purposes. For example, Bright data is the gold standard of well-documented and well-controlled data that requires only minimal, low-skill labor for routine scrutiny. Those posts implied higher labor costs based on the skills required to deal with less trustworthy data. This post considers a different cost factor for labor: the cost of trusting the analysts.

In recent years there has been a widespread acceptance that private information available in data should receive additional protections.   There are numerous laws about how such private data should be handled.   I’ll refer to all of these as PII (personally identifiable information) rules.    PII rules require that specific types of information not be exposed in any public way.  This information must be encrypted and physically protected.  If it is accessible at all, there should be a very narrow group of trusted officers who can have access to this information.   Generally, we’d expect these officers to have additional background checks, training, and obligations to protect and to not abuse this data.   These officers work under rules where violations can result in criminal charges against them.

These security requirements on the operators add a cost to the labor beyond the skills involved. The actual data skills may be very minimal, but these additional burdens will increase costs. The background checks themselves are very costly and become part of the cost of the labor. These checks narrow the candidate pool, so costs are driven up by high demand. Typically the checks are periodic or revocable, and this can lead to higher turnover. The jobs themselves carry higher stress due to the legal jeopardy involved. Positions with access to PII are more expensive than positions without this access.

At first thought, it seems like PII can be easily quarantined into a limited store and thus have a very limited need for this specialized labor.   But personally identifiable information is a primary key that allows systems to match disparate data sources to fill out information about the individual, providing more dimensions of information about the individual.

For example, the personally identifiable information allows association of an employee’s office location with his position and his salary.   We may only be interested in aggregate information such as the mix of different salary-ranges in a particular area, but to get this information we need to reach back to some way to match the two pieces of information.

Theoretically, systems can be built to isolate PII so well that only a very small number of people will be exposed to it. This can be done by using intermediate tables that map an arbitrary non-identifying key for general use to the private identifying information. This is how we approach new projects: they assign arbitrary identifiers and keep the private information from ever entering the system.
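The intermediate-table idea can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the `IdentityVault` class, the `sanitize` helper, the salary bands, and all of the sample records are made up for this sketch and do not describe any particular system. The vault stands in for the small, restricted mapping table. Everything outside it sees only arbitrary keys, yet can still match office locations to salaries and count salary ranges per area.

```python
import secrets
from collections import Counter


class IdentityVault:
    """Hypothetical restricted table: private identifier -> arbitrary key."""

    def __init__(self):
        self._key_by_identity = {}

    def sanitized_key(self, identity):
        # Assign a random key the first time an identity is seen and
        # return the same key on every later lookup.
        if identity not in self._key_by_identity:
            self._key_by_identity[identity] = secrets.token_hex(8)
        return self._key_by_identity[identity]


def sanitize(rows, vault):
    """Replace the identifying first column of each row with its arbitrary key."""
    return {vault.sanitized_key(identity): value for identity, value in rows}


def salary_band(salary):
    # Illustrative bands only.
    return "under 75k" if salary < 75000 else "75k and over"


# Made-up source records keyed by a made-up identifying number.
office_source = [("111-11-1111", "Denver"), ("222-22-2222", "Denver"), ("333-33-3333", "Austin")]
salary_source = [("111-11-1111", 52000), ("222-22-2222", 98000), ("333-33-3333", 61000)]

# Only this step touches the identifying numbers.
vault = IdentityVault()
offices = sanitize(office_source, vault)
salaries = sanitize(salary_source, vault)

# General-use analysis: match the two sources on the sanitized key alone.
band_counts = Counter(
    (offices[key], salary_band(salaries[key]))
    for key in offices.keys() & salaries.keys()
)
print(band_counts)
```

Only the code that builds the vault ever handles the identifying numbers, so in this sketch that is the only step that would need the specially qualified staff.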

However, a characteristic of big data solutions is to retrieve data from existing systems and reuse this information for new purposes. Often the existing systems have very limited missions with limited scope for the use of the data. For example, the original mission may be to use the data only temporarily and then discard it. Also, many existing systems are very old and use personal information, but in such a way that it is difficult to exploit for misuse. The big data solution is to find a practical way to retrieve this data and deliver it to a central data store. Given that the engineering costs all fall on the big data project, the practical approach is to have the existing systems deliver data exactly as it exists internally. The big data project is thus burdened with the obligation to protect personally identifiable information. In addition, the big data project needs this information in order to match records from different data sources.

Inevitably, the majority of the operators of the big data solution need to be burdened with the additional requirements for protecting the PII that will be within their grasp. For many projects, the additional cost of revocable background investigations and the threat of legal jeopardy are inherent in all data-science positions in the project. This is very explicit in the vast number of government positions that require very high clearances. For most of these jobs, the high clearances are required because the duties will inevitably expose staff to this sensitive information even if their tasks do not directly involve this data.

They are taking the easy technical way out. Instead of building technologies to effectively quarantine this sensitive information, they simply make sure everyone is qualified to handle it. Even when there is no complaint about the additional cost (and for government, there doesn’t appear to be much concern about this cost), there remains the excessive broadening of risk. The more people who have access to PII, the more likely it is that someone will misuse it.

There needs to be much more investment in building sanitized keys for data. A sanitized key exposes no sensitive information but still allows records to be combined to complete a picture.

An example of this type of large scale redesign is when companies switched over to an arbitrary employee number instead of using the equally unique identifier of the social security number.   The companies still have the tasks of matching employees with their paycheck data and their tax obligations.   They invested in the design to minimize the exposure of this information to only the places that absolutely need it.    For the rest of the business, the sanitized employee number allows for matching employees to their phone numbers, office locations, and positions within the organization chart.

Various laws are forcing companies to make these large scale end-to-end redesigns to protect this information.   Even without those laws, they still have the incentive to avoid inflating labor costs by revocable background investigations and individual-level legal obligations.    This is smart business.

On the other hand, government doesn’t have this same kind of incentive.    In government, new programs are built on top of old programs.   The old programs are managed by different departments and these programs have no resources or budget to make the types of large scale redesign needed to adopt effective methods to sanitize the personally identifiable information.    The result is that this sensitive data is moved between departments and the new program inherits all of the burdens of protecting that data.

The solution for the government is to make sure everyone is qualified to handle this data and accepts personal liability for any misuse of it. Even if this is acceptable, the huge number of people involved makes it likely that someday someone will abuse this privilege. The huge number of such staff also makes it harder to detect this abuse, or makes it take longer to investigate. It would be much more efficient if the sensitive data could be isolated or compartmentalized effectively so that the people who have an absolute need to use this information are the only ones with access to it.

I am recalling a recent DHS (Department of Homeland Security) solicitation to build a nation-wide system to access license-plate reader data collected privately or by local governments.

As background, there is a boom in exploiting cheap video cameras with automated character recognition to record the license plate information of vehicles within view of the camera. These systems are used for short-term purposes such as parking lot security or catching stop light and speed limit violators.

I presume that for the vast majority of these local systems, the recorded information is stored only briefly and then discarded. Knowing the storage capacities of available technology, this is probably a longer period than what would be needed: additional software would be needed to clear out old data, since there is almost certainly capacity to retain this data much longer. In any case, the data is stored in isolated projects with a very narrow mission, so there is little opportunity to use it for any purpose outside of the primary mission.

As I understand the solicitation, the request involved setting up some means to allow centralized access to this data to facilitate other law enforcement activities.   This would inevitably involve using the data over longer periods of time than originally intended.

For example, a recording of a license plate in a parking garage is only needed for the period the car is parked. Although this data may be held for a period of time in case there is a need to track down a parking violation, the data itself is meant to be useful only for the duration of time the car is in the parking garage. Reusing this data for DHS purposes will expand that window of utility beyond what was originally intended. There may have been a policy decision that permitted the cameras only because the information would be useful for a limited period of time (while the car is on the premises) and would not be used for tracking that car elsewhere.

I understand the DHS position. This license-reading technology is being very widely deployed all over the country. Although there are different technologies involved, they all reduce to a common key (license plate information) that can be used to tie all this data together. This tracking can be useful to identify certain sequences of interest for investigation. These sequences of interest may not necessarily require identifying the individual. All that is being sought is a pattern showing where to focus investigations, which can then proceed to warrants to get specific information about the individual.

The problem is that the only available information at the source is the identifiable information of the license plate. License plate information is not PII. License plate information is publicly visible information. There is no expectation of privacy. However, it can still be linked to an individual vehicle owner. We would prefer to postpone that linkage in an investigation until sufficient evidence exists to suspect guilt.

I have no idea what specific plans DHS had for its solicitation.   I am imagining a scenario where they are looking for patterns to narrow investigations.

Such patterns could be possible by finding some arbitrary identifier that the individual sources can substitute in place of the license plates. The identifier might be a one-way hash that includes not only the license plate information but also the calendar date. A one-way hash produces an effectively unique identifier that cannot be reversed to reveal the original plate information. The additional seed value of the calendar date would limit its utility to a specific period of time. To make this work, all of the sensors would have to standardize on the same algorithm so that the resulting keys can be matched wherever else the same vehicle is observed.
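A minimal sketch of such a date-limited identifier follows, written in Python. It makes assumptions that go beyond the description above, so treat it as an illustration rather than a proposal. In particular, the space of possible license plates is small enough that a plain hash could be reversed simply by hashing every possible plate, so this sketch swaps in a keyed hash (HMAC with SHA-256) and assumes a secret shared by all readers. The secret, the function names, and the plate normalization rule are all hypothetical.

```python
import hashlib
import hmac
from datetime import date

# Hypothetical secret that every reader would need to share (an assumption).
SHARED_SECRET = b"example-secret-distributed-to-all-readers"


def normalize_plate(plate):
    """Uppercase the plate and drop spaces and punctuation so all readers agree."""
    return "".join(ch for ch in plate.upper() if ch.isalnum())


def sanitized_plate_key(plate, day, secret=SHARED_SECRET):
    """Return a keyed one-way hash of the plate combined with the calendar date."""
    message = f"{normalize_plate(plate)}|{day.isoformat()}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


# Two different readers seeing the same car on the same day produce the same key.
garage = sanitized_plate_key("abc-1234", date(2014, 3, 1))
tollbooth = sanitized_plate_key("ABC 1234", date(2014, 3, 1))

# The same car on the next day produces an unrelated key.
next_day = sanitized_plate_key("ABC 1234", date(2014, 3, 2))

print(garage == tollbooth)
print(garage == next_day)
```

Because the date is part of the hashed message, sightings can be matched across sensors within one day but not across days. A coarser or finer time window could be chosen by changing what is mixed into the message. Anyone holding the shared secret could still enumerate plates, so the secret itself would need the kind of narrow custody discussed earlier.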

The government could propose a rule that would require license plate readers to implement a common hash algorithm to permit reusing this information on larger scales. This would require more regulation and legislation that in turn would face opposition due to increased costs for both vendors and consumers of these technologies. Even if this were easily passed, it would take a long time for the license plate readers to be replaced with the newer technologies.

It is so much easier to export the data as it exists inside the individual readers and send it directly to a central location that will then assume full responsibility for handling that data. This is simpler in technical terms. It can be done relatively quickly. We have already established some trust in the operation of DHS and its staff not to abuse this information.

It is still not a wise solution. It is not wise to replace a technical solution that can assure safe handling of information with a labor solution that promises to regulate staff to safely handle this information. The promise of safe handling is placed in the hands of the individuals who have access to this information. Also, the promise can change over time with a simple policy change directing the operators to use the information in a way not originally acceptable: for example, to actively track individuals by name without a warrant.

To summarize, big data solutions can provide valuable insight by finding patterns of properties that do not identify individuals. However, to build those properties, they need some kind of primary key to match data from multiple sources. The most convenient key is one that exposes personally identifiable information. The easiest solution is to propagate this sensitive information into the large data store, thus exposing it to large populations of users. We address this concern by imposing costly qualifications on these staff through revocable background investigations and through threats of criminal charges for any abuse. The choice of a labor solution instead of a technical solution puts the sensitive data at risk from some misbehaving analyst or from some policy change expanding the use of this data beyond its originally approved purposes.

The increased cost of this labor will discourage us from expanding the labor to handle the data quality issues I have previously raised.   We need to treat all data as gold-standard data because we can’t afford the sensitive-data-qualified labor to scrutinize more suspect types of data.

