Big data can re-identify de-identified data

Last year, I wrote a post that extrapolated from my experience with data to understand the challenges of protecting privacy in data.  I concluded then that private information would probably be revealed eventually as more data dimensions are added to an analysis.  Thus, the project should be staffed with people who are properly trained and authorized for access to that private data on the rare occasions when an analysis accidentally reveals it.

In subsequent posts, I tried to show how the design approaches I used in my projects could be helpful in protecting privacy.

My projects used a multiple-step process to simplify the implementation of a large-scale project with a small team.  I took a divide-and-conquer approach to the large problem, defining smaller projects with distinct database schemas and dedicated processing resources so that eventually the series of projects would accomplish the larger goal.  In my hindsight attempt to describe this sequence of processes, I compared it to an industrial supply chain with upstream providers delivering finished products that would become components of the larger product.  The final step would be like a brand-name factory shipping packaged products ready for retail.  Again, my motivation for this design was to start with simple functionality and then spiral out to meet larger, more complex functionality.  It ended up looking like a supply chain.  After identifying the supply chain analogy, I wondered whether this approach could be useful for protecting compartmentalized data.

I took the idea a step further to describe a different approach to data processing where the data owner would keep his data in house.  Instead of delivering the source data in bulk for loading into downstream data warehouses, he would offer data enrichment services to his customers downstream in the supply chain.  In this model, the downstream analysts would provide detailed structured queries for the precise data they want enriched with the source’s data.  The data source owner would process the query using his own resources and his local data store and deliver back only the enrichment that matched the specific queries.  I imagined that typically that enrichment would involve results aggregated into broad categories (a generic definition of map/reduce).  The data source owner would inspect his processed result before delivering it back to the requester.  That quality control check would ensure that the enrichment meets his standards (or contract terms) and his obligations to protect against unauthorized disclosure of sensitive information.

This approach benefits the data source owner in many ways.   The data source owner best understands his own data and how it should be used to enrich outside data.  In contrast, the currently common approach of releasing bulk data puts the reputation of the data source at risk due to the incompetence of unknown and unpredictable users of that released data.   It would be safer to never allow the raw source data to leave the data source at all.

I proposed that this could be a strategy for protecting the sensitive information by never releasing the data in the first place.  For example, in a medical record scenario, the health care provider possessing patient records could offer a query-handling service where clinical researchers submit specific categorizing and summarizing queries selected from a catalog of query services offered by the health care provider.  The health care provider would run the query on its own data systems and return only the relevant matching information, after a quality-control check that the result is complete and releases no privacy-protected data.
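The query-service idea above can be sketched in a few lines.  This is only a minimal illustration under my own assumptions: the record fields, the `MIN_CELL_SIZE` threshold, and the function names are all hypothetical, and the quality-control step is reduced to withholding any category with too few patients.

```python
from collections import Counter

# Hypothetical in-house patient records; in this model they never leave the data owner.
PATIENT_RECORDS = [
    {"patient": "A", "treatment": "chemo", "age_band": "40-59"},
    {"patient": "B", "treatment": "chemo", "age_band": "40-59"},
    {"patient": "C", "treatment": "chemo", "age_band": "40-59"},
    {"patient": "D", "treatment": "radiation", "age_band": "60+"},
]

MIN_CELL_SIZE = 3  # assumed release threshold for illustration, not a regulatory value


def count_by(records, columns):
    """Catalog query: count records for each combination of the given columns."""
    return Counter(tuple(r[c] for c in columns) for r in records)


def quality_check(counts, min_cell=MIN_CELL_SIZE):
    """Owner's check before release: withhold any category that is too small."""
    return {key: n for key, n in counts.items() if n >= min_cell}


def enrichment_service(columns):
    """Run the query in house and return only the checked summary."""
    return quality_check(count_by(PATIENT_RECORDS, columns))


result = enrichment_service(["treatment", "age_band"])
print(result)  # only the three-patient chemo category is released
```

The requester sees category counts only; the single radiation patient is withheld because that category falls below the assumed threshold.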

Although I haven’t used their services, I noticed that some data providers do offer similar services, such as those offered in Microsoft’s Azure data marketplace.  The data services offer specialized data that they make available through individual transactions (typically returning one record).  It is possible that the marketplace transaction engine could be made available as a commercial product for medical providers.

The more common approach in data systems is to build larger data warehouses by negotiating bulk transfers of data from smaller data warehouses.  The data transfers are typically exhaustive so that the source loses any control over how that data is used later.   This approach made sense historically when data storage and processing were expensive.  The data source welcomed the opportunity to release the data to a central data warehouse so it can economize on its primary task of collecting new data.

For enterprise data warehouses, this approach presents no concerns because all of the sub-entities belong to the same enterprise.  A large corporation can assume that it should have access to any information generated within any part of the organization.   Certainly, there may be more carefully controlled access to the larger data warehouse system, but that data warehouse would have access to very detailed data about the enterprise.

Data warehouse technologies have been optimized over decades of development to support this approach.  The approach is implicit in the very name: a data warehouse is the final resting place for storing all data.  Smaller upstream implementations of the same technology are sometimes called datamarts.

The problem in the health care industry is that there is no fully integrated enterprise that can claim ownership of the broad accumulation of all data relevant to health care.   Distinct enterprises exist for a wide variety of health care providers.  Health care providers are distinct from insurance companies (health care payers), pharmacy dispensaries, pharmaceutical companies, medical device companies, medical clinical researchers, etc.

The enterprise data warehouse model is a poor choice for the medical industry (in the US) because the industry is not integrated into a single enterprise responsible for all of the functions.  However, the enterprise data warehouse technologies (and related big data technologies) are mature enough to offer powerful benefits at low cost.  It may be prohibitively expensive to build a whole new technology scheme such as my proposal of data enrichment close to the source, and the result may be less capable than the data warehouse (or data lake) approach.

One of the challenges of sharing data within the medical ecosystem is compliance with HIPAA rules protecting patient privacy.  To exploit the big data technologies, there is a need to share data in bulk, but HIPAA demands that the data be sanitized against unauthorized disclosure of private patient information.  The solution is to implement some form of de-identification that strips the identity from the shared data while retaining as much value as possible for the later analysis.

I only recently encountered the debate about the challenges of de-identification.  In particular, I admit up front that I have not studied this issue in any depth.  Instead I’m interpreting the debate from my own experience of how hard it is to prevent rediscovery of deliberately removed sensitive information when the data are combined with a large number of other data sources.   From my experience, the data hidden from attempts at sanitization will inevitably be revealed when combined with enough dimensions of additional data.

Based on my minimal understanding of the de-identification debate for medical data, the debate appears to be about how strong to make the de-identification and then how to prove that a particular implementation meets that goal.

The ultimate in de-identification is complete confidence in the impossibility of re-identification.  This would be hard to prove.

To me this is similar to the goal of computer security products trying to protect against any type of malicious attack.  After watching the progress of computer security over the past three decades, I’m still disappointed to learn of new exploits against what we supposed should be very secure systems.  I imagine the same kind of experience will occur with any approach claiming high confidence in the impossibility of re-identification.  It is just a matter of time before someone proves them wrong.

A perfect protection against re-identification is likely to be stripped of too much dimensional information to be useful in most types of analysis.   For example, a hospital report of annual totals of various treatments would have nearly zero risk of identifying the patients, but this data would not be useful for clinical researchers studying the sensitivity for different variables such as treatment schedules and the sex, age and race of the patient.   Addition of this detail to satisfy the analyst increases the risk of identifying the patient.

A more pragmatic approach to de-identification appears (to me) to be more common.  This approach accepts some very slight risk of re-identification.  For example, k-anonymity is an approach to deliver data so that an identity can be narrowed only to a group of at least k actual patients.  Accompanying this approach is an assurance of how improbable it will be for the process to ever release information that can be narrowed to fewer than k patients.
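As a rough illustration of what the k in k-anonymity measures, the sketch below counts how many records share each combination of quasi-identifier values and reports the smallest group.  The sample rows and column names are invented for the example, and real implementations also generalize or suppress values to reach a target k, which is not shown here.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


# Hypothetical de-identified records (no names, only banded attributes).
rows = [
    {"age_band": "40-49", "sex": "F", "zip3": "201", "diagnosis": "X"},
    {"age_band": "40-49", "sex": "F", "zip3": "201", "diagnosis": "Y"},
    {"age_band": "50-59", "sex": "M", "zip3": "201", "diagnosis": "X"},
    {"age_band": "50-59", "sex": "M", "zip3": "201", "diagnosis": "Z"},
    {"age_band": "50-59", "sex": "M", "zip3": "220", "diagnosis": "X"},
]

k_two = k_anonymity(rows, ["age_band", "sex"])
k_three = k_anonymity(rows, ["age_band", "sex", "zip3"])
print(k_two, k_three)  # 2 1
```

With two quasi-identifiers every record hides among at least one other; adding a third dimension leaves one record in a group of its own, which is the effect the rest of this post is concerned with.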

It seems to me that the calculation of the likelihood of re-identification assumes a scenario of a malicious attacker making a deliberate effort to identify a specific case.  For example, someone may be determined to find the patient identity matching the clinical record for a specific cancer treatment.  An alternative targeting example would be someone attempting to find health information about a specific individual.

These are important problems, but from my experience they will be far rarer than the accidental recovery of many identities from analysis of data with a large number of dimensions.  Such an analysis targets neither the identity for a specific case nor a case for a specific person.  Instead the analysis produces unique patient-case combinations in its outputs from a multidimensional or map-reduce query.  Such a query will produce a large number of summaries where a subset of the results may reveal unambiguous associations of medical information with a specific person.

Based only on my distantly related experience, I would expect that this risk increases as the analysis involves more dimensions as is characteristic of higher level policy making.  The following is an attempt at providing a fictional example in medicine.

Suppose there is a de-identified clinical database of treatments and outcomes for a certain class of cancers.  The records are of individual cases but the records have been de-identified to some level of k-anonymity with the assurance that it is unlikely to identify an individual patient.  Among the data available in this database are the treatment type and schedule including the prescription drugs.

The clinical researchers also have information about insurance companies, including the terms of their different policies and what markets those policies serve.   The insurance information includes the formularies of the prescriptions they make available to patients through co-insurance or discounts.

This study is part of a larger study of overall community health so they have access to similarly de-identified daily attendance data (such as timesheets) from a select number of local employers, schools, and other institutions.  This attendance data is not comprehensive, but provides a sample for the purpose of understanding lost productivity due to all illnesses.   Perhaps this study includes a comparison of frequency of absences with the frequency of various health conditions.

This study also has a dataset of license-plate reader data for parking garages of major hospitals.  Perhaps the rationale for this is to observe repeat visitors to the hospital compared with one-time visitors.  In terms of total visitors, this data is a small fraction of all visitors but it is considered a useful proxy for all visitors.

From my experience, I could go on with many other data sources that may be available for policy-making level analysis.  Browsing the data sources in the Azure marketplace provides a lot of other interesting options.  My point is that there are a lot of data trails available that can be matched at least tentatively or vaguely to clinical records.

In the analysis of this data, the data science team studies the various data sets to find column combinations that can serve as useful keys for matching records.   This process involves tools to measure the uniqueness of various column combinations.  The selected columns do not have to be unique because the future queries will involve some algorithms to handle the duplicates as well as the inevitable nulls of non-matching records.  Part of the null and duplicate handling may involve manually prepared tables to provide default values for nulls or flags to indicate most-likely match among many.
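The uniqueness measurement described above might look something like the following sketch.  It is not the tooling I used; the scoring rule (fraction of rows whose values occur exactly once), the function names, and the sample columns are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def uniqueness(rows, columns):
    """Fraction of rows whose values in `columns` occur exactly once."""
    keys = [tuple(r[c] for c in columns) for r in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


def rank_key_candidates(rows, columns, max_width=2):
    """Score every column combination up to max_width, most unique first."""
    scored = []
    for width in range(1, max_width + 1):
        for combo in combinations(columns, width):
            scored.append((uniqueness(rows, combo), combo))
    return sorted(scored, reverse=True)


# Hypothetical sample of a dataset being profiled for join keys.
rows = [
    {"visit_day": "Mon", "garage": "North", "age_band": "40-49"},
    {"visit_day": "Mon", "garage": "South", "age_band": "40-49"},
    {"visit_day": "Tue", "garage": "North", "age_band": "50-59"},
    {"visit_day": "Tue", "garage": "South", "age_band": "50-59"},
]

ranked = rank_key_candidates(rows, ["visit_day", "garage", "age_band"])
for score, combo in ranked:
    print(f"{score:.2f}  {combo}")
```

In this toy sample no single column distinguishes any row, yet two of the column pairs distinguish every row.  A combination that scores well as a join key is, by the same measure, a combination that narrows records toward individuals.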

The team has no intent to identify the patient.  Instead, their motivation for these studies is to provide drill-down details to support verification of higher level analytic results.   For example, the goal of the drill-down verification study may be to see a sample of treatment types that results in a particular number of repeat visits to the hospital.

In one of the verification studies involving this drill-down data, they encounter a subset of unique matches of clinical records with license-plate reader and absentee data.  This set of unique matches would be a scattering of cases with no particular pattern.  Although there was no intention to target a particular case or a particular individual, the matched records may include information such as:

  • Diagnosis and treatment
  • Age, sex, ethnicity of patient
  • License plate of vehicle most likely transporting patient for recurring visits
  • Employer whose insurance policy has restrictions that forced the use of an unusual prescription
  • Absentee information consistent in time period and frequency for the recurring treatments

For this example, the actual individual is not positively identified in the data itself, but if this data were released to a broader audience, there is a far higher risk of identifying the individual than first assumed with the k-anonymity de-identification.
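To make the fable a little more concrete, here is a toy version of the drill-down match.  Everything in it is invented: the three datasets, their field names, and the matching rule (another record "matches" a case when its days cover every treatment day of that case).  A real study would use fuzzier matching with the duplicate and null handling described earlier.

```python
# Hypothetical de-identified inputs; days are day-of-month strings.
clinical = [
    {"case": "c1", "age_band": "50-59", "visit_days": {"03", "10", "17"}},
    {"case": "c2", "age_band": "40-49", "visit_days": {"04", "11", "18"}},
]
garage = [
    {"plate": "P-111", "seen_days": {"03", "10", "17"}},
    {"plate": "P-222", "seen_days": {"03", "05"}},
    {"plate": "P-333", "seen_days": {"04", "11", "18", "25"}},
    {"plate": "P-444", "seen_days": {"04", "11", "18"}},
]
absences = [
    {"employer": "E1", "worker": "w7", "absent_days": {"03", "10", "17"}},
    {"employer": "E2", "worker": "w9", "absent_days": {"12"}},
]


def matches(case_days, other_rows, day_field):
    """Rows whose day set covers every treatment day of the case."""
    return [row for row in other_rows if case_days <= row[day_field]]


def drill_down(clinical, garage, absences):
    """Join each clinical case to candidate plates and absentee records."""
    report = []
    for case in clinical:
        plates = matches(case["visit_days"], garage, "seen_days")
        workers = matches(case["visit_days"], absences, "absent_days")
        report.append({
            "case": case["case"],
            "plates": [p["plate"] for p in plates],
            "workers": [(w["employer"], w["worker"]) for w in workers],
        })
    return report


report = drill_down(clinical, garage, absences)
# Cases where the join leaves exactly one candidate in every other dataset.
unique = [r for r in report if len(r["plates"]) == 1 and len(r["workers"]) == 1]
print(unique)
```

Nobody asked the query to find case c1.  It is simply the case for which the sampled data happened to leave one plate and one absentee record, while c2 stays ambiguous because two plates fit and no absentee record does.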

Higher level studies will likely have vastly more data dimensions to work with, including personal data from public records.  With this broader data, the explicit identification can occur directly in the query results, or be obvious enough for an analyst to mentally piece the information together.

Again, I emphasize there is no intent to match either a particular name to the medical record or a medical record to a particular name.  Also because the supplemental data is an incomplete sample, only a small subset of query results will be so precise and their combinations are indiscriminate.  The re-identification is an accidental query result.    However, this accidental re-identification will be available to everyone who has access to the tool that performs the query.

The tool may be available to a broad population of analysts some of whom may publish their analysis to a broader community.  Often these published results are in the form of recurring reports involving the same set of queries using new datasets.  Scripted queries may automatically generate the embedded tables and charts for these reports.  There may be some script to produce the entire report leaving the analyst with only the task to review the report before approving it for distribution.

There is still a protection here in the human analyst, who can recognize that the automatically populated data is inappropriate for distribution.  The problem is that these are very routine reports and the scripts are so mature that the reports may receive very little attention.  The process relies on a human to scrub the PII from the automated report.  Unfortunately, humans vary in how reliably they apply that diligence.

Accidentally re-identified data, once released, probably will not be discovered until much later, perhaps even too late for remedial action.

The above scenario is a fable I invented to illustrate how automated analytics of multi-dimensional data may re-identify previously de-identified data.  Again, I emphasize I have no background in medical clinical data or its practices.  Perhaps they do have more stringent policies that prevent this scenario.  However, I strongly suspect this scenario is possible and even very common.

The enthusiasm for the benefits of big data comes from widely promoted reports of past successes.  The promise of big data techniques is that they can provide similar successes in other contexts.  Big data involves volume, velocity, and variety.  The volume and velocity depend on automated queries and report building.  The variety introduces the opportunity for new benefits.  The combination of automation and opportunity from variety is what makes re-identification possible or even very likely.

The most exciting benefit of big data is its ability to identify new hypotheses, previously unknown explanations for the data.  One possible new hypothesis is an identification of a medical record.

The volume and velocity of the data result in volume and frequency of new reports.  New, very extensive reports will appear too quickly for analysts to study thoroughly by hand.  At the same time, the value of the new hypotheses, in the form of trends or recommendations, perishes quickly, so there is a need to distribute the results quickly.  Big data technologies inevitably exert pressure to automate the decision maker.  Big-data-inspired processes marginalize or eliminate human participation in the analysis process.

In my above scenario, it should be easy to add a filter to the output of each new analytic query to redact the personal data.  There are two problems with this expectation.  The first problem is that the potential for revealing this information needs to be recognized.  Usually this happens only after the information has already been revealed.  The second problem is that sometimes information that appears clean to an algorithm can be sufficient for a human to perform the final synthesis to complete the re-identification.
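A toy version of such an output filter shows both problems at once.  The field list in `DIRECT_IDENTIFIERS` is an assumption standing in for whatever the team has already recognized as sensitive, and the rows are invented.

```python
from collections import Counter

# Assumed list of fields already recognized as sensitive (hypothetical).
DIRECT_IDENTIFIERS = {"name", "license_plate"}


def redact(row):
    """Drop the fields the filter knows about; everything else passes through."""
    return {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}


# Hypothetical query output before filtering.
query_output = [
    {"name": "n1", "license_plate": "P-111", "employer": "E1", "age_band": "50-59", "drug": "rare"},
    {"name": "n2", "license_plate": "P-222", "employer": "E1", "age_band": "50-59", "drug": "common"},
    {"name": "n3", "license_plate": "P-333", "employer": "E1", "age_band": "50-59", "drug": "common"},
]

released = [redact(r) for r in query_output]

# Rows that remain one of a kind after redaction.
combos = Counter(tuple(sorted(r.items())) for r in released)
still_unique = [dict(key) for key, n in combos.items() if n == 1]
print(still_unique)
```

The filter does what it was told: no name or plate survives.  The row with the unusual prescription is still one of a kind, and a reader who knows that employer may need nothing more to finish the identification.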

With big data processing, re-identification of de-identified data appears to me to be inevitable.

One characteristic of big data is the push to move detailed data closer to the higher level decision makers or policy makers.

In commercial enterprises, big data is available to executives (the C-suite), who have immediate access to automated dashboards showing key performance indicators (KPIs) of their business.  These KPI dashboards use automated queries that have access to extensive detailed information about the business.  At first glance, the KPIs appear necessarily to be very highly summarized.  The indicators are stoplight symbols (green means all is well), trend arrows (upward arrow means things are getting better), etc.  However, these dashboards are complex infographics that include a large variety of indicators.  The dashboard has a consistent layout to allow the executive to quickly spot changes from what he remembered in the previous report.

Although the dashboard or infographic lacks detailed information, the executive consumer generally has extensive in-depth knowledge of his business.  The executive will approach each new iteration of this standard dashboard with specific expectations.  He will demand explanations for any indicator that fails to match his expectation.   The acceptable explanation must be a story that makes sense in context of the actual operations of his business.   The executive will have access to a dedicated team of data-scientists who can operate the automation or run ad-hoc reports to provide detailed information to create a plausible story to explain the unexpected indicator.

Perhaps the KPI exposed the fact that monthly sales missed their target.  The detailed story presented to the executive may be:

On the first Friday of the month, a recently hired IT member named X misapplied a patch to the shipping department’s server, causing a 4-hour outage that prevented shipments, including packages for customer Y, whose CIO writes a blog in a top industry periodical Z, where one article disparaged our company, causing a delay in closing a deal with customer W.

The data available to the executive’s data science team will permit them to piece this story together despite the fact that the individual departments provided only sanitized data to the executive team.  The time of the outage is available in the sanitized IT service ticket data.  The shipping department reported a delay due to a server outage.  The sales department reported a recent sale to a customer with an expected delivery that required shipping that day.  The HR department provided data about new hires and their new departments.  The benefits department reported the enrollment rosters for retraining.  The marketing department provided a report on recent publicity about the business.  And so forth.  Perhaps prior to the presentation of this story to the executive, one of the data scientists finds confirming information about recent work history and skills in X’s LinkedIn profile.

A diligent executive expects stories that are similarly explicit.  He wants to know what caused the unexpected results and he wants to know that the problem has been discovered and addressed.  The data science team will make every effort to get as close to this level of detail as they can.  Big data tools and the breadth of available data make this possible.

The value of this specificity in explanation will motivate the executive to continue to finance the data scientists to continue to improve their capabilities with more extensive data and more powerful big data tools.  The executive will also support his team in negotiations with individual departments concerning how to sanitize their data better to provide even more details the executive needs to make decisions.

In the above case, the individual departments felt comfortable protecting what they felt should be close-held.  The power of the big data tools was able to piece together the story.  The final story emerges in the same way that a skilled journalist or criminal investigator can piece together a story from a variety of clues: the last step is the human skill of storytelling.


5 thoughts on “Big data can re-identify de-identified data”

  1. I wanted to add to the above discussion what I call the power of the table-join. The power comes from the ability to eliminate all of the non-candidates for matching a particular record. Even non-key joins (joins on arbitrary relationships) are very efficient at eliminating the bulk of the candidates that do not share the correct attributes.

    Many re-identification scenarios suggest some malicious actor who is trying to read from the data the information necessary to identify the individual. Implicit in this scenario is an assumption that the universe of possible answers is infinite. The possibilities are infinite for practical purposes when done manually with individual queries. However, this is not a problem for modern database technologies.

    There are only 7.2 billion people on this planet. This is a small number for a database query. A simple join based on age, sex, and ethnicity reduces the candidate population to just a few thousand. It does not take much to quickly shrink this smaller candidate pool by eliminating the people with data trails during the treatment period that would be inconsistent with the treatment schedule, such as trails tying them to certain locations or activities that would be unlikely for someone undergoing treatment such as chemotherapy.

    In the scenario I propose, where there is an analytic query that runs against entire collections of possibilities, there are bound to be some categories where all but one of the potential candidates are eliminated by the join condition. The result is a re-identification by a process of elimination, and the join (or map-reduce script) process is very efficient at this elimination.

    My scenario of a query result that returns unique results for certain categories is unlike the feared malicious attacker trying to undo the de-identification of a particular medical record. In the query result, which records end up re-identified in uniquely populated categories will be unexpected.

    Also a particular data project may offer a large catalog of queries where each one has the potential of exposing different groups of uniquely populated categories.

    These query results may not satisfy the goals of the malicious attacker seeking data on a specific person or condition but they will exist. Such disclosures could trigger costly penalties or reporting burdens even in the absence of any attempt to abuse that information.

  2. A recently circulating math puzzle gives an example that reminds me of a multidimensional scenario of discovering an identity when two studies are working from de-identified data. In particular, the identifying information is a specific birth date but the two studies are given different de-identified versions: one knows the month but not the day, the other knows the day but not the month.

    The article explains that a key assumption in the solution is that the two participants are answering honestly about not knowing the exact date. In a corresponding data analytic case, this would be to assume the two analyses are accurate and not biased.

    More relevant to my post is the fact that they have access to the entire population of possible choices. In big data problems involving human populations, it is reasonable to know the entire population of all living humans. Even though that number is 7.2 billion, it is not infinite. A few contrasting studies involving a few dimensions will mimic the mentioned puzzle, allowing for discovering private information about some subset.

    The discussion of the puzzle’s solution illustrates the process of elimination of non-qualifying choices until there is only one option remaining. The word problem statement challenges the human mind to work out the set elimination steps. However, if this were stated as a data analytic problem, table joins would eliminate the non-qualified choices quickly.

    Again, the risk is not that any person’s birthday may be discovered using this puzzle solution approach. There may be enough redundancy in a population to eliminate this possibility in this 2-dimensional problem. However, the risk remains that some sub-population may be identified due to the lack of redundancy for their combination of measures.

  3. Pingback: Economy of compensated opinions in a dedomenocracy | kenneumeister

  4. This article provides a brief description of the partnership to combine individual longitudinal monitoring data from Apple devices with a large-scale population analytics platform based on IBM’s Watson technology. The article relays the unchallenged confidence in protecting the privacy of this detailed data that will be available to researchers:

    As IBM receives Apple’s data, it will de-identify and store it in a secure and scalable cloud system. Researchers, doctors, and other health professionals will be able to view and share the data, as well as access data-mining and predictive analytics capabilities.

    But then this platform will eventually integrate with other data tools and data sources:

    Apps that run through HealthKit and ResearchKit will be connected to Health Cloud through a delivery platform. This process will facilitate easy data storage, aggregation, and modeling. It will also have the ability to combine information with other research.

    Health conditions and treatments have widespread impacts on behaviors of patients and people they interact with. These behaviors can be monitored outside of the healthcare context. Their data can be combined with Watson’s confidently de-identified data with the express intent of identifying individualized care options. Patient identification with their data is inevitable as the data becomes more readily available.

    Here is another article that mentions the Apple + Watson partnership with a decidedly personalized benefit that can only come from identified data:

    MD: What’s really interesting about our partnership with Watson is being able to take all this unstructured data from a consumer’s life and bring it back to them in a way, with tasks and reward and opportunities, to help them make healthier choices. That could be location services, marrying that with health information to make recommendations. So when you get off of a plane, Watson can recommend a gym and where to eat lunch. To bring together all that unstructured data, and present it in a way that’s meaningful, that really has the opportunity to be groundbreaking.

    That type of analytics is only possible if analytics has access to both private and public data at the same time and with conforming keys (personal identifiers). This seems an admission of the potential for Watson to pirate personal data for its own profit.

  5. Pingback: Wearable health technologies, such as fitness trackers, can compromise HIPAA data | kenneumeister
