What Information is "Personally Identifiable"?

Mr. X lives in ZIP code 02138 and was born July 31, 1945.

These facts about him were included in an anonymized medical record released to the public. Sounds like Mr. X is pretty anonymous, right?

Not if you're Latanya Sweeney, a Carnegie Mellon University computer science professor who showed in 1997 that this information was enough to pin down Mr. X's more familiar identity -- William Weld, the governor of Massachusetts from 1991 to 1997.

Gender, ZIP code, and birth date feel anonymous, but Prof. Sweeney was able to identify Governor Weld through them for two reasons. First, each of these facts about an individual (or other kinds of facts we might not usually think of as identifying) independently narrows down the population, so much so that the combination of (gender, ZIP code, birth date) was unique for about 87% of the U.S. population. If you live in the United States, there's an 87% chance that you don't share all three of these attributes with any other U.S. resident. Second, there may be particular data sources available (Sweeney used a Massachusetts voter registration database) that let people do searches to bootstrap what they know about someone in order to learn more -- including traditional identifiers like name and address. In a very concrete sense, "anonymized" or "merely demographic" information about people may be neither. (And a web site that asks "anonymous" users for seemingly trivial information about themselves may be able to use that information to make a unique profile for an individual, or even look up that individual in other databases.)
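To make the two steps concrete, here is a minimal sketch in Python. It is a toy illustration, not Prof. Sweeney's actual method or data: the records, the voter names, and the helper functions `unique_fraction` and `link` are all hypothetical. It first measures how many "anonymized" records are unique on (gender, ZIP code, birth date), then joins those records against a named list, the way a voter registration database could be used.

```python
from collections import Counter

# Hypothetical "anonymized" medical records: names removed, demographics kept.
medical_records = [
    {"gender": "M", "zip": "02138", "birth_date": "1945-07-31", "diagnosis": "A"},
    {"gender": "F", "zip": "02138", "birth_date": "1962-03-14", "diagnosis": "B"},
    {"gender": "F", "zip": "02139", "birth_date": "1962-03-14", "diagnosis": "C"},
    {"gender": "M", "zip": "02139", "birth_date": "1980-11-02", "diagnosis": "D"},
    {"gender": "M", "zip": "02139", "birth_date": "1980-11-02", "diagnosis": "E"},
]

# Hypothetical public list that pairs the same demographics with names.
voter_roll = [
    {"name": "Voter One", "gender": "M", "zip": "02138", "birth_date": "1945-07-31"},
    {"name": "Voter Two", "gender": "F", "zip": "02138", "birth_date": "1962-03-14"},
    {"name": "Voter Three", "gender": "M", "zip": "02139", "birth_date": "1980-11-02"},
    {"name": "Voter Four", "gender": "M", "zip": "02139", "birth_date": "1980-11-02"},
]

QUASI_IDENTIFIERS = ("gender", "zip", "birth_date")


def key(record):
    """Return the combination of demographic facts for one record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def unique_fraction(records):
    """Fraction of records whose demographic combination appears only once."""
    counts = Counter(key(r) for r in records)
    return sum(1 for r in records if counts[key(r)] == 1) / len(records)


def link(records, roll):
    """Pair each record with a name when exactly one listed person matches it."""
    roll_counts = Counter(key(v) for v in roll)
    names = {key(v): v["name"] for v in roll if roll_counts[key(v)] == 1}
    return [(names[key(r)], r["diagnosis"]) for r in records if key(r) in names]


fraction = unique_fraction(medical_records)
matches = link(medical_records, voter_roll)
print(f"{fraction:.0%} of records are unique on {QUASI_IDENTIFIERS}")
for name, diagnosis in matches:
    print(f"{name} -> diagnosis {diagnosis}")
```

In this toy data, the two records that share all three attributes with another listed voter stay ambiguous, and the record with no listed match stays unnamed. Every record that matches exactly one voter gets a name attached.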

Many contemporary privacy rules and debates center on the notion of "personally identifiable information" (PII). The PII concept is used by several legal regimes and many organizations' privacy policies; generally, information that identifies a particular person is considered much more sensitive than information that does not. For instance,

Federal telecommunications privacy laws use "individually identifiable information" (about a subscriber) as a basis for the category of protected information called Customer Proprietary Network Information (CPNI);
Federal health privacy regulations use "individually identifiable health information" (about a patient) as a basis for the category called Protected Health Information (PHI);
Federal financial privacy laws, the EU Data Protection Directive, and state privacy laws all employ similar terms and concepts;

and, in each case, facts deemed "personally identifiable" or "individually identifiable" may receive dramatically higher protections under these laws and regulations.

But research by Prof. Sweeney and other experts has demonstrated that surprisingly many facts, including those that seem quite innocuous, neutral, or "common", could potentially identify an individual. Privacy law, mainly clinging to a traditional intuitive notion of identifiability, has largely not kept up with the technical reality.

A recent paper by Paul Ohm, "Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization", provides a thorough introduction and a useful perspective on this issue. Prof. Ohm's paper is important reading for anyone interested in personal privacy, because it shows how deanonymization results achieved by researchers like Latanya Sweeney and Arvind Narayanan seriously undermine traditional privacy assumptions. In particular, the binary distinction between "personally-identifiable information" and "non-personally-identifiable information" is increasingly difficult to sustain. Our intuition that certain information is "anonymous" is often wrong. Given the proper circumstances and insight, almost any kind of information might tend to identify an individual; information about people is more identifying than has been assumed, and in the long run the whole enterprise of classifying facts as "PII" or "not PII" is questionable.

Statistical inference and clever use of databases have resulted in impressive examples of deanonymization of supposedly anonymous data, the kinds of data that most organizations have not regarded as PII. Apart from combinations of demographic data, some of the sorts of things that may well uniquely identify you include your search terms; your purchase habits; your preferences or opinions about music, books, or movies; and even the structure of your social networks -- in a purely abstract sense, even when shorn of the identities of your friends and contacts. Deanonymization is effective, and it's dramatically easier than our intuitions suggest. Given the number of variables that potentially distinguish us, we are much more different from each other than we expect, and there are more sources of data than we realize that may be used to narrow down exactly who a particular record refers to.
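The same idea applies to preference data. The sketch below is a deliberately simplified illustration, not the algorithm from any published paper: the pseudonyms, film titles, the `best_match` function, and its `margin` parameter are all hypothetical. It assumes an attacker knows a handful of someone's movie ratings and looks for the one pseudonymous record that agrees with them clearly better than any other.

```python
# Hypothetical released data: pseudonym -> {film: rating}.
released = {
    "user_1001": {"Film A": 5, "Film B": 2, "Film C": 4, "Film D": 1},
    "user_1002": {"Film A": 5, "Film E": 3, "Film F": 4},
    "user_1003": {"Film B": 2, "Film G": 5},
}


def best_match(known, released, margin=2):
    """Return the pseudonym that agrees with the known ratings clearly
    better than the runner-up, or None if no candidate stands out."""
    # Score each pseudonymous record by how many known ratings it reproduces.
    scores = {
        pseudonym: sum(1 for film, rating in known.items()
                       if ratings.get(film) == rating)
        for pseudonym, ratings in released.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    best_pseudonym, best_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0
    # Only claim a match when the lead over the runner-up is large enough.
    if best_score - runner_up_score >= margin:
        return best_pseudonym
    return None


# Three ratings the attacker happens to know about the target.
strong_knowledge = {"Film A": 5, "Film C": 4, "Film D": 1}
# A single rating for a popular film is shared by several records.
weak_knowledge = {"Film A": 5}

print(best_match(strong_knowledge, released))
print(best_match(weak_knowledge, released))
```

One rating of a popular film leaves the target hidden in a crowd, while three ratings already single out one record in this toy data. Requiring a margin over the runner-up is one simple way to avoid claiming a match when the evidence is ambiguous.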

Many of these papers were meant as proofs of concept: they show that people can potentially be re-identified by these kinds of data, not that everyone will be. Not everyone's medical records were as easy to put a name to as Governor Weld's. And Narayanan and Shmatikov's research definitively identified only two Netflix users from their movie ratings -- not every user whose ratings were published by Netflix. Still, many of these research results deliberately do not use all the data available about individuals because their goal is to show the effectiveness of mathematical techniques, not to violate individuals' privacy. Real-world attacks will use many more kinds of available information simultaneously to narrow in on people's identities. As Bruce Schneier has observed, such attacks only get better over time; they never get worse.

Ohm argues that it's more appropriate to think of identifiability as a continuum. The notion of "anonymized" or "sanitized" data is then problematic; researchers habitually share, or even publish, data sets that replace individuals' names with code numbers but leave the rest of each record intact. There have already been conspicuous problems with this practice, as when AOL published "anonymized" search logs that turned out to identify some individuals from the content of their search terms alone.
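One way to picture the continuum is to count how many records a person could be confused with as more attributes become known. The following sketch uses hypothetical coded records and a hypothetical helper, `anonymity_set_size`; it is an illustration of the idea, not a measure proposed in Ohm's paper.

```python
# Hypothetical data set in which names were replaced by code numbers.
records = [
    {"code": 1, "gender": "F", "zip": "02138", "birth_year": 1970},
    {"code": 2, "gender": "F", "zip": "02138", "birth_year": 1985},
    {"code": 3, "gender": "F", "zip": "02139", "birth_year": 1970},
    {"code": 4, "gender": "M", "zip": "02138", "birth_year": 1970},
    {"code": 5, "gender": "F", "zip": "02138", "birth_year": 1991},
    {"code": 6, "gender": "F", "zip": "02139", "birth_year": 1985},
]


def anonymity_set_size(records, target, fields):
    """Count the records that match the target on every listed field."""
    return sum(1 for r in records if all(r[f] == target[f] for f in fields))


target = records[0]
fields = ("gender", "zip", "birth_year")

# Reveal one more attribute at each step and watch the crowd shrink.
sizes = [anonymity_set_size(records, target, fields[:n])
         for n in range(len(fields) + 1)]

for n, size in enumerate(sizes):
    known = ", ".join(fields[:n]) or "nothing"
    print(f"knowing {known}: {size} candidate record(s)")
```

The code number does nothing to slow this down. Each additional fact moves the record further along the continuum, from one of six candidates to exactly one.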

We hope "Broken Promises of Privacy" encourages people who work with personal data to think more critically about their retention and sharing practices and the effectiveness of the anonymization or pseudonymization techniques they're using. We also hope it finds a broad audience and helps start a wider discussion among researchers, technologists, and lawyers about what "privacy protection" should mean in the era of deanonymization.
