Debunking the Myth of “Anonymous” Data

Today, almost everything about our lives is digitally recorded and stored somewhere. Each credit card purchase, personal medical diagnosis, and preference about music and books is recorded and then used to predict what we like and dislike, and—ultimately—who we are.

This often happens without our knowledge or consent. Personal information that corporations collect from our online behaviors sells for astonishing profits and incentivizes online actors to collect as much as possible. Every mouse click and screen swipe can be tracked and then sold to ad-tech companies and the data brokers that service them.

In an attempt to justify this pervasive surveillance ecosystem, corporations often claim to de-identify our data. This supposedly removes all personal information (such as a person’s name) from the data point (such as the fact that an unnamed person bought a particular medicine at a particular time and place). Personal data can also be aggregated, whereby data about multiple people is combined with the intention of removing personal identifying information and thereby protecting user privacy.

Sometimes companies say our personal data is “anonymized,” implying a one-way ratchet where it can never be dis-aggregated and re-identified. But that guarantee does not hold in practice: anonymous data rarely stays that way. As Professor Matt Blaze, an expert in the field of cryptography and data privacy, succinctly summarized: “something that seems anonymous, more often than not, is not anonymous, even if it’s designed with the best intentions.”

Anonymization…and Re-Identification?

Personal data can be considered on a spectrum of identifiability. At the top is data that can directly identify people, such as a name or state identity number, which can be referred to as “direct identifiers.” Next is information indirectly linked to individuals, like personal phone numbers and email addresses, which some call “indirect identifiers.” After this comes data connected to multiple people, such as a favorite restaurant or movie. The other end of this spectrum is information that cannot be linked to any specific person—such as aggregated census data, and data that is not directly related to individuals at all like weather reports.

Data anonymization is often undertaken in two ways. First, some personal identifiers like our names and social security numbers might be deleted. Second, other categories of personal information might be modified, such as by obscuring our bank account numbers. For example, the Safe Harbor provision contained in the U.S. Health Insurance Portability and Accountability Act (HIPAA) allows only the first three digits of a ZIP code to be reported in scrubbed data.
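As a rough illustration of those two steps, the Python sketch below deletes direct identifiers and coarsens two other fields. The record layout, the field names, and the `scrub` helper are all hypothetical, and the sketch is not a complete implementation of any legal standard.

```python
# Hypothetical record layout; every field name and value here is invented.
DIRECT_IDENTIFIERS = {"name", "ssn"}

def scrub(record):
    """Return a copy with direct identifiers deleted and other fields coarsened."""
    # Step 1: delete direct identifiers outright.
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Step 2: modify other fields so they are less specific.
    if "zip" in out:
        out["zip"] = out["zip"][:3]  # keep only the first three digits
    if "account" in out:
        acct = out["account"]
        out["account"] = "*" * (len(acct) - 4) + acct[-4:]  # mask all but the last four
    return out

record = {
    "name": "Alice Example",
    "ssn": "000-00-0000",
    "zip": "94110",
    "account": "1234567890",
    "birth_date": "1980-02-29",
    "gender": "F",
}
scrubbed = scrub(record)
print(scrubbed)
```

Note that the scrubbed record still carries a birth date, a gender, and a partial ZIP code, none of which is a direct identifier on its own.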

However, in practice, any attempt at de-identification requires removal not only of your identifiable information, but also of information that can identify you when considered in combination with other information known about you. Here's an example:

First, think about the number of people that share your specific ZIP or postal code.
Next, think about how many of those people also share your birthday.
Now, think about how many people share your exact birthday, ZIP code, and gender.

According to one landmark study, these three characteristics are enough to uniquely identify 87% of the U.S. population. A different study showed that 63% of the U.S. population can be uniquely identified from these three facts.
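The thought experiment above can be sketched in a few lines of Python. The six-person data set is invented purely for illustration and the `unique_fraction` helper is hypothetical; the point is only to show how the share of uniquely identifiable people grows as attributes are combined, not to reproduce the figures from either study.

```python
from collections import Counter

# Invented toy population: (ZIP code, birth date, gender).
people = [
    ("94110", "1980-02-29", "F"),
    ("94110", "1980-02-29", "M"),
    ("94110", "1975-07-04", "F"),
    ("10001", "1990-12-01", "M"),
    ("10001", "1990-12-01", "M"),
    ("10001", "1962-03-15", "F"),
]

def unique_fraction(rows, columns):
    """Share of rows whose values in `columns` match no other row."""
    keys = [tuple(row[i] for i in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)

zip_only = unique_fraction(people, [0])         # ZIP code alone
zip_dob = unique_fraction(people, [0, 1])       # ZIP code + birth date
all_three = unique_fraction(people, [0, 1, 2])  # ZIP code + birth date + gender

print(zip_only, zip_dob, all_three)
```

In this toy data nobody is unique by ZIP code alone, a third of the people are unique once birth date is added, and two thirds are unique once gender is added as well.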

We cannot trust corporations to self-regulate. The financial benefit and business usefulness of our personal data often outweigh our privacy and anonymity. By re-obtaining the real identity of the person involved (direct identifier) alongside a person’s preferences (indirect identifier), corporations are able to continue profiting from our most sensitive information. For instance, a website that asks supposedly “anonymous” users for seemingly trivial information about themselves may be able to use that information to make a unique profile for an individual.

Location Surveillance

To understand this system in practice, we can look at location data. This includes the data collected by apps on your mobile device about your whereabouts: from the weekly trips to your local supermarket to your last appointment at a health center, an immigration clinic, or a protest planning meeting. The location data collected on our devices is sufficiently precise for law enforcement to place suspects at the scene of a crime, and for juries to convict people on the basis of that evidence. What’s more, whatever personal data is collected by the government can be misused by its employees, stolen by criminals or foreign governments, and used in unpredictable ways by agency leaders for nefarious new purposes. And all too often, such high-tech surveillance disparately burdens people of color.

Practically speaking, there is no way to de-identify individual location data since these data points serve as unique personal identifiers of their own. And even when location data is said to have been anonymized, re-identification can be achieved by correlating de-identified data with other publicly available data like voter rolls or information that's sold by data brokers. One study from 2013 found that researchers could uniquely identify 50% of people using only two randomly chosen time and location data points.
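One way such a correlation could work is sketched below in Python. Everything here is an assumption made for illustration: the pings, the pseudonymous device IDs, the stand-in "public roll," and the heuristic that a device's most common nighttime location is its owner's home. It is not a description of any particular broker's or researcher's method.

```python
from collections import Counter

# "De-identified" pings: (pseudonymous device id, hour of day, coarse location).
pings = [
    ("device-1", 2, "block-A"), ("device-1", 3, "block-A"), ("device-1", 14, "clinic"),
    ("device-2", 1, "block-B"), ("device-2", 23, "block-B"), ("device-2", 12, "supermarket"),
]

# Stand-in for a public record listing names and home areas.
public_roll = [
    ("Alice Example", "block-A"),
    ("Bob Example", "block-B"),
    ("Carol Example", "block-B"),
]

def likely_home(pings):
    """Guess each device's home as its most frequent nighttime location."""
    night = {}
    for device, hour, place in pings:
        if hour >= 22 or hour < 6:
            night.setdefault(device, Counter())[place] += 1
    return {device: counts.most_common(1)[0][0] for device, counts in night.items()}

def link(pings, roll):
    """Attach a name to a device when exactly one listed person lives at its guessed home."""
    residents = {}
    for name, home in roll:
        residents.setdefault(home, []).append(name)
    linked = {}
    for device, home in likely_home(pings).items():
        names = residents.get(home, [])
        if len(names) == 1:
            linked[device] = names[0]
    return linked

linked = link(pings, public_roll)
print(linked)
```

Here `device-1` is linked to a single name, so its daytime visit to the clinic is no longer anonymous. `device-2` stays ambiguous only because two listed people share its home area.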

Done right, aggregating location data can work towards preserving our personal rights to privacy by producing non-individualized counts of behaviors instead of detailed timelines of individual location history. For instance, an aggregation might tell you how many people’s phones reported their location as being in a certain city within the last month, but not the exact phone number and other data points that would connect this directly and personally to you. However, there’s often pressure on the experts doing the aggregation to generate granular aggregate data sets that might be more meaningful to a particular decision-maker but which simultaneously expose individuals to an erosion of their personal privacy.
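A minimal Python sketch of this kind of aggregation follows. It counts distinct devices per area and drops any area whose count falls below a threshold. The data, the `count_devices` helper, and the `min_count` threshold are assumptions for illustration, and suppressing small counts is a common precaution rather than a guarantee of privacy.

```python
from collections import defaultdict

# Invented pings: (device id, area where the device reported its location).
pings = [
    ("phone-1", "Springfield"),
    ("phone-2", "Springfield"),
    ("phone-3", "Springfield"),
    ("phone-1", "Springfield"),  # repeat ping from the same device
    ("phone-4", "Shelbyville"),
]

def count_devices(pings, min_count=3):
    """Count distinct devices per area, dropping areas with fewer than min_count."""
    devices = defaultdict(set)
    for device, area in pings:
        devices[area].add(device)
    return {area: len(ids) for area, ids in devices.items() if len(ids) >= min_count}

coarse = count_devices(pings)                 # small cells suppressed
granular = count_devices(pings, min_count=1)  # every cell published

print(coarse)
print(granular)
```

The granular output includes an area with a single device, which is exactly the kind of cell that can point back to one person.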

Moreover, most third-party location tracking is designed to build profiles of real people. This means that every time a tracker collects a piece of information, it needs something to tie that information to a particular person. This can happen indirectly by correlating collected data with a particular device or browser, which might later correlate to one person or a group of people, such as a household. Trackers can also use artificial identifiers, like mobile ad IDs and cookies, to reach users with targeted messaging. And “anonymous” profiles of personal information can nearly always be linked back to real people, including where they live, what they read, and what they buy.
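To make the role of an artificial identifier concrete, here is a small Python sketch in which events from unrelated sources are merged into one profile because they share an ad ID. The events, field names, and `build_profiles` helper are hypothetical and do not reflect any real tracker's data format.

```python
from collections import defaultdict

# Invented events, each tagged with the ad ID of the device that produced it.
events = [
    {"ad_id": "ad-123", "source": "weather app", "fact": "home area: block-A"},
    {"ad_id": "ad-123", "source": "news site", "fact": "reads: local politics"},
    {"ad_id": "ad-456", "source": "shopping app", "fact": "bought: running shoes"},
    {"ad_id": "ad-123", "source": "shopping app", "fact": "bought: medicine A"},
]

def build_profiles(events):
    """Group facts by the shared identifier, regardless of which source reported them."""
    profiles = defaultdict(list)
    for event in events:
        profiles[event["ad_id"]].append(event["fact"])
    return dict(profiles)

profiles = build_profiles(events)
print(profiles["ad-123"])
```

No name appears anywhere in the data, yet the profile for `ad-123` already combines where its owner lives, what they read, and what they buy.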

For data brokers dealing in our personal information, our data can either be useful for their profit-making or truly anonymous, but not both. EFF has long opposed location surveillance programs that can turn our lives into open books for scrutiny by police, surveillance-based advertisers, identity thieves, and stalkers. We’ve also long blown the whistle on phony anonymization.

As a matter of public policy, it is critical that user privacy is not sacrificed in favor of filling the pockets of corporations. And for any data sharing plan, consent is critical: did each person consent to the method of data collection, and did they consent to the particular use? Consent must be specific, informed, opt-in, and voluntary.

Related Issues

Privacy

Locational Privacy
