Understanding privacy risk with k-anonymity and l-diversity

Imagine you’re a data analyst at a global company who’s been asked to provide employee statistics for a survey on remote working and distributed teams. You’ve extracted the relevant employee data, but sharing it as-is could violate privacy laws. How can you anonymize the data while ensuring it’s still useful? In this article, you’ll learn about k-anonymity and l-diversity—two valuable techniques in privacy engineering to help you reduce the privacy risk in datasets.

Before you continue reading
Data anonymization is a complex topic that’s difficult to accomplish in practice. This article aims to give you a basic understanding of two commonly used data anonymization techniques. Using these techniques, however, does not guarantee compliance with regulations such as GDPR.

First, let’s look at the data you’ve extracted from your internal HR system:

| full_name | email | country | tenure_years | department |
| --- | --- | --- | --- | --- |
| John Smith | jsmith@… | USA | 5 | Sales |
| Maria Garcia | mgarcia@… | USA | 3 | Marketing |
| Yuki Tanaka | ytanaka@… | Japan | 7 | Engineering |
| Hans Mueller | hmueller@… | Germany | 2 | Finance |
| Sarah Johnson | sjohnson@… | UK | 5 | HR |
| Pierre Dubois | pdubois@… | UK | 3 | Sales |
| Li Wei | lwei@… | China | 7 | Engineering |
| Anna Kowalski | akowalski@… | USA | 2 | Marketing |
| Eva Schmidt | eschmidt@… | Germany | 5 | Finance |
| Priya Patel | ppatel@… | UK | 3 | HR |

The data identifies individual employees by name and email, so sharing it with a third party may violate privacy laws.

A first attempt at anonymization

Fortunately, the survey partner doesn’t need this level of specificity. Let’s start by removing all fields that directly identify individual employees, such as full_name and email.

Next, we’ll attempt to further de-identify the individuals by aggregating individual rows into groups.
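
If you're following along in Python, here's a minimal pandas sketch of both steps. The column names come from the table above; the 0-5 and 6-10 tenure buckets are one possible choice of generalization, not the only one:

```python
import pandas as pd

# Row-level HR extract (emails elided, as in the table above)
df = pd.DataFrame({
    "full_name": ["John Smith", "Maria Garcia", "Yuki Tanaka", "Hans Mueller",
                  "Sarah Johnson", "Pierre Dubois", "Li Wei", "Anna Kowalski",
                  "Eva Schmidt", "Priya Patel"],
    "email": ["jsmith@…", "mgarcia@…", "ytanaka@…", "hmueller@…", "sjohnson@…",
              "pdubois@…", "lwei@…", "akowalski@…", "eschmidt@…", "ppatel@…"],
    "country": ["USA", "USA", "Japan", "Germany", "UK", "UK", "China", "USA",
                "Germany", "UK"],
    "tenure_years": [5, 3, 7, 2, 5, 3, 7, 2, 5, 3],
    "department": ["Sales", "Marketing", "Engineering", "Finance", "HR", "Sales",
                   "Engineering", "Marketing", "Finance", "HR"],
})

# Step 1: drop the direct identifiers
df = df.drop(columns=["full_name", "email"])

# Step 2: generalize tenure into ranges, then aggregate rows into groups
df["tenure_years"] = pd.cut(df["tenure_years"], bins=[0, 5, 10],
                            labels=["0-5", "6-10"]).astype(str)
groups = (
    df.groupby(["country", "tenure_years"])
      .agg(departments=("department", lambda d: ", ".join(sorted(set(d)))),
           count=("department", "size"))
      .reset_index()
)
print(groups)
```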

| country | tenure_years | departments | count |
| --- | --- | --- | --- |
| USA | 0-5 | Sales, Marketing | 3 |
| Japan | 6-10 | Engineering | 1 |
| Germany | 0-5 | Finance | 2 |
| UK | 0-5 | HR, Sales | 3 |
| China | 6-10 | Engineering | 1 |

There, we’ve removed the names and emails—but is the new dataset truly anonymous? Assume that an attacker knows that Yuki Tanaka lives in Japan. With this dataset, they could infer that Yuki works in Engineering and roughly how long he’s been with the company. Even when a dataset doesn’t directly identify individuals, they may still be identifiable.
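
To make the linkage concrete, here's what that background-knowledge attack looks like against the grouped data from the sketch above:

```python
# Background knowledge: Yuki Tanaka lives in Japan
print(groups[groups["country"] == "Japan"])
# Only one group matches (Japan, 6-10, Engineering, count 1),
# revealing Yuki's department and approximate tenure
```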

Quasi-identifiers
Attributes that don’t uniquely identify an individual on their own but may do so when combined with other attributes are called quasi-identifiers. In our example, attributes like country, tenure, and department are quasi-identifiers. They don’t directly identify a person, but when combined, they might narrow down the possibilities enough to uniquely identify someone.

It’s clear that our first attempt isn’t enough to reasonably protect the privacy of the employees. Next, we’ll see how we can further reduce the privacy risk with k-anonymity.

K-anonymity

K-anonymity is a data anonymization technique ensuring that for each combination of quasi-identifying attributes (such as country and tenure), there are at least k rows that share those exact values.

The k is a number we choose; data that satisfies this property for k=2, for example, is said to be 2-anonymous. A higher k value provides more privacy.
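
Checking this property is straightforward; here's a sketch, reusing the row-level DataFrame from earlier (the helper name is ours, not a standard API):

```python
def is_k_anonymous(df, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    return bool((df.groupby(quasi_identifiers).size() >= k).all())

# Our first attempt fails: the Japan and China groups each contain a single row
print(is_k_anonymous(df, ["country", "tenure_years"], k=2))  # False
```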

Example with k=2

By setting k=2, we’re saying that each combination of quasi-identifiers must appear at least twice in the dataset.

Here’s how our data looks after applying k=2 anonymity:

| country | tenure_years | departments | count |
| --- | --- | --- | --- |
| USA | 0-5 | Sales, Marketing | 3 |
| UK | 0-5 | HR, Sales | 3 |
| Germany | 0-5 | Finance | 2 |
| Other | 6-10 | Engineering | 2 |
For k = 2, each group contains at least two employees.

Notice how Japan and China are now grouped as “Other”. Since each of them has only one employee, we had to combine them so that every combination has a count of 2 or more.
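
In code, one simple (if blunt) way to get there is to generalize the country of any undersized group, assuming the df and helper from the earlier sketches:

```python
# Fold countries whose (country, tenure) group is too small into "Other"
group_sizes = df.groupby(["country", "tenure_years"])["department"].transform("size")
df.loc[group_sizes < 2, "country"] = "Other"

print(is_k_anonymous(df, ["country", "tenure_years"], k=2))  # True
```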

Example with k=5

If a higher k means improved privacy, why don’t we just set it to a really big number? Let’s see what happens when we increase k to 5—each combination of quasi-identifiers appearing at least five times. To get there, we even have to generalize country into continent:

| continent | tenure_years | departments | count |
| --- | --- | --- | --- |
| Europe | 0-5 | HR, Sales, Finance | 5 |
| Other | 0-10 | Engineering, Sales, Marketing | 5 |
For k = 5, only two groups remain.

It’s now much harder to identify individual employees in the dataset. But as a result, we had to remove so much information that it lost its usefulness. Any employee outside of Europe is grouped under “Other”, and the range of tenure becomes so big that we may as well remove it altogether.

Lower k values increase specificity, while higher k values offer better privacy protection. The best value for k depends on your dataset and how sensitive your data is.

L-diversity

Let’s assume we choose k=2 so that all combinations occur at least twice. Unfortunately, it still fails to protect employees in Germany, Japan, and China.

Since all the employees in Asia work in Engineering, we’ve done little to protect Yuki’s privacy. You may also have realized that it doesn’t actually matter how many employees are in the Other group if all of them are in Engineering. The problem isn’t the size of the group, but the lack of diversity in the departments column.

This is where l-diversity comes in. L-diversity builds on k-anonymity to provide even more privacy by ensuring a given level of diversity in sensitive attributes within each group.

The Germany and Other groups each have only one department value, so their diversity is just l = 1.

L-diversity ensures that there’s sufficient diversity within each combination of quasi-identifiers. For example, l=5 means that within each group, there are at least 5 well-represented values for the sensitive attribute (in this case, department).

What is a well-represented value?
Note that l-diversity doesn’t necessarily mean l unique values in each group. For example, if the UK had 9 employees in Sales and only one in HR, an external actor could guess with 90% confidence that an employee works in Sales.
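
The simplest variant, distinct l-diversity, only counts unique values, so it wouldn't catch the skewed UK example above (entropy l-diversity is a stricter variant that would). Here's a sketch of the distinct version, again with a helper name of our own choosing:

```python
def is_l_diverse(df, quasi_identifiers, sensitive_column, l):
    """True if every group contains at least l distinct values of the sensitive
    column (distinct l-diversity; stricter variants weigh how balanced they are)."""
    return bool((df.groupby(quasi_identifiers)[sensitive_column].nunique() >= l).all())

# Our 2-anonymous data fails: Germany and Other each hold a single department
print(is_l_diverse(df, ["country", "tenure_years"], "department", l=2))  # False
```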

In our example, one way to achieve l=2 is to move all employees in Germany into the Other group, widening that group’s tenure range to 0-10 in the process. Note that, as a result, the dataset is now 3-anonymous.
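
Continuing the sketch from before, the merge (and the wider tenure range it forces) can be applied and verified like this:

```python
# Fold Germany into "Other" and widen that group's tenure range accordingly
df.loc[df["country"] == "Germany", "country"] = "Other"
df.loc[df["country"] == "Other", "tenure_years"] = "0-10"

print(is_k_anonymous(df, ["country", "tenure_years"], k=3))              # True
print(is_l_diverse(df, ["country", "tenure_years"], "department", l=2))  # True
```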

| country | tenure_years | departments | count |
| --- | --- | --- | --- |
| USA | 0-5 | Sales, Marketing | 3 |
| UK | 0-5 | HR, Sales | 3 |
| Other | 0-10 | Finance, Engineering | 4 |
k = 3 and l = 2

Now, each group has at least two different department values, satisfying l-diversity with l=2. Even if someone knew Yuki was based in Japan, it’s no longer trivial to deduce that he’s in Engineering. Unfortunately—just like for k=5—the tenure range (0-10) is likely too wide to be useful.

As you can see, setting the values of k and l involves balancing data utility against privacy protection. It also illustrates an important point: using k-anonymity and l-diversity doesn’t guarantee the absence of privacy risk.

Limitations and considerations

K-anonymity and l-diversity allow us to communicate privacy risk in a dataset. However, as we saw throughout the article, these techniques have limitations:

  • Homogeneity attacks: An attacker can still infer sensitive information if all sensitive values within a k-anonymous group are the same.
  • Background knowledge attacks: Additional information might allow attackers to narrow down possibilities.
  • Skewness attacks: Even with l-diversity, if one value is much more frequent, high-probability inferences are possible.
  • Similarity attacks: If the sensitive values in a group are semantically similar, it may still allow harmful inferences.
  • Data utility trade-off: As we increase privacy protections, we often lose some of the data’s usefulness or specificity.

In practice, it may be impossible to completely eliminate the risk of re-identification without rendering the data useless in the process. The best way to remove privacy risk is to avoid collecting or sharing the information in the first place.

Learn more

To learn more, I recommend the Data Privacy Handbook by Utrecht University. In addition to k-anonymity and l-diversity, it also covers t-closeness—another technique to further reduce privacy risk—along with videos on each technique.

ARX is an open source tool for anonymizing sensitive personal data. It supports a range of data anonymization techniques, including k-anonymity and l-diversity.

Also, if you’re using Google Cloud Platform, check out their Sensitive Data Protection service, which lets you compute k-anonymity and l-diversity on your datasets.

Conclusion

As data professionals, we need to balance privacy risks and data utility when sharing sensitive data. K-anonymity and l-diversity are two data anonymization techniques that can help you reason about, and make conscious decisions on, the privacy risks in a dataset.

Unfortunately, it may be close to impossible to guarantee that an attacker won’t be able to re-identify individuals in a dataset. Data anonymization techniques should only be used as one part of a larger, comprehensive privacy program.

Edit 2024-11-11: Added ARX to the list of resources. Thanks to FjordWarden for the recommendation!
