
Synthetic Data vs Other Privacy Preserving Technologies

“Privacy is not for sale, it’s a valuable asset to protect.”

Stephane Nappo, VP & Global CISO, Groupe SEB

TL;DR: Conventional anonymization techniques and PPTs struggle to strike a balance between robust privacy protection and fostering agility and innovation, creating a gap that synthetic data is uniquely placed to fill.

The Bottom Line:

Did you know that, in 2019, researchers in Europe built an algorithm capable of re-identifying real people in almost any anonymized dataset using just 15 demographic attributes?

Gone are the days when you could simply change people’s names or mask a few details and expect that to work. To protect people’s privacy today, organizations that work with sensitive data need to invest in sophisticated privacy preserving technologies (PPTs).

But that’s not all. The PPT you use can’t just protect privacy. It also has to support your machine learning and data analytics projects. It can’t interfere with data utility. It has to be scalable.

That means you need to think very carefully about whether to go down the road of encryption, differential privacy, or a distributed systems approach. Or whether to eliminate the privacy risk completely by using artificially generated, synthetic data.

Under the Hood:

When “Robert Galbraith” released their debut novel, The Cuckoo’s Calling, in 2013, it took the world no time to figure out that this was, in fact, superstar children’s author J.K. Rowling. Even with no other personal identifiers to look to for clues, the switch of name and gender wasn’t enough to fool a sensitive reader attuned to Rowling’s writing style. And it was certainly no match for AI-powered language analysis trained to identify her unique linguistic signature.

Masking a person’s true identity is a complex business; one that even the most robust anonymization techniques and privacy preserving technologies (PPTs) are ill-equipped to perform. There are simply too many clues and signatures in the data that can be traced back to a real person.

In fact, you don’t need much contextual data to re-identify or reverse engineer an anonymized dataset. In 2019, researchers in Belgium and the UK developed an algorithm that correctly re-identifies nearly every real person in any anonymized dataset using just 15 demographic attributes.

It’s far from the only method that works. One of the researchers involved in this study had previously found a way to re-identify, with 90% accuracy, individuals in a dataset of 1.1 million people based on three months of credit card metadata. Using just four spatio-temporal points to analyze human mobility traces, he also re-identified 95% of individuals in an anonymized smartphone location dataset of 1.5 million users. Clearly, it’s astonishingly easy to crack the code and figure out who real data belongs to.
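To build an intuition for why so few attributes suffice, here’s a minimal sketch (using pandas, with an invented toy dataset) that counts how many records are pinned down uniquely by a handful of quasi-identifiers. In real data, almost every combination of zip code, gender, and date of birth turns out to be unique:

```python
# Minimal sketch: how many records are uniquely identified by a few
# quasi-identifiers? (Toy data invented for illustration.)
import pandas as pd

df = pd.DataFrame({
    "zip":    ["60614", "60614", "60615", "60614", "60616"],
    "gender": ["F", "M", "F", "F", "M"],
    "dob":    ["1985-03-12", "1985-03-12", "1990-07-01",
               "1979-11-30", "1985-03-12"],
})

quasi_identifiers = ["zip", "gender", "dob"]

# Group records by their combination of quasi-identifiers; a group of
# size 1 means that combination points at exactly one person.
group_sizes = df.groupby(quasi_identifiers).size()
unique_rows = (group_sizes == 1).sum()

print(f"{unique_rows} of {len(df)} records are pinned down "
      f"by just {len(quasi_identifiers)} attributes")
```

Every record in this toy table is unique on only three attributes; with 15, the odds of hiding in a crowd all but vanish.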

So are there any PPTs that can honestly live up to the name? And where does synthetic data fit into the mix?

Let’s take a look at 7 of the most commonly used anonymization and privacy preservation techniques, from the simplest solutions through to more sophisticated approaches.

1. Data Generalization

Data generalization seeks to anonymize personally identifying data without undermining overall accuracy by making some details less specific. For example, this could involve deleting zip codes or the first line of a person’s address, but retaining their hometown. However, making these details vaguer can weaken analysis later on. And by cross-referencing the dataset with others that fill in the gaps, it may still be possible to identify individual people.
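In practice, generalization is just coarsening: truncating codes and bucketing values. Here’s a minimal pandas sketch; the column names and bucket edges are illustrative assumptions, not a prescribed scheme:

```python
# A minimal sketch of data generalization: coarsen identifying detail
# rather than deleting the record. (Illustrative columns and buckets.)
import pandas as pd

df = pd.DataFrame({
    "zip":  ["60614", "60615", "10003"],
    "age":  [23, 37, 61],
    "city": ["Chicago", "Chicago", "New York"],
})

# Truncate 5-digit zip codes to a 3-digit prefix.
df["zip"] = df["zip"].str[:3] + "**"

# Replace exact age with a coarse band.
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 40, 50, 60, 120],
                        labels=["<30", "30-39", "40-49", "50-59", "60+"])
df = df.drop(columns=["age"])

print(df)
```

The privacy-utility trade-off is visible immediately: the wider the buckets, the safer the data, and the blunter any downstream analysis.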

2. Pseudonymization

This is another simple approach: it replaces names with pseudonyms, or other identifying data with fake details. It’s handy if you’re using the data for straightforward, in-house applications, but it’s certainly not enough to prevent re-identification.
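A common way to do this is with a keyed hash, so the same person always maps to the same token and joins across tables still work. A minimal sketch, where the secret key and column names are invented for illustration:

```python
# Minimal pseudonymization sketch: replace names with stable pseudonyms
# via a keyed hash (HMAC), so one person always gets the same token.
import hashlib
import hmac

import pandas as pd

# Illustrative secret; in practice, keep it outside the dataset.
SECRET_KEY = b"rotate-me-and-store-me-elsewhere"

def pseudonymize(value: str) -> str:
    # HMAC rather than a bare hash, so an attacker can't simply hash a
    # list of known names and compare the results.
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256)
    return "user_" + digest.hexdigest()[:12]

df = pd.DataFrame({"name": ["Joanne Rowling", "Robert Galbraith"],
                   "purchase": [42.50, 13.99]})
df["name"] = df["name"].map(pseudonymize)
print(df)
```

Note that every other column survives untouched, which is exactly why pseudonymization alone can’t stop linkage attacks like the ones described above.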

3. Data Masking

With data masking, you create a mirror version of your dataset or database and then modify some elements to mask the true values. For example, some words and characters might be shuffled, replaced with symbols, or encrypted. This is handy because it hinders reverse engineering, but again, you lose granular detail that may have been useful for machine learning models. Someone who is determined to re-identify individual people may still be able to work with the information they can see. And, as with any form of encryption, you still need to worry about what happens if the wrong person gets hold of the encryption key – including the risk of internal attacks by individuals who have legitimate access to it.
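As a concrete illustration, here’s a minimal masking sketch in pandas. The masking rules (keep the first letter of an email, show only the last four card digits) are invented examples of typical policies:

```python
# Minimal data-masking sketch: build a modified copy of the table,
# leaving the source table untouched. (Illustrative masking rules.)
import pandas as pd

df = pd.DataFrame({
    "email": ["jane.doe@example.com", "bob@example.org"],
    "card":  ["4111111111111111", "5500005555555559"],
})

masked = df.copy()  # the "mirror version"; originals stay intact

# Keep only the first character of the local part of each email.
masked["email"] = masked["email"].str.replace(
    r"(^.).*(@.*$)", r"\1***\2", regex=True)

# Show only the last four digits of each card number.
masked["card"] = "*" * 12 + masked["card"].str[-4:]

print(masked)
```

Notice what survives: domains, card lengths, row counts. That residue is precisely what a determined attacker can still work with.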

4. Data Swapping

Data swapping takes things up a notch by applying shuffling and permutation techniques to switch attributes between records. For example, personally identifying data like home addresses or dates of birth could be swapped around, so that individual records no longer relate to one specific person, rendering them anonymous and protecting privacy. However, the sensitive data is still there – it’s just no longer connected to the right person – and some of that data, like addresses, could be exploited in any form. Swapping also breaks the correlations between attributes, so the joint statistical structure of the dataset, and any predictive patterns you draw from it, may now be misleading.
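Mechanically, swapping is just an independent shuffle of each sensitive column. A minimal sketch with invented columns:

```python
# Minimal data-swapping sketch: independently permute the sensitive
# columns so each row's attributes no longer belong to one person.
import pandas as pd

df = pd.DataFrame({
    "name":    ["Alice", "Bob", "Carol"],
    "address": ["1 Elm St", "2 Oak Ave", "3 Pine Rd"],
    "dob":     ["1985-03-12", "1990-07-01", "1979-11-30"],
})

swapped = df.copy()
for col in ["address", "dob"]:
    # Shuffle each sensitive column on its own. .values drops the index
    # so pandas doesn't realign the shuffle back into place.
    swapped[col] = swapped[col].sample(frac=1).values

print(swapped)
```

Each column’s own distribution is preserved exactly, but the relationships between columns are destroyed – which is why models trained on swapped data can mislead.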

5. Homomorphic Encryption

The idea behind this cryptographic method is that you can process and manipulate data without its values ever being revealed. This means you can perform some analytics functions while the data remains secure and user privacy stays intact. However, the scope of operations you can actually perform is limited: partially homomorphic schemes support only a narrow set of computations, such as addition. If you want to apply advanced algorithms, you’ll need fully homomorphic encryption, which remains far too inefficient and unwieldy for most practical use.

What’s more, this method is only useful once you know exactly which calculations you want to run on the encrypted data. If you’re still in the exploration phase – making sense of the data’s contents and trying out different methods and models – it won’t work. Once the data is encrypted, you can’t play freely with it or develop a deep understanding of how it all fits together.
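To make the limitation concrete, here’s a minimal sketch of a partially homomorphic scheme, assuming the open-source python-paillier package (`pip install phe`). Paillier supports adding ciphertexts and multiplying a ciphertext by a plaintext scalar – enough for sums and averages, but nowhere near enough for training a model:

```python
# Minimal sketch of partially homomorphic encryption with python-paillier.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

salaries = [52_000, 61_500, 48_250]
encrypted = [public_key.encrypt(s) for s in salaries]

# An untrusted party can compute on the ciphertexts without ever
# seeing the underlying values...
encrypted_total = sum(encrypted[1:], encrypted[0])
encrypted_scaled = encrypted_total * 2  # ciphertext * plaintext scalar

# ...and only the key holder can read the results.
print(private_key.decrypt(encrypted_total))   # 161750
print(private_key.decrypt(encrypted_scaled))  # 323500
```

Anything beyond this arithmetic (comparisons, nonlinear models, free-form exploration) pushes you toward fully homomorphic schemes and their heavy performance cost.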

6. Perturbation

Perturbation alters the dataset by adding random noise to numerical attributes that need to be kept confidential, while leaving other information in its original form. Analysts and data scientists can still compute aggregate statistics such as averages and correlations across the full database, but the values of individual records are obscured. The trouble with this approach is that the noise can introduce biases into data mining, especially when working with very large datasets or performing complex computations. That undermines confidence in predictive models and insights derived from the dataset.
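A minimal sketch of the idea, adding zero-mean Laplace noise to a confidential numeric column. The noise scale here is an illustrative choice; in practice it would be tuned carefully (for example, to a differential-privacy budget):

```python
# Minimal perturbation sketch: add zero-mean Laplace noise to a
# confidential column; individual values blur, aggregates stay close.
import numpy as np

rng = np.random.default_rng(seed=7)
salaries = np.array([52_000, 61_500, 48_250, 58_900, 71_300], dtype=float)

noisy = salaries + rng.laplace(loc=0.0, scale=2_000.0, size=salaries.shape)

print("true mean: ", salaries.mean())
print("noisy mean:", noisy.mean())
```

With a handful of records the noisy mean already wobbles noticeably; the bias and variance problems the prose describes follow directly from that trade-off.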

7. Synthetic Data Generation

Synthetic data takes a completely different approach to anonymization. It doesn’t try to obscure, modify, or encrypt the underlying data at all. Instead, a machine learning model trained on the real-world production data artificially generates new data that retains the original’s key characteristics, attributes, and correlations. But while the synthetic dataset is statistically indistinguishable from the real one (and should lead to the same insights and predictive models), the data no longer traces back to real people. This means you can use, share, and transfer the synthetic dataset however you like, without the disclosure risks.
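To make the idea tangible, here’s a deliberately simple sketch: fit a basic generative model (a Gaussian mixture from scikit-learn) to two invented numeric columns, then sample brand-new rows from it. Production synthetic data tools use far richer models (GANs, copulas, transformers); this only illustrates the fit-then-sample principle:

```python
# Minimal, illustrative synthetic data sketch: fit a generative model
# to real numeric columns, then sample entirely new rows from it.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(seed=0)
# Stand-in for real production data: correlated [age, annual_spend].
age = rng.normal(45, 12, size=1_000)
spend = 80 * age + rng.normal(0, 400, size=1_000)
real = np.column_stack([age, spend])

model = GaussianMixture(n_components=3, random_state=0).fit(real)

# Sample fresh records: statistically similar, but no sampled row
# corresponds to any real individual.
synthetic, _ = model.sample(n_samples=1_000)

print("real corr:     ", np.corrcoef(real.T)[0, 1].round(3))
print("synthetic corr:", np.corrcoef(synthetic.T)[0, 1].round(3))
```

The correlation structure carries over into the sampled data, which is the property that lets synthetic datasets stand in for real ones in analytics and model training.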

Final Thoughts: How does Synthetic Data Stack Up Against the Competition?

Until now, PPTs and anonymization techniques have approached the problem of data privacy in largely the same way. They achieve their goals with varying degrees of success, but ultimately they are all concerned with the same question: how do you make sure real people can’t be identified from looking at this dataset?

Synthetic data takes a completely novel approach. Rather than modifying or disguising the dataset, it replaces it – and in doing so, removes the problem of identification entirely. Done right, this goes much further than simply eliminating privacy concerns. It also frees up the data, facilitating greater collaboration and innovation. It overcomes scalability issues by letting you generate more high-quality data as and when you need it for training purposes. And by letting you adapt the generation parameters, it gives you more scope to address underlying biases in the data, leading to better results and fairer outcomes.

What’s more, while real historical data can, by its nature, only look backward, synthetic data generation helps organizations look to the future. It drives predictive analytics that is highly responsive to a fast-changing situation on the ground. It helps you stress-test your systems based on your forecasts. This next-generation technology doesn’t just preserve privacy – it protects your future, too.

Want to see for yourself how synthetic data generation stacks up to the alternatives? Schedule a demo here >