The Importance of Synthetic Data for Privacy Preservation

“If you put a key under the mat for the cops, a burglar can find it, too. Criminals are using every technology tool at their disposal… If they know there’s a key hidden somewhere, they won’t stop until they find it.” 

Tim Cook, CEO of Apple

TL;DR: Anonymization is no longer enough to guarantee data privacy. To ensure that your data-driven activities put no individuals at risk, you need realistic data that doesn’t link back to real people.  

The Bottom Line: 

Remember when AOL caused panic by publishing the complete internet search histories of 650,000 people? Until the (swift) backlash, the internet service provider assumed that anonymizing names was enough, and simply hadn’t considered how easy it would be to figure out a person’s identity from clues in their search behavior. How wrong they were. 

Thankfully, today’s tech giants, financial institutions, and other organizations that work with sensitive data understand deeply that basic anonymization isn’t enough to protect people’s privacy. They’re also bound by a swathe of regulations that limit what they can do with real people’s data. 

But there’s only so much you can do. Bad actors are always lurking, looking for ways to steal data or to join the dots in anonymized datasets. 

Meanwhile, sophisticated technologies like homomorphic encryption and differential privacy can make privacy breaches all but mathematically impossible – but at the expense of utility and innovation. 

Little wonder, then, that synthetic data is fast becoming a cornerstone of data privacy. By disconnecting the distribution, behavior, and qualities of original, raw data from actual people, synthetic datasets preserve all the value and potential of high-quality datasets with none of the privacy risks. 

Under the Hood:

Cast your mind back to August 2006. NASA has launched the New Horizons Probe. Shakira’s No.1 hit “Hips Don’t Lie” is all over the radio. Google is in the process of buying YouTube. And AOL has just made public the complete search histories of 650,000 internet users.

This wasn’t an error: AOL meant to release the data. The rationale was that all the names had been anonymized, so no one’s privacy was at risk. But within minutes, there was uproar.

Clearly, simply changing a person’s name isn’t enough to protect their privacy. People search their own names to see what comes up. They search the names of people they know. They search for their school, their workplace, the opening times of their doctor’s clinic. It’s not that tricky to piece together someone’s identity from all these clues – and once you’d cracked it, you’d know everything they had ever looked up. You could figure out which bank and insurance companies they used. What medication they were on. What embarrassing ailments they’d suffered. What NSFW videos they’d watched that they’d hate for their partners, parents, or employers to know about. Anonymization did nothing to protect them against this spectacular breach of privacy. 

AOL realized immediately that they’d messed up and took down the data, but by then it was too late: copies had already spread across the web and the backlash was intense. Other internet giants and companies dealing with sensitive data got the message. If you’re going to handle people’s secrets, you need to be absolutely certain that you can protect their privacy.

The moral of the story is no less true today than it was in 2006. In fact, as data analytics and predictive modeling get more and more sophisticated, companies have to work harder than ever to ensure dots aren’t being connected in ways that inadvertently leak a person’s identity. Meanwhile, cybercriminals are getting bolder and smarter. They’re more skilled than ever before at reverse engineering models, at hacking systems, at sleuthing out and identifying the real people behind the numbers. 

The fact of the matter is that anonymization simply isn’t up to the task of protecting data privacy. Financial institutions, medical providers, and other keepers of sensitive data know this only too well. That’s why so many leading companies have embraced sophisticated alternatives that scramble or obscure the underlying data, protecting real people from prying eyes. 

Homomorphic encryption, for example, takes a cryptographic approach, allowing computations to run on data while it stays encrypted. Differential privacy uses randomized perturbation, adding carefully calibrated noise to query results. Done right, both provide strong mathematical guarantees that no one can break into the data and identify real people. 
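To make “randomized perturbation” concrete, here is a minimal sketch of the Laplace mechanism, the textbook building block of differential privacy. It assumes a simple counting query with sensitivity 1; the `laplace_count` helper and the toy `users` list are invented purely for illustration, not a description of any particular product.

```python
import numpy as np

def laplace_count(data, predicate, epsilon=1.0):
    """Answer a counting query with epsilon-differential privacy.

    A count has sensitivity 1 (adding or removing one person changes the
    result by at most 1), so Laplace noise with scale 1/epsilon is enough
    to mask any individual's presence in the dataset.
    """
    true_count = sum(1 for record in data if predicate(record))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical example: count customers over 40 without exposing anyone.
users = [{"age": 34}, {"age": 51}, {"age": 47}, {"age": 29}]
print(laplace_count(users, lambda u: u["age"] > 40, epsilon=0.5))
```

The smaller the epsilon, the more noise is added and the stronger the guarantee, which is exactly the utility trade-off discussed next.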

But what then? What can you actually do with this data? As soon as you try to use it for a complex project – to support and train advanced algorithms, for example – you run into efficiency, utility, and scalability problems. 

Other privacy-protection innovations have tried to get around this problem by splitting things up in some way. Secure Multi-Party Computation (SMPC) splits the data into encrypted shares held by different parties, so that no single party can unlock it on its own. Federated Machine Learning (FML) takes a similar approach: the data stays with its respective owners, each of whom trains a machine learning model locally, and those local models are later combined into a single global model. But in both cases, at some point, you have to join things together, and at that moment you risk a weak link. With SMPC, you need to ensure that the lines of communication between the parties are also completely secure and unhackable. With FML, you still need to stop the combined model from being reverse engineered to reveal the data it was trained on. 
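To illustrate the federated pattern described above, here is a toy sketch of federated averaging: each party fits a model on data that never leaves its premises, and only the model parameters travel to a coordinator. The `local_fit` and `federated_average` helpers and the simulated parties are hypothetical and heavily simplified; real FML systems add secure aggregation, many training rounds, and other safeguards.

```python
import numpy as np

def local_fit(X, y):
    """Each data owner fits a least-squares model on data that never leaves them."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def federated_average(local_weights, sizes):
    """The coordinator sees only model parameters, averaged by dataset size."""
    return np.average(np.stack(local_weights), axis=0, weights=np.asarray(sizes, float))

# Hypothetical example: three institutions with the same schema but private rows.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
parties = []
for n in (100, 250, 80):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    parties.append((X, y))

local_models = [local_fit(X, y) for X, y in parties]
global_model = federated_average(local_models, [len(X) for X, _ in parties])
print(global_model)  # close to [2.0, -1.0], with no raw data ever shared
```

Even in this toy version, the weak link is visible: the coordinator, and anyone who intercepts the parameters, learns something about the parties’ data unless the aggregation step is itself protected.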

For as long as the data relates back to real people, cybercriminals are incentivized to steal it. That means you’re fighting an uphill battle. You need to keep making your protection stronger and stronger, disguising and contorting your data in more and more elaborate ways until it’s simply not as valuable or useful to you anymore. 

But there is another option. 

What if that data didn’t lead back to a real person at all? What if there was no one’s identity to steal? No one’s privacy to expose? 

What if your dataset perfectly replicated the structure and features of the underlying, real-life one, providing the same insights and value, but without the privacy risks?

This is the beauty of synthetic data. 

Synthetic data contains no personally identifiable information, so it poses no risk to user privacy. Because it doesn’t describe real people, it falls outside the scope of privacy regulations such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and the Health Insurance Portability and Accountability Act (HIPAA). It removes the privacy headache entirely. 

And it’s viable. Synthetic data generation produces high-quality datasets that can be scaled up as required, at a very reasonable cost. Smaller organizations can compete with far larger ones, generating datasets that far outsize their original customer data. Once you’ve built your generative model, producing this data is far more cost-effective than gathering information from the real world. 

It’s also statistically indistinguishable from the real-world data. The distribution, behavior, and qualities of the original, raw data are carried through; they just aren’t linked to real people or identifying information. 
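As a toy illustration of that principle (not of Datomize’s actual generator), the sketch below fits a simple, off-the-shelf generative model, a Gaussian mixture, to a made-up two-column dataset and then samples brand-new rows. The columns, parameters, and data are all invented for the example; the point is only that the synthetic rows reproduce the real data’s means and correlation while corresponding to no actual customer.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy "real" dataset: two correlated numeric columns (say, income and spend).
rng = np.random.default_rng(42)
income = rng.lognormal(mean=10.5, sigma=0.4, size=5_000)
spend = 0.3 * income + rng.normal(scale=2_000, size=5_000)
real = np.column_stack([income, spend])

# Fit a simple generative model to the joint distribution of the real data...
model = GaussianMixture(n_components=5, random_state=0).fit(real)

# ...then sample brand-new rows that follow the same statistics
# but correspond to no actual person.
synthetic, _ = model.sample(n_samples=5_000)

print("real means:     ", real.mean(axis=0))
print("synthetic means:", synthetic.mean(axis=0))
print("real corr:      ", np.corrcoef(real.T)[0, 1])
print("synthetic corr: ", np.corrcoef(synthetic.T)[0, 1])
```

Production-grade generators handle mixed data types, rare categories, and privacy testing, but the underlying idea is the same: learn the statistics, then sample fresh records from the model rather than copying real ones.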

The implications for businesses, especially financial institutions, are enormous. You can significantly reduce the costs associated with maintaining data privacy and begin to innovate and experiment more freely and ambitiously with data. You can collaborate seamlessly within the organization and with external partners. You can even speed up your projects right from the outset, fine-tuning parameters for data generation so that the results will integrate frictionlessly with any digital transformation initiatives in your company. 

If you are working with sensitive data, you have an obligation to make privacy paramount in everything you do. By working with synthetic data, you make the problem of privacy risk a thing of the past. 

What Next?

Keen to find out how Datomize’s synthetic data generation protects privacy?
Request a demo here >