Privacy is paramount when it comes to data. You need to be certain that you’re protecting the real people behind the numbers from leaks, hacks and disclosure risks. That means choosing the right technology to preserve privacy while also getting the most you can out of your datasets.
So what are your options? Well, at the most basic level, you have techniques like data generalization, pseudonymization and data masking. These vary in sophistication, but they all boil down to removing or replacing personally identifiable information.
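To make those three techniques concrete, here is a minimal sketch of each applied to a single record. The field names, salt value and formats are illustrative, not drawn from any particular standard or product:

```python
import hashlib

# Illustrative only: field names, salt and formats are made up.
SALT = "example-salt"  # in practice, a long secret random value

def pseudonymize(value: str) -> str:
    # Pseudonymization: replace an identifier with a salted hash,
    # so the same person maps to the same token without revealing who they are.
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:12]

def generalize_age(age: int) -> str:
    # Generalization: coarsen an exact age into a 10-year band.
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def mask_email(email: str) -> str:
    # Masking: hide most of the local part of an email address.
    local, domain = email.split("@")
    return local[0] + "***@" + domain

record = {"name": "Alice Smith", "age": 34, "email": "alice@example.com"}
safe = {
    "name": pseudonymize(record["name"]),
    "age": generalize_age(record["age"]),
    "email": mask_email(record["email"]),
}
print(safe)
```

Note that all three transformations keep the record usable for analysis at some granularity — that is exactly why they leave re-identification routes open.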
This makes it a bit more difficult to re-identify real people, but certainly not difficult enough to be truly secure. As researchers in Belgium and the UK proved with this algorithm in 2019, it’s perfectly possible to correctly re-identify nearly every real person in an enormous, supposedly anonymized dataset if you have just 15 demographic attributes to work with.
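The mechanics behind that result are easy to see at toy scale: the more attributes you combine, the more records become unique, and a unique record is a re-identifiable one. The four-person dataset below is entirely made up (the real study modeled 15 attributes over millions of people), but it shows the same effect:

```python
from itertools import combinations

# Made-up records: no single attribute identifies anyone,
# but combinations quickly do.
people = [
    {"zip": "10001", "sex": "F", "birth_year": 1985},
    {"zip": "10001", "sex": "F", "birth_year": 1990},
    {"zip": "10001", "sex": "M", "birth_year": 1985},
    {"zip": "10002", "sex": "F", "birth_year": 1985},
]

def unique_fraction(attrs):
    # Fraction of records whose combination of the given attributes
    # appears exactly once in the dataset.
    keys = [tuple(p[a] for a in attrs) for p in people]
    return sum(keys.count(k) == 1 for k in keys) / len(keys)

for n in (1, 2, 3):
    best = max(combinations(("zip", "sex", "birth_year"), n),
               key=unique_fraction)
    print(f"{n} attribute(s): up to {unique_fraction(best):.0%} of records unique")
```

With one attribute, a quarter of these records are unique; with all three, every record is — and the same curve, stretched over 15 attributes, is what breaks anonymization at population scale.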
Gone are the days when you could just change people’s names or mask a few details and expect that to work. To protect people’s privacy today, organizations that work with sensitive data need to invest in sophisticated privacy-preserving technologies (PPTs).
Data swapping goes a little further by mixing up the data, so that data points that actually belong to one person are re-attributed to another. This creates a new problem: it distorts the statistical relationships in the dataset, so you can no longer tease out accurate trends, patterns and predictions based on how people actually behave.
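You can see that distortion directly. In this sketch (with made-up records where income rises linearly with age), shuffling the income column between people destroys the age–income relationship an analyst would want to study:

```python
import random

# Made-up records: income rises linearly with age, so the
# age-income correlation is exactly 1.0 before swapping.
random.seed(0)
records = [{"age": a, "income": 20000 + a * 1000} for a in range(25, 65, 5)]

# Data swapping: re-attribute the sensitive field across records.
incomes = [r["income"] for r in records]
random.shuffle(incomes)
swapped = [{"age": r["age"], "income": inc} for r, inc in zip(records, incomes)]

def corr(xs, ys):
    # Pearson correlation, computed by hand to stay dependency-free.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

ages = [r["age"] for r in records]
print("original age-income correlation:", corr(ages, [r["income"] for r in records]))
print("swapped  age-income correlation:", corr(ages, [r["income"] for r in swapped]))
```

Each individual value still exists in the dataset, so marginal totals survive, but the cross-attribute structure — the part analytics actually relies on — is what gets scrambled.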
Then there’s perturbation and differential privacy, both of which add random noise to obscure details that need to be kept confidential. Differential privacy is the more effective and sophisticated of the two approaches; it keeps the statistical distribution of the dataset intact while preventing the user from telling whether any given individual’s data contributed to a query result. The trouble is, it only works well on very large datasets, makes results trickier for end-users to interpret, and can make using the data at all a complicated task.
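The noise-adding idea can be made concrete with the classic Laplace mechanism for a counting query — a textbook construction, sketched here with made-up data and an illustrative epsilon:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Sample from Laplace(0, scale) by inverse transform sampling.
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def private_count(data, predicate, epsilon: float) -> float:
    # A counting query has sensitivity 1: adding or removing one person
    # changes the count by at most 1. Laplace noise with scale 1/epsilon
    # therefore makes the released answer epsilon-differentially private.
    true_count = sum(1 for x in data if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon)

# Illustrative dataset and query: "how many people are over 40?"
ages = [23, 35, 41, 29, 52, 38, 47, 31]
print(private_count(ages, lambda a: a > 40, epsilon=0.5))
```

Each release is the true answer plus fresh noise, so averages over many queries stay accurate while any single individual’s presence is hidden — which is also why small datasets and noisy, non-integer answers make the technique awkward in practice.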
Which brings us on to the next challenge. It isn’t enough for a PPT to protect privacy alone. Remember: you’re using a PPT because you need to protect privacy while you actually use the data for analytics and machine learning projects. If it undermines data utility or doesn’t scale, it won’t work.
This isn’t just an issue for differential privacy. It’s also a major consideration if you’re weighing up a cryptographic method like homomorphic encryption; the Secure Multi-Party Computation (SMPC) protocol, which splits encrypted data between multiple parties; or the promising new distributed systems approach, Federated Machine Learning (FML), which trains machine learning models on data that never leaves its source.
Put simply, all these conventional anonymization techniques and PPTs have a problem when it comes to balancing privacy protection with facilitating innovation. This has created a gap in the market that synthetic data is perfectly positioned to fill.
With synthetic data, you use the source production data to generate an entirely new dataset. This new, synthetic dataset retains all the key characteristics, attributes and predictive potential of the original dataset. In fact, statistically speaking, it’s indistinguishable from the real data. But it doesn’t trace back to any real people, meaning that it completely eliminates any disclosure risk.
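Production synthetic data generators rely on rich models (deep generative networks, copulas and the like), but the underlying principle fits in a few lines: fit a statistical model to the source data, then sample brand-new records from the model. As a deliberately simplified toy — independent normal distributions per column, with made-up source data — the idea looks like this:

```python
import random
import statistics

# Toy sketch only: real generators capture far more structure than
# independent per-column normals. Source data here is simulated.
random.seed(1)
source = [{"age": random.gauss(40, 10), "income": random.gauss(55000, 8000)}
          for _ in range(1000)]

# "Fit" a model: per-column mean and standard deviation.
model = {}
for col in ("age", "income"):
    vals = [r[col] for r in source]
    model[col] = (statistics.mean(vals), statistics.stdev(vals))

# Sample entirely new records from the model: no row traces back
# to any real individual, yet the distribution matches the source.
synthetic = [{col: random.gauss(mu, sd) for col, (mu, sd) in model.items()}
             for _ in range(1000)]

for col in ("age", "income"):
    real_mean = statistics.mean(r[col] for r in source)
    synth_mean = statistics.mean(r[col] for r in synthetic)
    print(f"{col}: source mean {real_mean:.0f}, synthetic mean {synth_mean:.0f}")
```

The synthetic rows reproduce the source statistics without copying any record — that separation between the fitted model and the individuals behind it is what the full-scale generators industrialize.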
This makes synthetic data generation unique in its approach to privacy risk. In fact, it’s the only technology that poses zero risks to the real people behind the data… because they don’t exist. It’s also the only technology that goes beyond simply protecting data privacy; it actually gives you a scalable way to generate high-quality new data, fueling innovation going forward.
For a detailed comparison of how synthetic data generation compares to other leading PPTs, check out this article >