Synthetic Test Data Vs. Data Masking: What Are the Main Differences?

Dr. Sigal Shaked

Once upon a time, using authentic production data in a test environment was a no-brainer for most developers. In the wake of privacy regulations like GDPR, though, it’s simply no longer an option. You can’t take those kinds of risks with real people’s data – and if you get caught flouting the rules, you’ll be in a lot of trouble. A lot of expensive trouble.

There are two main ways to get around this problem: data masking and creating synthetic data. Let’s take a look at how these work, the differences between them, and when you might choose to use each one.

What is Data Masking?

Data masking involves replacing parts of confidential or sensitive data with other types of information, making it harder to identify the real data or people it links back to. The term encompasses several approaches, including anonymization, obfuscation and pseudonymization.

Data masking can take several forms. For example, you could replace names and personally identifying details with different characters or symbols. You could swap certain details around, or randomize items like dates, names and account numbers. You could also scramble, null out, delete or substitute parts of the data. At the most sophisticated level, encryption makes it computationally infeasible for a malicious actor to recover the source data without the key.
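
To make these techniques concrete, here is a minimal Python sketch of four of them: character masking, pseudonymization, shuffling and nulling out. Everything in it is an assumption made for illustration, including the sample records, the field names, the helper functions and the placeholder key. It is not a production masking tool, and the keyed-hash approach to pseudonymization is just one common option.

```python
import hashlib
import hmac
import random

# Hypothetical records, invented purely for illustration.
records = [
    {"name": "Alice Martin", "account": "4539148803436467",
     "birth_date": "1984-03-12", "balance": 1520.50},
    {"name": "Bob Chen", "account": "6011000990139424",
     "birth_date": "1990-11-02", "balance": 310.00},
    {"name": "Carla Diaz", "account": "5105105105105100",
     "birth_date": "1975-07-30", "balance": 98000.75},
]

# Assumption: in practice the key would live in a secrets manager,
# never alongside the data it protects.
SECRET_KEY = b"placeholder-key-for-illustration"


def mask_account(account, visible=4):
    """Character masking: hide all but the last `visible` characters."""
    return "*" * (len(account) - visible) + account[-visible:]


def pseudonymize(value, key=SECRET_KEY):
    """Pseudonymization: swap a value for a stable keyed-hash token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


def shuffle_column(rows, column, seed=0):
    """Shuffling: permute one column's values across the rows."""
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


def null_out(rows, column):
    """Nulling out: drop a column's values entirely."""
    return [{**row, column: None} for row in rows]


def mask_records(rows):
    """Apply all four techniques and return new rows (input is untouched)."""
    masked = [
        {**row,
         "name": pseudonymize(row["name"]),
         "account": mask_account(row["account"])}
        for row in rows
    ]
    masked = shuffle_column(masked, "balance")
    return null_out(masked, "birth_date")


masked_records = mask_records(records)
for row in masked_records:
    print(row)
```

Notice the trade-off already creeping in: shuffling keeps the overall distribution of balances but breaks the link between each balance and its account, and nulling out birth dates removes that signal from the dataset altogether.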

However, there are plenty of downsides to masking data, as becomes clear when you start to compare synthetic data vs other privacy-preserving technologies. Firstly, apart from encryption, none of these methods is watertight. There is always a risk that someone will succeed in reverse engineering or re-identifying the actual people that the data relates to, causing a major privacy and security breach.
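
One way to see the re-identification risk is to count how many masked rows are still unique on a handful of seemingly harmless attributes, often called quasi-identifiers. The sketch below is a simplified illustration of that idea (a basic k-anonymity style check). The rows, column names and helper functions are all hypothetical.

```python
from collections import Counter

# Hypothetical masked rows: names are hidden, but other attributes remain.
masked_rows = [
    {"name": "***", "zip": "02139", "birth_year": 1984, "gender": "F", "diagnosis": "asthma"},
    {"name": "***", "zip": "02139", "birth_year": 1984, "gender": "F", "diagnosis": "diabetes"},
    {"name": "***", "zip": "02139", "birth_year": 1990, "gender": "M", "diagnosis": "flu"},
    {"name": "***", "zip": "02140", "birth_year": 1975, "gender": "F", "diagnosis": "asthma"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")


def group_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def smallest_group_size(rows, quasi_identifiers):
    """Return k, the size of the smallest group. k == 1 means someone is unique."""
    return min(group_sizes(rows, quasi_identifiers).values())


def unique_rows(rows, quasi_identifiers):
    """Return the rows that are the only member of their group."""
    sizes = group_sizes(rows, quasi_identifiers)
    return [
        row for row in rows
        if sizes[tuple(row[q] for q in quasi_identifiers)] == 1
    ]


k = smallest_group_size(masked_rows, QUASI_IDENTIFIERS)
exposed = unique_rows(masked_rows, QUASI_IDENTIFIERS)
print(f"k = {k}, rows unique on quasi-identifiers: {len(exposed)}")
```

Any row that is unique on these attributes could, in principle, be matched against an outside source that lists the same attributes next to a name, even though the name column itself has been masked.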

Meanwhile, the trouble with cryptographic methods of data masking is that they interfere with usability. Sure, you’ve created a wall around your data that no one can break through. But what does that mean for your own machine learning models? For your own ability to explore and manipulate the data for data science and predictive analytics?

This is the crux of the problem. Most data masking falls into one of two categories: either it’s not secure enough, or it’s so well-secured that it’s totally unwieldy and impractical for complex AI or software development.

What Are Synthetic Datasets?

Synthetic test data takes a very different approach. Instead of applying various layers of privacy protection to your original dataset, you use a deep learning algorithm to create an entirely new dataset. 

The resulting dataset is designed to mirror the statistical properties of the original, preserving the same features and correlations. As such, it should deliver much the same predictive insights as the “real” one. The difference, though, is that the synthetic records can’t be traced back to any real people, because the people they describe don’t exist.
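
The core idea of fitting a model to the original data and then sampling brand new rows from it can be sketched in a few lines. Note that the example below deliberately swaps the deep learning model for a much simpler stand-in, a two-column Gaussian fitted with the standard library, purely to show the fit-then-sample workflow. The sample data, column meanings and function names are assumptions, and a sketch like this comes with no privacy guarantee of its own.

```python
import math
import random

# Hypothetical "real" rows of (age, annual income), invented for illustration.
real_rows = [
    (23, 31000), (35, 52000), (41, 61000), (29, 40000),
    (52, 83000), (46, 70000), (38, 58000), (60, 91000),
]


def fit_gaussian(pairs):
    """Estimate the means, standard deviations and correlation of two columns."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)
    std_x, std_y = math.sqrt(var_x), math.sqrt(var_y)
    return {
        "mean_x": mean_x, "mean_y": mean_y,
        "std_x": std_x, "std_y": std_y,
        "rho": cov / (std_x * std_y),
    }


def sample_synthetic(model, n, seed=0):
    """Draw n new rows from the fitted model; no original row is copied."""
    rng = random.Random(seed)
    rho = model["rho"]
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mean_x"] + model["std_x"] * z1
        # Mixing z1 into y is what reproduces the correlation between columns.
        y = model["mean_y"] + model["std_y"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


model = fit_gaussian(real_rows)
synthetic_rows = sample_synthetic(model, 5000, seed=1)
refit = fit_gaussian(synthetic_rows)
print(f"original correlation:  {model['rho']:.3f}")
print(f"synthetic correlation: {refit['rho']:.3f}")
```

A Gaussian this simple will happily produce implausible values, such as ages far outside the observed range, and it only captures linear relationships. That gap is part of why more expressive generative models are used in practice.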

That means you can use your synthetic dataset in exactly the same way as you would your original dataset, but without any disclosure risks. For organizations such as banks that work with highly sensitive, carefully regulated forms of data, this is something of a revelation. Real financial data needs to be carefully controlled, which limits its potential. But you can feed synthetic financial data into machine learning models, share it inside and outside the company, or even repackage it for sale, all without falling foul of regulations or putting anyone at risk.

Synthetic Test Data vs. Data Masking

Synthetic data is fast growing in popularity because it bypasses privacy concerns without incurring the issues that come with data masking. You don’t have to sacrifice detail and specificity in the data in order to make it anonymous. You don’t have to swap data around in ways that could interfere with its true meaning and patterns, leading to inaccurate results. You don’t apply layers of encryption that make it hard to use the data or to interpret the results. 

If you’ve been trying to work out how to generate test data for brand new models and applications without causing compliance headaches, this is an excellent solution. You can, essentially, regain the freedom to exploit your internal data however you like. You can combine it with external datasets to get a fuller, more nuanced picture without any disclosure risk. You can use it to develop and test lucrative new products.

So is data masking now obsolete? Not quite. Data masking is a quick and easy fix, in the simplest cases achievable with a single line of code. If you just need an instant or temporary solution to make your stored data GDPR- or CCPA-compliant, this is probably enough.

Figuring out how to generate synthetic data, on the other hand, is a more involved process. It’s something you’re likely to do as part of a larger AI initiative, to create testing data for machine learning models or QA processes, or so that you can share data externally for collaboration with partners and vendors.

Final Thoughts: What Do You Want From Your Data?

Ultimately, whether you should opt for data masking or synthetic test data boils down to what you want to do with it next. Are you just adding a bit more security to datasets that are already safely stored on-premises, and which you don’t need to share or collaborate on in the near future? In that case, data masking is probably enough.

Or are you looking for ways to free up your data so you can actually put it to use? Are you keen to embark on ambitious new machine learning projects, without waiting around for months on end for approval? Do you need to extract the most value possible from your datasets, keeping all the nuanced detail intact, to really drive predictive analytics?

If it’s the latter you’re interested in, data masking just won’t cut it. What you need, without a doubt, is synthetic data.