Best Practices for Synthetic Data Generation

Dr. Sigal Shaked

Machine learning continues to revolutionize operations and products across all businesses and sectors, opening up lucrative opportunities and revenue streams. As it does so, more and more companies that found themselves locked out of complex data projects for compliance reasons are fast waking up to the value of synthetic data for AI.

Or, at least, they’re waking up to one of its valuable uses: privacy protection. Companies that deal with sensitive data are bound by serious data privacy and security rules, as well as limits on what they can do with the data they collect without the clear permission of the people they collected it from. Failure to adhere to these regulations can mean eye-watering fines and serious reputational damage. 

Artificial data generation sidesteps these concerns by creating a new dataset that retains the statistical patterns of the original data but contains no real records. It doesn't lead back to anyone real, so the disclosure risks associated with using it essentially evaporate.

This is great for companies looking to storm ahead with ambitious machine learning projects without delay. However, synthetic data for AI has other enormous benefits too – provided you apply best practices to your data generating process. 

What Are the Synthetic Data Generation Best Practices?

Highlight Negative Scenarios for Testing

One common challenge in early-stage machine learning projects is getting the right kind of test data for QA processes. While you're finessing the model, ironing out final problems and pre-empting strange behavior, most of your production data will follow the hoped-for path: there won't be many mistakes or issues for you to correct. You simply might not have a large enough volume of data illuminating negative scenarios to properly test your model or product.

With synthetic test data generation, you aren't limited by this problem. Before you start the generation process, you can select segments of your production data (i.e. the negative scenarios) and create larger amounts of synthetic test data based on this subset, as in the sketch below. This lets you scale up the amount of data available for testing and training your model, even from a relatively small source.
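To make that concrete, here is a minimal sketch of segment-based generation, assuming a tabular production extract with a "scenario" column that flags the rare negative cases. The file name and column names are hypothetical, and a real project would typically use a dedicated synthesizer that also preserves cross-column correlations.

```python
import numpy as np
import pandas as pd

def synthesize_segment(df: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Generate n_rows of synthetic records that mimic the per-column
    distributions of df: numeric columns get a jittered bootstrap,
    categorical columns are sampled by observed frequency.
    Note: this toy sampler ignores correlations between columns."""
    rng = np.random.default_rng(seed)
    out = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            base = rng.choice(df[col].to_numpy(), size=n_rows, replace=True)
            scale = float(df[col].std(ddof=0)) * 0.05  # small jitter, 5% of std
            out[col] = base + rng.normal(0.0, scale if scale > 0 else 1e-9, size=n_rows)
        else:
            values, counts = np.unique(df[col].astype(str), return_counts=True)
            out[col] = rng.choice(values, size=n_rows, p=counts / counts.sum())
    return pd.DataFrame(out)

# Hypothetical workflow: isolate the rare negative scenarios,
# then scale that small subset into a much larger synthetic test set.
production = pd.read_csv("transactions.csv")                 # assumed extract
negatives = production[production["scenario"] == "failed"]   # assumed label column
synthetic_negatives = synthesize_segment(negatives, n_rows=10_000)
```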

Don’t Remove Outliers and “Edge” Cases

In a normal, “real” dataset, outliers pose a privacy problem. The very fact that something stands out as being unusual or unique makes it easier to re-identify the person the data pertains to, increasing the risk of disclosure. For this reason, many organizations using their own data for AI projects will strip out outliers as part of the data preparation process before anonymizing the data.

The trouble with this approach is that those outliers are potentially extremely important for training your machine learning model. They can help to establish nuance, encouraging you to consider new factors in predictive analytics. They can reveal bugs and rare functionality issues in your AI-driven products. You really shouldn't discard them unless you have a very good reason, and making your dataset more homogeneous so that interesting insights blend into the mass is not a very good reason.

This isn’t a problem when you’re dealing with artificial data generation. You don’t have to disguise outliers in order to protect anyone’s identity, because there is no actual person to protect. When you’re setting the parameters for your synthetic data generation tool, you can reproduce the statistical variation of the original dataset exactly – outliers and all. This should lead to far more accurate models. 
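One quick way to confirm that the outliers actually survived generation is to compare extreme quantiles of the real and synthetic data. The sketch below assumes two pandas DataFrames with matching numeric columns; the names used here are placeholders rather than the output of any specific tool.

```python
import pandas as pd

def tail_fidelity(real: pd.DataFrame, synthetic: pd.DataFrame,
                  quantiles=(0.01, 0.99)) -> pd.DataFrame:
    """Compare extreme quantiles of each numeric column in the real and
    synthetic datasets. Large gaps suggest the generator smoothed away
    the outliers and edge cases you wanted to keep."""
    rows = []
    for col in real.select_dtypes("number").columns:
        for q in quantiles:
            rows.append({
                "column": col,
                "quantile": q,
                "real": real[col].quantile(q),
                "synthetic": synthetic[col].quantile(q),
            })
    return pd.DataFrame(rows)

# Example usage (hypothetical DataFrames):
# print(tail_fidelity(real_df, synthetic_df))
```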

Final Thoughts: Choose Your Tools Wisely

Whatever platform or tools you use for data generation for machine learning, it’s vital that they are up to the task. 

Is the quality of the synthetic test data it creates good enough for your AI and ML projects? Does the system pinpoint the most important features in the source data, synthesizing a new dataset that truly preserves the characteristics and behavior of the original? That includes event sequences, feature distributions, correlations between features, and entity relationships.
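A lightweight way to quantify two of those properties is to compare per-column marginal distributions and the pairwise correlation matrices of the real and synthetic datasets. The sketch below uses SciPy's two-sample Kolmogorov-Smirnov test and pandas correlations; it covers only distributions and correlations, so event sequences and entity relationships would need separate, richer checks.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def fidelity_report(real: pd.DataFrame, synthetic: pd.DataFrame) -> dict:
    """Two quick fidelity signals for numeric columns:
    1. Per-column KS statistic (0 means identical marginal distributions).
    2. Largest absolute gap between real and synthetic correlation matrices."""
    numeric_cols = real.select_dtypes("number").columns
    ks_stats = {
        col: ks_2samp(real[col].dropna(), synthetic[col].dropna()).statistic
        for col in numeric_cols
    }
    corr_gap = (real[numeric_cols].corr() - synthetic[numeric_cols].corr()).abs()
    return {
        "worst_ks_column": max(ks_stats, key=ks_stats.get),
        "worst_ks_statistic": max(ks_stats.values()),
        "max_correlation_gap": float(np.nanmax(corr_gap.to_numpy())),
    }

# Example usage (hypothetical DataFrames):
# print(fidelity_report(real_df, synthetic_df))
```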

What’s more, does this solution meet all your compliance requirements? Is it in line with GDPR, CCPA, HIPAA and any other regulations you need to follow? Sure, the resulting synthetic dataset will (or should) be privacy-risk-free, but what happens to your underlying data? Can the vendor assure you that this can never be exposed while the synthetic data is being generated?

And importantly, does the provider have the deep business and technical expertise to understand what you need from this data, how you will collaborate on it, and how you might share it with fintechs and other third parties?

Not all synthetic data technology providers are created equal. The first “best practice” to follow is, always, ensuring that you and your vendors are on the same page.