The Importance of Synthetic Data for AI Adoption

“We are moving slowly into an era where big data is the starting point, not the end.” – Pearl Zhu, Author & CIO

TL;DR: Businesses looking to unlock lucrative opportunities created by AI in the coming years need to put synthetic data at the center of their data privacy strategy.

The Bottom Line: 

You can’t do AI without data. And not just any data. You need huge quantities of high-quality, accurate data… that’s completely confidential and can’t be traced back to real people.

But how do you achieve the context and specificity you need to generate valuable insights and benefits to the business, without running the risk of exposing anyone along the way?

As more businesses seek to overcome barriers to adopting AI-driven, intelligent data analytics, it’s little wonder privacy-preserving technologies (PPTs) have become vital elements in the AI innovation ecosystem.

But anonymizing or disguising data has limitations. Attacks and leaks are constant risks. Maintaining high levels of protection hinders what you can do with data. Operations are restricted and projects become hard to scale.

Synthetic data, on the other hand, creates realistic datasets that aren’t linked to a real person. 

These are based on rules, patterns, and traits learned from real data, generating profiles and datasets that are indistinguishable from actual ones. There’s no privacy risk, so you’re free to use it however you want, share it with whomever you choose, and scale it up as much as you need. 

Under the Hood

AI innovation is set to skyrocket. 

The EU is investing $20 billion in growing the sector, while governments of the U.S., UK, and China have announced rival, multi-billion-dollar schemes to boost domestic AI markets. 

Forward-thinking businesses are already ahead of the game. Research by Forrester in 2019 found that just over half of decision-makers in the global data and analytics space have already implemented, are currently implementing, or plan to expand/upgrade their AI investments. IDC predicts that three-quarters of enterprises will embed AI into their technology and process development by 2022, and that by 2024, AI will be integral to all parts of the business. 

Within the finance sector, banks, lenders, and insurers are using machine learning to create faster, more accurate, frictionless credit scoring and fraud detection models. To target personalized marketing campaigns with precision. To strengthen their portfolios with AI-driven algorithmic trading and risk assessment. To improve regulatory compliance, cut costs, and optimize processes.

Clearly, AI is the fast-approaching future – and companies that put off the inevitable will sacrifice competitive advantage in the long run. So why are some organizations dragging their heels? What’s holding them back?

AI has a data problem

The biggest hurdle is protecting privacy. 

Building a machine learning model requires swathes of training and testing data. Even if each data point is anonymized or fragmentary at the start of the process, generating valuable insights and making accurate predictions typically involves combining multiple datasets. 

This provides much more context and nuance, but dots are connected, profiles fleshed out and patterns established along the way. The more diverse data streams you use and the better your models get, the greater the risk that a third party could trace back through the information and identify real people whose data has been used. 
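To make that risk concrete, here’s a minimal Python sketch of a classic linkage attack (all records and field names below are invented for illustration): an “anonymized” dataset with no names is joined to a public dataset on shared quasi-identifiers like zip code and birth year, and identities fall out.

```python
def linkage_attack(anonymized, public):
    """Re-identify 'anonymized' records by joining on quasi-identifiers
    (zip code and birth year) shared with a public dataset."""
    return [
        (pub["name"], anon["diagnosis"])
        for anon in anonymized
        for pub in public
        if (anon["zip"], anon["birth_year"]) == (pub["zip"], pub["birth_year"])
    ]

# Hypothetical toy records: the medical set contains no names...
medical = [
    {"zip": "02138", "birth_year": 1965, "diagnosis": "diabetes"},
    {"zip": "90210", "birth_year": 1980, "diagnosis": "asthma"},
]
# ...but a public roll shares the same quasi-identifiers.
voters = [
    {"name": "A. Smith", "zip": "02138", "birth_year": 1965},
    {"name": "B. Jones", "zip": "10001", "birth_year": 1972},
]

print(linkage_attack(medical, voters))  # → [('A. Smith', 'diabetes')]
```

The more datasets an attacker can cross-reference, the more quasi-identifier combinations become unique, and the more matches a join like this produces.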

Not only is this a huge breach of privacy, but if that confidential data is financial, medical, or sensitive for other reasons, this could be exploited by fraudsters and criminals. The costs and reputational damage to the business would be disastrous. 

And that’s if you’re even allowed to use your customers’ data for these purposes. 

Data privacy regulation like GDPR, which governs how you can use people’s data and what you need explicit permission for, is getting stricter all the time. As rules tighten, you could find you’re simply no longer allowed to risk using your in-house data as training and testing data for machine learning models and AI projects, even if it’s anonymized. 

If that happens, you will need a viable, high-quality alternative to “real” data. 

Conventional privacy protection technologies just aren’t enough

As we touched on above, privacy-preserving technologies (PPTs) already exist, but the conventional approach is to focus on anonymization or obfuscation. The technology strips out personally identifying information in the hope that it can no longer be linked back to a real individual. 

Common types of PPT that use this approach include:

  • Homomorphic encryption. A cryptographic approach that allows users to perform data analysis on the dataset without actually seeing the values of the data, which stays encrypted the whole time.
  • Differential privacy. Used on very large datasets, this guarantees the privacy of individual people in the dataset by applying systematic randomized perturbation. The idea is that the dataset’s statistical distribution is preserved, even though individual records have been manipulated.
  • Secure multi-party computation (SMPC). This protocol works by splitting encrypted data among several different parties, so that none of them can retrieve all data by themselves. 
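To give a flavor of how one of these works in practice, here’s a minimal Python sketch of the Laplace mechanism, the textbook way to make a simple statistic (here, a bounded mean) differentially private. The function names and parameters are our own, for illustration only.

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    # Inverse-CDF sampling from a zero-mean Laplace distribution
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_mean(values, lower, upper, epsilon, seed=0):
    """Differentially private mean of bounded values via the Laplace mechanism."""
    rng = random.Random(seed)
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Sensitivity of the mean: one record can shift it by at most this much
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)

# e.g. dp_mean(salaries, lower=0, upper=200_000, epsilon=1.0)
```

The smaller the privacy budget `epsilon`, the more noise is added — which is exactly the utility trade-off discussed below.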

But putting all your eggs in the anonymization or data protection basket is far from a perfect solution. There is always the risk that a bad actor will break through and succeed in stealing, leaking, or manipulating the data. Or that observers will simply connect the dots and figure out a person’s identity. 

The more datasets you combine, the more possible routes a criminal could take to link relevant data points through intelligent analysis, making re-identification more and more likely. 

What’s more, the stronger the protection, the more limitations this is likely to place on the analytic process. 

You may be restricted to performing just a few types of operations. This is a particular problem when applying homomorphic encryption, which typically supports only a limited set of mathematical operations (such as additions or multiplications) in order to preserve the structure of the encrypted data.  
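To see what “operating on data you can’t read” looks like, here’s a toy Python sketch using unpadded textbook RSA, which happens to be multiplicatively homomorphic: multiplying two ciphertexts multiplies the underlying plaintexts. These parameters are far too small for real use; this illustrates the principle, not a production scheme.

```python
# Toy textbook-RSA parameters (insecure, for illustration only)
p, q, e = 61, 53, 17
n = p * q                           # 3233
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent

def encrypt(m: int) -> int:
    return pow(m, e, n)

def decrypt(c: int) -> int:
    return pow(c, d, n)

a, b = 4, 7
# Multiply the ciphertexts without ever decrypting a or b...
product_ct = (encrypt(a) * encrypt(b)) % n
# ...and the result decrypts to the product of the plaintexts.
print(decrypt(product_ct))  # → 28
```

Note the constraint the surrounding text describes: this scheme supports multiplication of hidden values, but nothing else — no comparisons, no branching, no arbitrary analytics.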

Meanwhile, you may find that manually modifying the data to disguise a person’s identity requires you to remove outliers or conceal unusual details, which in turn makes the data appear more homogenous than it really is, intensifying data bias. Maintaining data utility is another major problem, especially when you’re using a method like perturbation. 

The more data you use, the harder it becomes to anonymize that data robustly. As such, you may struggle to find PPTs that are scalable enough to meet the demands of your expanding analytic capabilities. 

Scalability is also a big problem when using SMPC, since you need streamlined, swift, secure communication and collaboration between all parties involved, which gets harder as the project gets more complex and data-intensive. 

The case for synthetic data

This is where synthetic data comes in. 

With synthetic data, you use machine learning to build a model based on real-world source data, and then artificially generate an alternative dataset. You can then use this data to train and test new machine learning models, to share with partners and third parties, and to develop AI projects generally. 

The key here is that the synthetic dataset you create will mirror the correlations and patterns contained in the original. In fact, it should be impossible to distinguish from real data. 

But the fact that it’s not real means there’s no privacy or disclosure risk. There’s no real-world individual to link back to; no potential victim of fraud, identity theft, or privacy violations to worry about. And while the people represented by the data aren’t real, the data performs the same function as “real” data – because it simulates the features and traits of a real-world dataset, you end up with the same results, without the pitfalls. 
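A heavily simplified Python sketch of the idea (real synthetic data generators are far more sophisticated, and the columns here are our own invention): fit a simple statistical model to the source data, then sample fresh records that preserve its means, spreads, and correlation — without copying any actual record.

```python
import math
import random
import statistics

def synthesize(real, n_synthetic, seed=42):
    """Fit a 2-D Gaussian to (x, y) pairs from real data and sample
    synthetic pairs preserving the means, spreads, and correlation."""
    rng = random.Random(seed)
    xs = [x for x, _ in real]
    ys = [y for _, y in real]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    # Pearson correlation between the two source columns
    r = sum((x - mx) * (y - my) for x, y in real) / ((len(real) - 1) * sx * sy)
    r = max(-0.999, min(0.999, r))  # numerical guard so sqrt stays real
    out = []
    for _ in range(n_synthetic):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1                                    # marginal of x
        y = my + sy * (r * z1 + math.sqrt(1 - r * r) * z2)  # correlated y
        out.append((x, y))
    return out

# e.g. synthesize(real_age_income_pairs, n_synthetic=10_000)
```

Every sampled pair is freshly generated from the fitted distribution, so the synthetic set reproduces the aggregate statistics an analyst cares about while containing no original record.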

For financial institutions and other businesses keen to embrace the potential of AI, this represents an enormous opportunity. Without privacy hoops to jump through, you have the freedom to start experimenting with synthetic data, uncovering trends, patterns, and insights from your business. 

You can make invaluable predictions about customer behavior and emerging market conditions. You can use the power of machine learning and predictive analytics to reduce costs, maximize resources, and develop lucrative new products. 

Now’s your chance to race ahead and corner the market while your competitors are still scratching their heads, trying to figure out what to do about their data privacy problem.

What Next? Want to see for yourself how Datomize does synthetic data? Please schedule a demo here.