
Want to Innovate With AI? Start with the ABCs of Data Strategy

Roy Yogev

You couldn’t write a bestselling book before you’d grasped the alphabet – and you can’t build a model to revolutionize your industry until you’ve tackled the fundamental ABCs (and Ds and Es) of data strategy. Before you embark on any machine learning project, you need to make sure your plans are:

Agile

Backed Up by the Right Skills

Crisis-Proof

Disclosure Risk-Free

Expandable

Let’s take a closer look at each of these in turn.


Agile

How quickly can you respond to emerging opportunities? If you saw a potentially lucrative gap in the market, could you mobilize your team to develop, test and deploy a machine learning model without delay? Or would it take you months just to get clearance to use your own datasets?

If your data is locked up in silos, or access is restricted by GDPR and other privacy regulations, you won’t be able to move fast on a new project, with the flexibility you need to innovate. If you can’t get sign-off to collaborate internally (or externally) on data projects, you’ll lack the freedom to experiment with exciting new ideas, technologies and partnerships. If you’re obliged to keep valuable data locked down on-premises, rather than stored and shared in the cloud, this limits the range of applications and approaches you can use.

To mine the potential of AI, your data strategy has to address these issues early on. You need to focus on making your data available and your processes as agile and streamlined as possible. For example, using synthetic test data rather than original datasets means you can access, use and share data to train and test machine learning models without the same limitations, paperwork or delays.

Backed Up by the Right Skills

Before you can steam ahead with any AI strategy, you need to establish the IT infrastructure to support complex data pipelines. This can be a logistical nightmare – especially if you’re dealing with high-volume, unstructured datasets, spread across a variety of databases. 

Rolling out top-notch technologies is only the beginning. Long-term, you’ll need to maintain, manage and upgrade them, handling connections with other systems and understanding exactly where any security vulnerabilities might be. 

There’s a lot that can go wrong. Privacy and security breaches. Failures. Unplanned downtime. As tech issues arise, they’ll have to be dealt with quickly, with the right expertise, to prevent them from slowing down AI development and testing.

If you don’t have the right skills to do this in-house, you have two options. You can bridge the gap by hiring highly qualified, highly sought-after (and highly paid) IT professionals, or you can automate and outsource as much as possible to external vendors. If you choose the latter, you need to make sure you’re not breaking any rules about storing sensitive data in the cloud or moving it to servers overseas.


Crisis-Proof

Once you build these data pipelines and processes, your team will quickly come to rely on them. Your customers and partners will also depend on any machine learning-driven products, tools and apps you offer. This makes it absolutely vital that your platforms, technologies, processes and data flows are all reliable, robust and utterly resilient. You need to be certain that underlying systems are stable and won’t be overloaded. You need to plan ahead to avoid bottlenecks. You need to ensure that your strategy guarantees an uninterrupted supply of data.

What happens if rules on fair use of data change in the future? What if regulations are tightened to reduce the time you can hold on to historical data? How will this affect your data use or data generation for machine learning? You must tackle these questions before you start.

Disclosure Risk-Free

Privacy breaches are a nightmare for any organization, but if your company handles sensitive data, they’re an absolute disaster. Reputational damage amounts to a serious crisis for banks and financial institutions that depend on customer trust. The direct costs of addressing the leak, beefing up security, paying compensation and so on can quickly spiral, too. If you’ve fallen foul of privacy regulations in your handling of the data, you could incur tens of millions of dollars in fines – or even face a class-action lawsuit. 

All of which makes it imperative that you do everything you can to remove the risk of a privacy breach. Many Privacy-Enhancing Technologies (PETs) exist, but anonymization and data masking alone aren’t enough: reverse engineering and re-identification are a constant threat, even with relatively sophisticated techniques. A more effective approach is synthetic data generation, which involves creating a new, artificial dataset that’s statistically indistinguishable from the original but doesn’t link back to any real people, eliminating the disclosure risk.
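To make the idea concrete, here’s a toy sketch (in Python with NumPy – not any particular vendor’s generator) of the simplest possible tabular synthesizer: fit the mean and covariance of the original numeric data, then sample a brand-new dataset with the same statistical shape but no real records. Production-grade generators use far richer models (copulas, GANs and the like) to capture non-linear structure; the dataset and column names below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Stand-in for a sensitive original dataset: 1,000 customers with
# three numeric attributes (age, income, account balance).
original = rng.multivariate_normal(
    mean=[40, 55_000, 12_000],
    cov=[[100, 20_000, 5_000],
         [20_000, 4e8, 1e7],
         [5_000, 1e7, 9e6]],
    size=1_000,
)

# "Fit" the generative model: just the empirical mean and covariance.
mu = original.mean(axis=0)
sigma = np.cov(original, rowvar=False)

# Sample a fresh, artificial dataset. It preserves the means and
# pairwise correlations, but contains no row from the original data.
synthetic = rng.multivariate_normal(mu, sigma, size=1_000)

print("original mean: ", original.mean(axis=0).round(1))
print("synthetic mean:", synthetic.mean(axis=0).round(1))
```

Because the synthetic rows are drawn from a fitted distribution rather than copied or masked, there is no one-to-one mapping back to a real customer – which is exactly what defeats re-identification attacks.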


Expandable

Last but not least: is your strategy scalable? Machine learning requires swathes of accurate, relevant, recent data to train and test models. How will you keep up with this demand, especially if privacy issues prevent you from using production data?

This translates directly into technology choices. Sure, you could protect your data with cutting-edge PETs like homomorphic encryption, Secure Multi-Party Computation (which splits data into shares held by several different parties) or Federated Machine Learning. But while these offer excellent security, they make it very difficult to actually use that data for AI, especially at scale. Anything designed to obscure your data also makes it harder to understand and model, which undermines quality and nuance. As you expand, these problems become more pronounced.
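To see why such techniques complicate everyday data use, here’s a toy sketch of additive secret sharing, the building block behind many Secure Multi-Party Computation protocols: a sensitive value is split into random shares, one per party, and no single share reveals anything on its own. (Illustrative only – real SMPC frameworks also handle fixed-point encoding, multiplication and malicious parties.)

```python
import secrets

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def split(value: int, parties: int) -> list[int]:
    """Split `value` into `parties` additive shares modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(parties - 1)]
    last = (value - sum(shares)) % PRIME  # forces the shares to sum to value
    return shares + [last]

def reconstruct(shares: list[int]) -> int:
    """Recover the secret by summing all shares modulo PRIME."""
    return sum(shares) % PRIME

salary = 87_500
shares = split(salary, parties=3)
assert reconstruct(shares) == salary  # all 3 shares together recover it

# Shares can even be added party-by-party, so three parties can compute
# a sum without any of them ever seeing the raw values.
bonus = 4_200
combined = [(a + b) % PRIME for a, b in zip(shares, split(bonus, 3))]
assert reconstruct(combined) == salary + bonus
```

Notice the trade-off the article describes: every analysis now has to be re-expressed as operations on shares, which is exactly why training and testing models at scale on data protected this way is so much harder than on a plain (or synthetic) dataset.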

Final Thoughts

A successful data strategy for AI innovation relies on careful planning and on anticipating pain points and risks before they arise. You’ll need workable, watertight solutions that can grow with you, rather than storing up problems for later.

Far better to take more time at the start choosing the best technologies and setting up the right pipelines than to storm ahead with the wrong approach because, say, you don’t yet know how to generate synthetic data. Once you’re entrenched in a less-than-perfect system, it’s tricky to pivot. Dealing with the fundamentals now opens up scope for innovative, competitive initiatives later on.