“AI is going to be extremely beneficial and already is, to the field of cybersecurity. It’s also going to be beneficial to criminals.”
Dmitri Alperovitch, Co-Founder of CrowdStrike
The Bottom Line:
There’s a war going on in AI… and I don’t mean a Terminator-style rise of the machines.
It’s the war between privacy and progress.
AI demands data. So much data. Machine learning models and neural networks need a ton of information to train, test, and deploy. And not just any data. They need nuanced, accurate, quality data. Data with detail and context.
But using and sharing these kinds of datasets poses an enormous threat to user privacy. In the wake of multiple breaches and abuses of this privacy, regulators across the world have clamped down on what organizations can do with the data they hold, how long they can keep it, and where they can store it.
This is causing massive headaches for AI developers. Too often, regulations are seen as strangleholds on innovation. As an existential threat to the future of AI. Or as rules to be challenged and circumvented as best you can.
But it doesn’t have to be that way. Privacy and progress don’t need to be at war.
With the right technology, you can offer better-than-ever privacy protection while making your AI workflows better, faster, and more cost-effective than ever.
Under the Hood:
Ever noticed how much more accurate and responsive the Google Maps app is than the Apple Maps app on your smartphone? That’s because Google collects and keeps far more location data on users – once they’ve given their permission, of course. The tech giant has few qualms about collecting, collating, and analyzing information from everywhere you access its services, from Chrome to Google Search to YouTube to Maps to the Assistant app.
We all know that Google Maps takes more of our data than Apple. But millions of us keep using it because it’s a better product.
For the AI industry, this underpins the fundamental problem with privacy protection. No matter how firmly you believe that individual privacy rights must be enshrined, you need a lot of high quality, accurate, nuanced, carefully contextualized data to do your job. Machine learning-backed predictive models and deep learning neural networks need training data.
From our cars to our phones, our kitchens to our workplaces, AI is everywhere. The industry is growing year on year, and is expected to be worth $126 billion by 2025. But as data privacy regulations get more stringent all the time, many in the industry fear how these privacy rules will affect their future.
AI and the Threat to Privacy
Data privacy has long been a thorn in the side of AI projects. Anonymization alone simply isn’t enough to safeguard identities, as multiple experiments have shown. Combining datasets makes it all too easy to re-identify individuals, undermining their right to privacy and potentially putting them at greater risk of identity theft and other crimes.
Machine learning typically requires data from myriad sources to be brought together and fed into the model, driving intelligent, well-informed decision-making. But as you increase your ability to analyze huge, diverse, unstructured datasets from multiple sources, you increase the privacy risk, too.
Moving data from different sources into one centralized, sprawling Big Data repository makes it easier for a bad actor to reassemble scattered data points into identifiable individuals. Linking records between sources can inadvertently disclose a person’s behavior patterns. And combining non-personal data with anonymized personal data can leak confidential information.
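To make the linkage risk concrete, here is a toy sketch of how an “anonymized” dataset can be re-identified by joining it with a public one on shared quasi-identifiers. All names and records below are invented for illustration:

```python
import pandas as pd

# "Anonymized" medical records: names removed, but quasi-identifiers
# (ZIP code, birth year, gender) remain.
medical = pd.DataFrame({
    "zip": ["90210", "10001", "60614"],
    "birth_year": [1980, 1975, 1990],
    "gender": ["F", "M", "F"],
    "diagnosis": ["diabetes", "hypertension", "asthma"],
})

# A separate public dataset (e.g. a voter roll) that carries names
# alongside the same quasi-identifiers.
voters = pd.DataFrame({
    "name": ["Alice Smith", "Bob Jones", "Carol White"],
    "zip": ["90210", "10001", "60614"],
    "birth_year": [1980, 1975, 1990],
    "gender": ["F", "M", "F"],
})

# Joining on the shared quasi-identifiers re-attaches names to diagnoses.
reidentified = medical.merge(voters, on=["zip", "birth_year", "gender"])
print(reidentified[["name", "diagnosis"]])
```

Research has famously shown that a large share of the U.S. population can be uniquely identified from just these three attributes, which is why stripping names alone offers so little protection.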
These challenges lead many companies to take a cautious approach. Rather than embracing predictive data analytics and steaming ahead with ambitious AI projects, they leave their data siloed and dormant. A Deloitte study found that 56% of AI-adopters worry about privacy breaches, pinpointing this fear as the primary barrier to AI initiatives. The specter of violating regulations makes many organizations wary of working with third-party vendors, too, limiting opportunities for collaboration and growth.
Privacy Regulations Around the World
To help protect people from re-identification and breaches, privacy advocates in Europe and North America have successfully lobbied for stricter data privacy regulations. This protects user rights and will (hopefully) thwart cybercriminals, but it also creates more red tape for AI developers hoping to create better, data-rich products.
The two most important and far-reaching sets of regulations are the European Union’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Both are designed to limit commercial use of Big Data, prohibit the exposure of personal data without user consent, and define consumer rights when it comes to their confidential and personal data. Companies that break the rules receive hefty fines.
- General Data Protection Regulation (GDPR)
GDPR demands that companies secure the consent of users and website visitors to use their data. Consent must be granted for each specific purpose. Although this is EU legislation, it applies to every company or site that captures data from people in Europe. In practice, you need to comply no matter where in the world you are, or you won’t be able to process the data of anyone in the EU.
GDPR also applies to user data you collected in the past: you need demonstrable consent on record to keep using it. Companies that couldn’t prove this lost the ability to use thousands or even millions of records overnight.
You can imagine the disruption legislation like this causes for AI developers and machine learning projects that rely on historical data. It’s a clear warning against leaning too heavily on data assets that contain personally identifying information.
- California Consumer Privacy Act (CCPA)
CCPA is similar to GDPR in that users can choose to refuse sites and platforms the right to collect and use their data. They can also request that their data is removed from any existing datasets.
Like GDPR, the CCPA is only legally binding within its own jurisdiction, i.e. California. In practice, though, unless you want the hassle (and risk) of excluding California residents, or of siloing your data and handling it differently depending on geolocation, any company collecting data from U.S. consumers effectively needs to comply.
Privacy-Enhancing Technology (PET) & Synthetic Data Generation
These challenges have led to a proliferation of PETs, designed to encrypt or perturb data to protect privacy. You can learn more about these here.
The trouble is that few PETs are flexible enough to meet the needs of AI developers. To maintain the integrity and structure of the encrypted data, they typically limit the complexity of mathematical operations you can perform.
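As a rough illustration of that trade-off, here is a minimal differential-privacy-style sketch using Laplace noise (a classic perturbation mechanism, not a representation of any particular PET product; the salary figures are invented). Perturbation works well for simple aggregates but degrades quickly as operations get more complex:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical column of sensitive values (e.g. salaries).
salaries = np.array([52_000, 61_000, 48_000, 75_000, 58_000], dtype=float)

def noisy_sum(values, sensitivity, epsilon, rng):
    """Release a sum perturbed with Laplace noise, the classic
    differential-privacy mechanism. Larger epsilon means less noise
    and weaker privacy."""
    scale = sensitivity / epsilon
    return values.sum() + rng.laplace(0.0, scale)

# A single aggregate query works well...
print(noisy_sum(salaries, sensitivity=100_000, epsilon=1.0, rng=rng))

# ...but every additional query spends privacy budget, and row-level
# operations (feature engineering, joins, model training) are far
# harder to support under this kind of scheme.
```

The design constraint is the point: the mechanism only protects privacy for a narrow class of queries, which is exactly the inflexibility that frustrates AI developers.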
A more AI-friendly alternative that’s rapidly gaining traction is synthetic data. Here, an entirely new, artificial dataset is generated from the underlying real-world production data, preserving key statistical properties such as distributions and correlations.
Because no synthetic record maps directly back to a real person, the dataset contains no personally identifiable information. This makes synthetic data a low-risk way to safeguard privacy. Developers can use the data freely, confident that they won’t breach privacy regulations.
They can share the data with internal and external partners, pooling knowledge and collaborating on projects. They can move data across borders and host it on whatever platform or server they like.
This removes a major barrier to AI innovation at a stroke. But synthetic data goes further. Alongside privacy protection, it fosters an AI-friendly development environment by:
- Creating a readily available source of data for testing, QA, and development of models and AI applications.
- Addressing data scarcity issues that some sectors struggle with, either because the costs and privacy risks of data collection are high, or because it’s such a new industry.
- Producing large volumes of data at high speed and at a far lower cost than acquiring real-world information. Developers can fail fast and move forward quickly with promising projects and accelerate time to market.
- Ensuring that the training data you generate is of the highest quality: properly, accurately, and consistently labeled. This is the kind of training data you need for systems driven by sophisticated AI, like deep learning neural networks.
- Generating Big Data-sized synthetic datasets from modestly-sized customer datasets, making this a highly scalable approach.
- Allowing you to tweak the parameters of synthetic data. Unlike with real-world data, you can model future or hypothetical scenarios rather than only looking to the past – potentially far more useful to AI projects than stale historical data. You can also design datasets guaranteed to integrate with one another, avoiding bottlenecks in your development pipeline.
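The core idea behind the points above can be sketched in a few lines. This is deliberately simplistic (production synthetic-data tools use far more sophisticated generative models, and every value here is invented): fit summary statistics to a small dataset, then sample a much larger one that preserves the original distribution without reproducing any real record:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Toy "production" dataset: 200 customers, two correlated numeric
# features (say, age and annual spend). All values are invented.
age = rng.normal(40, 10, size=200)
spend = 50 * age + rng.normal(0, 300, size=200)
real = np.column_stack([age, spend])

# Fit simple summary statistics: mean vector and covariance matrix.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample a far larger synthetic dataset from the fitted distribution.
# No synthetic row corresponds to any real customer, but the overall
# distribution and the age/spend correlation are preserved.
synthetic = rng.multivariate_normal(mean, cov, size=10_000)

real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(round(real_corr, 2), round(synth_corr, 2))
```

Note how 200 records become 10,000, and how tweaking `mean` or `cov` before sampling lets you model hypothetical scenarios the real data never contained.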
Privacy regulation is there for good reason. People have a right to privacy – and technology companies must do their utmost to honor this, rather than railing against it. Embracing a state-of-the-art PET like synthetic generation means you don’t have to choose between privacy and AI innovation; in fact, it fosters both at once. The future of AI depends on it.
Want to see for yourself what synthetic data generation can do for you? Schedule your demo here >