
The new generation of privacy preserving technologies

Rory Potter, Data Scientist, Capgemini Engineering
13 Jun 2022

The foundation of intelligent products is personal data – how do we use that without compromising privacy?

Intelligent products and services face a trade-off between capability and privacy.

To be truly intelligent, these products and services need user data on a vast scale to build and train sophisticated AI.

Health devices capture data on heart rate, diet, genetics, or medical history. Smart energy devices collect data about in-home activities. Assisted driving systems need data on where you are, how you drive, and whether you have (inadvertently of course) exceeded legal speed limits.

This data can be sensitive.

To protect this data, it is encrypted. The problem comes when it needs to be decrypted to train the models that underpin intelligent products. This creates the possibility that private data is revealed to the people working on the model, and the risk that unencrypted data is lost or stolen.

Keeping data safe is all the more important in the age of AI. With today’s mathematical capabilities, even anonymized datasets can be reverse engineered to identify individuals and draw inferences about their private lives (as Netflix once found).

The data sharing trade-off: more predictive power = more privacy risk

As data and AI skills permeate organisations, it becomes advantageous to share data more widely. The more data that experts can access – and the greater the diversity of people with access to data – the more value data can potentially bring.

But we may not want to risk too many people seeing personally identifiable data, or we may not have permission to share it (old clinical trial results may only have consent to be shared with the research team).

We may also want to share it beyond our organisation. Medical companies may want to pool data on disease responses, or utilities companies on energy usage patterns, so they can all develop more predictive tools. But they want to do this without compromising customers' privacy or giving away IP.

This trade-off is often presented as ‘we can do data science or retain data privacy, but not both’. But this is a scale rather than a binary choice. Small, focused teams can work safely on unencrypted data. The more that data is shared, the greater the potential benefit, and the greater the risk.

The new privacy-preserving tools of the trade

Rules around protecting privacy are strict, and users will get upset if you abuse their trust.

DeepMind fell foul of this in 2017. Its Streams app, built on NHS data, predicted the risk of acute kidney injury. But in building the app, un-anonymised medical records of 1.6 million patients were shared with DeepMind without a proper legal basis. This life-saving tool was eventually discontinued because privacy had not been adequately considered when personal data was used to build an intelligent AI product.

So, companies need ways to use private data to build intelligent products while still protecting user privacy. Fortunately, a new range of tools is rising to the occasion. We'll take a quick look at them before discussing ways forward.

  • Federated Learning: This allows a model to be trained on data held across many different devices or servers, so the model learns without the data ever leaving those devices or being copied. It can be thought of as 'sharing the model, not the data': a global model learns from the local ones (a minimal sketch follows this list).
  • Secure Multiparty Computation: This enables multiple parties to compute on data they don't want to share with each other. Encrypted shares of the data are distributed among an agreed set of parties, who can then work on a dataset made up of every party's private data without ever seeing the raw values.
  • Homomorphic encryption: This allows data to be processed while it remains encrypted. For example, it would make it possible to find data on people with arthritis in a wearables dataset, run calculations on it, and create a useful model based on group-level insights without ever decrypting personal records. Homomorphic encryption is gaining popularity, and it is hoped that one day almost all computation will be done on encrypted data.
  • Trusted Execution Environments: These are hardware features that create a secure area on a device which can execute certain approved functions in isolation; our smartphones use them for biometric authentication. They could be set up to run AI models on private data without anyone having access to that data.
  • Differential privacy: Even if modelers never see the raw data, it may still be possible for bad actors to reverse engineer a model's outputs to reveal personal identities. Differential privacy helps overcome this (and also helps maintain anonymity more generally). It adds random noise to the data, which corrupts individual data points but preserves properties of the overall dataset. Because the modeler knows the type of randomness, they can still construct an accurate group-level picture that is reliably predictive, but anyone who steals the data has no idea whether any individual record is accurate (a second sketch follows this list).
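
To make 'sharing the model, not the data' concrete, here is a minimal sketch of federated averaging in Python. The linear model, the simulated devices and the training parameters are illustrative assumptions, not a production framework:

```python
import numpy as np

# Hypothetical sketch of federated averaging: each device trains on its own
# private data, and only the model weights are shared and averaged.
rng = np.random.default_rng(0)

def local_update(global_weights, X, y, lr=0.1, epochs=5):
    """One device's local training step (plain gradient descent on a linear
    model). The raw data (X, y) never leaves this device; only the updated
    weights are returned."""
    w = global_weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # least-squares gradient
        w -= lr * grad
    return w

# Three simulated devices, each holding private data the server never sees
true_w = np.array([1.5, -2.0, 0.5])
devices = []
for _ in range(3):
    X = rng.normal(size=(100, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    devices.append((X, y))

# The central server only ever aggregates weights, never raw records
global_w = np.zeros(3)
for _ in range(20):
    local_weights = [local_update(global_w, X, y) for X, y in devices]
    global_w = np.mean(local_weights, axis=0)

print("True weights:      ", true_w)
print("Federated estimate:", np.round(global_w, 3))
```

In a real deployment, a federated learning framework also handles communication, dropped devices and secure aggregation of the updates; the underlying principle of averaging locally trained models is the same.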

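Differential privacy, in its simplest form, is calibrated noise addition, so it is also easy to sketch. Below is a minimal, hypothetical example of the Laplace mechanism in Python; the heart-rate data, the query and the epsilon values are invented purely for illustration:

```python
import numpy as np

# Hypothetical sketch of the Laplace mechanism for differential privacy.
rng = np.random.default_rng(42)

def dp_count(values, threshold, epsilon):
    """Differentially private count of records above a threshold.
    Adding or removing one person changes the true count by at most 1
    (sensitivity = 1), so Laplace noise with scale 1/epsilon suffices."""
    true_count = np.sum(values > threshold)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Simulated sensitive attribute, e.g. resting heart rates from a wearables fleet
heart_rates = rng.normal(loc=70, scale=10, size=10_000)

print("True count above 90 bpm:", int(np.sum(heart_rates > 90)))
for eps in (0.1, 1.0):
    # Smaller epsilon -> more noise -> stronger privacy, lower accuracy
    print(f"DP count (epsilon={eps}):", round(float(dp_count(heart_rates, 90, eps)), 1))
```

The group-level answer stays useful, but the released figure reveals very little about whether any particular individual was above or below the threshold.
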
What do privacy-preserving technologies look like in practice?

These are not just academic concepts; these technologies are starting to be used in serious real-world applications.

MELLODDY is a consortium of life sciences companies using federated learning to share drug discovery data. By learning from each other's data without exposing it, all participants can boost the predictive performance of their drug discovery models, helping them identify compounds for drug development. The consortium uses a central platform that contains the machine learning algorithms and incorporates a privacy management system for data sharing.

The latest US census was released with differential privacy, in order to protect individuals from being identified while making aggregated population data available. And the UN PETS (privacy-enhancing technologies) Lab is testing a range of the above technologies to enable national statistics offices, researchers and companies to collaborate on shared data.

Making privacy-preserving technologies work

Nonetheless, the path is not entirely smooth. Privacy-preserving technologies come with trade-offs. Where modelers don't see the data, they need to send models back to the data owner to run, which slows the process. Techniques like homomorphic encryption are computationally intensive. And obscuring data with differential privacy costs accuracy in some use cases.

No technique is a silver bullet. Preserving privacy will need layers of these technologies, and careful thought about the right balance for your use case.

And, as with all data projects, good models need good underlying data. For privacy-preserving technologies to work, the data owner needs to apply good data management practices. Since some modelers won't be able to see the data, it is all the more important that it is curated well enough to handle queries from people who never inspect it directly.

Finally, it is critical to note that privacy-preserving technologies should not be an add-on but a fundamental part of design. Any process that needs to share private data should take a privacy-first approach. Start by thinking about the privacy implications of the data behind the product, and bake in the right tools from the start so that you can get the insight you need whilst preserving users' privacy.

When deployed from the start – with the right bedrock of data management and agreements – privacy-preserving technologies can help convince customers to share data, and help you navigate the trade-off between respecting privacy and maximizing access to useful data.

The point of all of this is to do more with the data we collect. Broader, deeper and more representative data allows us to build more accurate, generalizable and useful models that underpin intelligent and personalised products and services. Doing this will be hugely valuable, but doing it means protecting and respecting the privacy of those who share their data with us.


How we can help

We have experience using privacy-enhancing technologies, including differential privacy, federated learning, and homomorphic encryption – all of which are hard to implement well. We keep a close eye on future developments, so we are ready to deploy advances as they arrive.

To address the challenges discussed in this article, we can help deploy these technologies in ways that allow private data to be safely shared within or beyond your organisation, so that advanced modelling can be performed whilst preserving privacy. We can also help ensure the underpinning data sources and data management practices are of high quality, making it viable to share your data using privacy-preserving technologies.