Safeguarding Data Privacy in Distributed Learning: Mitigating Risks and Harnessing Homomorphic Encryption
by Martin Zuber, Cryptography Researcher, CEA
Distributed Learning: In machine learning, distributed learning, also known as federated learning or collaborative learning, plays a crucial role in advancing the capabilities of artificial intelligence. By combining the efforts of multiple machine learning models trained across distributed devices or workers, it offers several important benefits. It addresses privacy concerns by training on decentralized data that never leaves its owner's premises. It promotes scalability and efficiency by sharing the computational burden across participants. And it enhances robustness and generalization by diversifying the training data.
All of these benefits apply to the health-domain use case of ENCRYPT, in which multiple hospitals want to collaborate to train a machine learning model on their combined data without compromising patient privacy. The steps are as follows (as illustrated in the figure):
1. Data Preparation: Each hospital prepares its local dataset by anonymizing and removing any personally identifiable information (PII) to protect patient privacy. The data should be in a standardized format suitable for model training.
2. Local Model Training: Each hospital independently trains a local machine learning model using its own dataset. This training can be performed using various algorithms and techniques depending on the specific task or problem being addressed.
3. Model Updates: Periodically, each hospital sends updates of its locally trained model to the aggregator. These updates typically consist of model parameters, such as weights and biases, rather than the raw data itself. This helps preserve data privacy, as sensitive patient information remains on the hospital’s premises.
4. Aggregation: The aggregator receives the model updates from each hospital and combines them to create a global model. Various techniques can be used for aggregation, such as federated averaging, where the aggregator calculates the average of the model parameters received from different hospitals.
5. Model Distribution: The updated global model is then distributed back to the participating hospitals. Each hospital receives the updated model, which now incorporates knowledge from the collective training performed across all hospitals.
6. Local Model Improvement: Upon receiving the updated global model, each hospital incorporates the new knowledge by further fine-tuning or retraining its local model using its own data. This process helps refine the model’s performance and adapt it to the specific characteristics of each hospital’s dataset.
7. Iterative Process: Steps 3 to 6 are repeated iteratively, allowing hospitals to continually refine and improve their local models by leveraging the collective knowledge of the distributed learning process. The aggregator serves as a central coordinating entity, facilitating the exchange of model updates and promoting collaboration among the participating hospitals.
By following this distributed learning approach, hospitals can collectively build a robust and generalized machine learning model while maintaining data privacy and security.
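The iterative loop above can be sketched in a few lines of code. The following is a minimal, self-contained federated averaging simulation using NumPy; the three toy "hospital" datasets, the linear model, and the function names are illustrative assumptions, not the ENCRYPT implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for three hospitals' private datasets (assumption:
# each hospital holds features X and labels y for the same task).
true_w = np.array([2.0, -1.0])
hospitals = []
for _ in range(3):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    hospitals.append((X, y))

def local_update(w, X, y, lr=0.1, epochs=5):
    """Steps 2 and 6: a hospital refines the model on its own data."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # MSE gradient
        w -= lr * grad
    return w

def federated_average(updates):
    """Step 4: the aggregator averages the received parameters."""
    return np.mean(updates, axis=0)

# Steps 3 to 6, iterated: train locally, send only the parameters
# (never the raw data), aggregate, and redistribute the global model.
global_w = np.zeros(2)
for _ in range(10):
    updates = [local_update(global_w, X, y) for X, y in hospitals]
    global_w = federated_average(updates)

print(global_w)  # converges close to the true weights [2.0, -1.0]
```

Note that only the parameter vectors cross the hospital boundary; the raw patient records stay inside each `local_update` call, which is exactly the privacy property the steps above describe.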
Risk of an Honest-But-Curious Aggregation Server: While distributed learning with an aggregator offers clear benefits, it is important to consider the risks posed by an honest-but-curious aggregator. Such an aggregator does not intend to misuse or leak the data, but it can still inspect the model updates sent by the hospitals and infer sensitive information from them. Depending on the structure and content of those updates, the aggregator might gain insights into patterns, trends, or characteristics of a hospital's data, potentially breaching patient privacy or revealing sensitive information.
The Role of Homomorphic Encryption: To mitigate the risks associated with an honest-but-curious aggregator, homomorphic encryption can be employed. Homomorphic encryption allows computation to be performed directly on encrypted data, without decryption, thus preserving data privacy. Each hospital encrypts its model update before sending it to the aggregator; the aggregator then performs the aggregation on the encrypted updates without ever accessing the underlying parameters. Because the aggregator only ever handles ciphertexts, it is computationally infeasible for it to recover sensitive patient information. Homomorphic encryption thus provides a secure solution for distributed learning, allowing hospitals to collaborate on model training while maintaining data privacy and confidentiality.
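To make the idea concrete, here is a toy sketch of encrypted aggregation using the additively homomorphic Paillier cryptosystem, implemented from scratch. This is a pedagogical illustration under loud assumptions: the hardcoded primes are tiny and insecure (real deployments use keys of roughly 2048 bits and a vetted library), and the scalar integer "weights" stand in for quantized model parameters, since Paillier operates on integers modulo n.

```python
from math import gcd

# --- Toy Paillier keypair (insecure demo primes, for illustration only) ---
p, q = 293, 433
n = p * q                 # public modulus
n2 = n * n
g = n + 1                 # standard choice of generator
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lcm(p-1, q-1), private
mu = pow(lam, -1, n)      # lambda^-1 mod n (valid for g = n+1), private

def encrypt(m, r):
    """Enc(m) = g^m * r^n mod n^2, with r coprime to n."""
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    """Dec(c) = L(c^lambda mod n^2) * mu mod n, where L(u) = (u-1)//n."""
    u = pow(c, lam, n2)
    return ((u - 1) // n) * mu % n

# Each hospital encrypts its (integer-quantized) model update.
updates = [12, 7, 20]     # toy scalar stand-ins for model parameters
cts = [encrypt(m, r) for m, r in zip(updates, [5, 11, 17])]

# The aggregator multiplies the ciphertexts: by Paillier's additive
# homomorphism, this sums the plaintexts without ever decrypting them.
agg = 1
for c in cts:
    agg = (agg * c) % n2

# Only a key holder can open the aggregate; the individual updates
# remain hidden from the aggregator throughout.
total = decrypt(agg)
print(total)  # 39 == 12 + 7 + 20
```

In a real deployment the aggregator would hold no decryption key at all; the private values lam and mu would live with the hospitals (or be secret-shared among them), so the server computes on ciphertexts it can never open.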