Record Linkage in Healthcare Research

Background

Record linkage is the process of identifying and connecting records that refer to the same entity, such as a patient, across different databases. In health research, this process is essential for building a comprehensive understanding of patient health trajectories. By linking data from different sources such as hospitals and research institutions, researchers can compare long-term health trends, disease progression and treatment outcomes.

Problem

Record linkage becomes particularly challenging when data is distributed across multiple healthcare institutions or countries, especially in countries such as Germany where data protection laws are notably strict. In Germany, health data is considered highly sensitive, and its use or transfer requires a clear legal basis and/or explicit patient consent. The legal framework is governed by a complex interplay of national laws, state-specific regulations and the General Data Protection Regulation (GDPR), all of which impose strict requirements on the collection, processing and sharing of personal data. Additionally, Germany lacks a universal patient identifier, which increases the risk of linkage errors and makes it difficult to harmonize pseudonymization practices across institutions.

Motivation

The challenges facing record linkage in Germany and other regions highlight the urgent need for solutions that balance data privacy with research utility. There is growing recognition among researchers, policymakers and data custodians that harmonized legal frameworks, standardized technical infrastructures and privacy-preserving technologies are essential for unlocking the full potential of health data.

Initiatives such as the German Medical Informatics Initiative (MII) and NFDI4Health are working towards developing federated infrastructures, metadata standards and secure pseudonymization methods. The ENCRYPT project complements these efforts with privacy-preserving data processing technologies.

Protecting Privacy while Linking Data

To tackle this challenge, we developed a secure solution within the ENCRYPT platform that protects patient privacy at every step. Our approach uses a Trusted Execution Environment (TEE), a privacy-preserving technology for secure computation, together with a custom application designed to operate entirely within this secure environment.

What Is a Trusted Execution Environment (TEE)?

A Trusted Execution Environment, or TEE, is a secure area within a computer’s processor that is isolated from the rest of the system. It allows sensitive data to be processed in a way that prevents access by unauthorized users, even if they have control over the system itself. TEEs offer a strong guarantee that both the computation and the data remain confidential throughout processing, which helps meet legal and ethical standards for handling healthcare data.

One of the key stages of working with TEEs is a procedure called attestation. Attestation is a mechanism that confirms the integrity and confidentiality of the TEE before any sensitive operation takes place. It assures data providers and users that the environment has not been tampered with and can be trusted to handle confidential information.
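
As a simplified illustration of the idea, the sketch below shows a data provider comparing a reported code measurement with an expected value before releasing any data. Real attestation relies on hardware-signed reports that are verified against the processor vendor's keys; the function names, the use of a plain SHA-256 digest as the measurement and the sample byte strings are assumptions made for illustration and do not describe the ENCRYPT platform's implementation.

```python
import hashlib
import hmac


def measure(code: bytes) -> str:
    """Return a SHA-256 digest standing in for a TEE code measurement (assumption)."""
    return hashlib.sha256(code).hexdigest()


def is_trusted(reported: str, expected: str) -> bool:
    """Accept the environment only if the reported measurement matches the expected one."""
    # compare_digest performs a constant-time comparison of the two hex strings
    return hmac.compare_digest(reported, expected)


# Hypothetical application image that the data provider has reviewed and approved
approved_code = b"record-linkage-app v1"
expected_measurement = measure(approved_code)

# An untampered environment reports the approved measurement; a modified one does not
untampered_ok = is_trusted(measure(approved_code), expected_measurement)
tampered_ok = is_trusted(measure(b"record-linkage-app v1 (modified)"), expected_measurement)
print(untampered_ok, tampered_ok)
```

In this sketch, data would only be sent when the check succeeds, which mirrors the role attestation plays before any sensitive operation starts.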

Our Secure Application: An Overview

We developed a Python-based application composed of three main components:

  1. Wrapper (Controller): This component is responsible for setting up the secure environment, configuring the TEE and coordinating the different stages of the application.
  2. Server: This module runs inside the TEE and securely receives data records from multiple sources.
  3. Main Application: Once all data has been received, this component performs record linkage by identifying and consolidating duplicate records that refer to the same individual.

Each of these components plays a critical role in maintaining security, privacy, and functionality throughout the process.

Step by Step: Secure Processing Inside the TEE

  • Initialization and Attestation: When the application starts, the Wrapper configures the Trusted Execution Environment and deploys secure containers. It also performs attestation to ensure that the environment is both confidential and trustworthy. Once verified, the Wrapper launches the secure server inside the TEE.
  • Secure Data Collection: The Server, operating entirely within the TEE enclave, receives sensitive health records from various sources, such as different hospitals, and stores them in a protected area. This design ensures that no data is exposed to the untrusted environment at any point.
  • Privacy-Preserving Record Linkage: Once all data has been collected, the main application is executed within the TEE. It performs record linkage to detect when the same patient appears in multiple datasets. Because Germany lacks a universal patient identifier, our logic assumes that a match occurs when records have the same name, gender, date of birth and health insurance provider, but have different medical encounter dates. We also assume that a patient does not change their name, gender, or insurance provider across institutions.
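
The matching rule described in the last step can be sketched as follows. This is a minimal illustration, not the application's actual code: the `Record` fields, the function names and the sample values are hypothetical, and exact string equality is assumed for every compared field.

```python
from collections import defaultdict
from typing import NamedTuple


class Record(NamedTuple):
    """Hypothetical flat view of one patient record from one source."""
    source: str
    record_id: str
    name: str
    gender: str
    birth_date: str
    insurer: str
    encounter_date: str


def match_key(r: Record) -> tuple:
    """Fields that must agree for two records to count as the same patient."""
    return (r.name, r.gender, r.birth_date, r.insurer)


def find_matches(records: list[Record]) -> list[list[Record]]:
    """Group records by match key and keep groups with differing encounter dates."""
    groups = defaultdict(list)
    for r in records:
        groups[match_key(r)].append(r)
    return [g for g in groups.values()
            if len({r.encounter_date for r in g}) > 1]


# Made-up sample data from two hypothetical sources
records = [
    Record("hospital_a", "a1", "Erika Muster", "female", "1980-04-12", "Insurer X", "2021-03-01"),
    Record("hospital_b", "b7", "Erika Muster", "female", "1980-04-12", "Insurer X", "2022-09-15"),
    Record("hospital_b", "b8", "Max Beispiel", "male", "1975-11-30", "Insurer Y", "2022-09-15"),
]

matches = find_matches(records)
for group in matches:
    print([r.record_id for r in group])
```

Because the rule relies on exact agreement, it inherits the assumption stated above that name, gender and insurance provider stay the same across institutions.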

When a match is found, the application merges the records into a single unified profile. It also updates all associated references (such as treatments and diagnoses) so they correctly point to the merged patient identity. The final output mirrors the original data structure (HL7 FHIR), but with consolidated, de-duplicated entries that reflect complete individual journeys across multiple data sources.
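
The merge and reference update can be pictured with the following sketch, which works on FHIR-like resources represented as plain Python dictionaries. It is an illustration under assumptions: the function name `merge_patients`, the choice to keep the first matched patient as the surviving identity and the sample resources are all hypothetical, and only `subject` references are rewritten.

```python
def merge_patients(resources: list[dict], duplicate_ids: list[str]) -> list[dict]:
    """Keep the first patient in duplicate_ids and repoint references from the others."""
    canonical = duplicate_ids[0]
    alias_refs = {f"Patient/{pid}" for pid in duplicate_ids[1:]}
    alias_ids = set(duplicate_ids[1:])

    merged = []
    for res in resources:
        # Drop the duplicate Patient resources; the canonical one is kept
        if res.get("resourceType") == "Patient" and res.get("id") in alias_ids:
            continue
        subject = res.get("subject")
        if subject and subject.get("reference") in alias_refs:
            # Copy instead of mutating, so the input resources stay unchanged
            res = {**res, "subject": {**subject, "reference": f"Patient/{canonical}"}}
        merged.append(res)
    return merged


# Made-up resources: two Patient entries for the same person plus their medication data
resources = [
    {"resourceType": "Patient", "id": "p1"},
    {"resourceType": "Patient", "id": "p2"},
    {"resourceType": "MedicationStatement", "id": "m1", "subject": {"reference": "Patient/p1"}},
    {"resourceType": "MedicationStatement", "id": "m2", "subject": {"reference": "Patient/p2"}},
]

result = merge_patients(resources, ["p1", "p2"])
for res in result:
    print(res["resourceType"], res["id"], res.get("subject", {}).get("reference"))
```

The output keeps the same resource types as the input, with one patient entry remaining and both medication entries pointing to it.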

Testing Privacy-Preserving Record Linkage with Synthetic Data

To test the privacy-preserving record linkage capabilities, we used a synthetic dataset made available by the MII consortium. This dataset simulates patient medication histories and demographic information, allowing us to replicate a real-world data scenario while complying with the privacy requirements for cross-institutional healthcare research.

A Unified and Privacy-Preserving Dataset

At the end of the process, our system produces a new dataset that preserves the original structure while providing a more comprehensive and accurate view of each individual’s record history across different data sources.

All operations, from secure data ingestion to record linkage and generation of the final dataset, take place entirely within the TEE. This keeps sensitive data protected at every step.

By integrating advanced privacy-preserving technologies such as TEEs, the ENCRYPT platform enables collaborative medical research across institutions without compromising individual privacy or regulatory compliance. This use case demonstrates the platform’s potential to effectively address the complex legal and ethical challenges of record linkage in highly regulated contexts like Germany, where health data is subject to strict protection under the General Data Protection Regulation (GDPR), national laws and state-specific requirements.