ENCRYPT Blog Series #5: Data Preprocessing

Data Preprocessing: A Vital Step in Privacy-Preserving Technologies
by Angelos Papoutsis, Data Science/Cybersecurity Researcher, Centre for Research & Technology Hellas

In today’s digital world, data privacy has become a major concern for individuals, organizations, and governments. The growing volume of sensitive data that is collected, processed, and shared has made it imperative to develop privacy-preserving technologies (PETs). However, the effectiveness of PETs depends heavily on the quality of the data they are applied to. This is where data preprocessing comes into the spotlight.

Data preprocessing refers to the process of cleaning, transforming, and preparing raw data before applying any analytical methods or PETs. It is a crucial step that can significantly impact the accuracy, efficiency, and privacy of the obtained results. The aim of this blog post is to highlight the importance of data preprocessing in general and its relevance to PETs in particular.

Importance of Data Preprocessing
Data preprocessing is an essential step in data analysis because it improves the quality of the data and makes it more suitable for analysis. The key reasons can be summarised as follows:
1. Improving Data Quality: Raw data often contains errors, inconsistencies, and missing values, which can affect the accuracy of the results. Data preprocessing techniques such as data cleaning, imputation, and normalization can assist in identifying and correcting such errors, thereby improving the quality of the data.
2. Enhancing Performance: Preprocessing can also reduce the computational requirements of analytical methods. For instance, feature selection identifies the most informative features in the data, so the analysis runs on fewer inputs and becomes more efficient.
3. Facilitating Interpretation: Preprocessing can also assist in making the data more interpretable by transforming it into a more suitable format. For example, feature scaling can be used to standardize the range of values of different features, making it easier to compare them.
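
The three points above can be illustrated with a minimal sketch, assuming a tiny in-memory table stored as a dictionary of columns. The function names (impute_mean, min_max_scale, select_by_variance), the column names, and the values are hypothetical and chosen only for illustration; a real pipeline would typically rely on a dedicated data library. The sketch fills missing values with the column mean, rescales every column to the range [0, 1], and drops columns that carry no variation.

```python
from statistics import mean, pvariance


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def min_max_scale(column):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def select_by_variance(table, threshold):
    """Keep only the columns whose population variance exceeds the threshold."""
    return {name: col for name, col in table.items() if pvariance(col) > threshold}


# Hypothetical raw data with missing values and one uninformative column.
raw = {
    "age": [20, None, 40, 60],
    "income": [1000, 3000, None, 5000],
    "constant_flag": [1, 1, 1, 1],
}

cleaned = {name: impute_mean(col) for name, col in raw.items()}    # data quality
scaled = {name: min_max_scale(col) for name, col in cleaned.items()}  # comparability
selected = select_by_variance(scaled, threshold=0.0)               # fewer features

print(sorted(selected))
```

Mean imputation and a variance threshold are deliberately simple choices here; which imputation and selection rules are appropriate depends on the dataset and the analysis that follows.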

Importance of Data Preprocessing for PETs
PETs are designed to protect sensitive data, for example by adding noise to it, obfuscating it, or encrypting it before it is shared or analysed. However, the effectiveness of these techniques depends heavily on the quality of the data being used. The following are some of the key reasons why data preprocessing is important for PETs:
1. Ensuring Privacy: In some cases, correlations within the data can reveal relationships that weaken the protection offered by Differential Privacy (DP). Preprocessing techniques such as feature selection can help to reduce the amount of correlated data, thereby enhancing privacy.
2. Reducing Execution Time: In the case of Homomorphic Encryption (HE), the execution time depends heavily on the amount of data to be encrypted. Preprocessing techniques such as feature selection or data sampling can reduce the amount of data to be encrypted, thereby reducing the execution time.
3. Enhancing Accuracy: Preprocessing can also help to improve the accuracy of the results obtained from PETs. For example, feature engineering techniques can be used to extract representative features for each domain, making it easier to identify patterns and anomalies in the data.
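
As a rough sketch of the first two points, the snippet below drops features that are almost perfectly correlated with a feature already kept, and then samples a fraction of the rows. It does not implement DP or HE; it only shows the kind of reduction that could happen before such a technique is applied. The function names, the 0.95 threshold, the sampling fraction, and the example data are all assumptions made for illustration, and removing correlated features does not by itself guarantee any privacy level.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / (sx * sy)


def drop_correlated(table, threshold=0.95):
    """Greedily keep a column only if it is not strongly correlated
    (in absolute value) with any column kept so far."""
    kept = {}
    for name, col in table.items():
        if all(abs(pearson(col, other)) < threshold for other in kept.values()):
            kept[name] = col
    return kept


def sample_rows(table, fraction, seed=0):
    """Return a random subset of rows, using the same row indices for every column."""
    n = len(next(iter(table.values())))
    k = max(1, round(n * fraction))
    rng = random.Random(seed)
    idx = sorted(rng.sample(range(n), k))
    return {name: [col[i] for i in idx] for name, col in table.items()}


# Hypothetical data: height_in is just height_cm in another unit.
table = {
    "height_cm": [150, 160, 170, 180, 190, 200],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8, 78.7],
    "visits": [3, 1, 4, 1, 5, 9],
}

reduced = drop_correlated(table)              # fewer correlated features
sampled = sample_rows(reduced, fraction=0.5)  # fewer rows to encrypt

print(list(reduced), len(sampled["visits"]))
```

The greedy pass keeps whichever correlated column appears first, so the column order matters; a production pipeline would need a more deliberate rule for choosing which feature to retain.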

The ENCRYPT project will leverage different preprocessing techniques, such as data cleaning and sampling, to transform raw data into a format suitable for data analysis. Feature engineering techniques, such as statistical procedures, will also be implemented to extract representative features for each domain. Finally, techniques for learning representations that decrease the mutual information between the raw data and their representation will be utilized. These can include data transformations that help to protect sensitive information from unauthorized access and disclosure.
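
To make the idea of decreasing mutual information more concrete, the sketch below estimates the empirical mutual information, in bits, between a list of raw values and a representation of them. The representation used here is simple bucketing of ages into decades, which is an assumption chosen for illustration and not the representation-learning method of the ENCRYPT project; the function names and the data are hypothetical as well.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def to_decade(age):
    """Coarse representation: keep only the decade of an age."""
    return (age // 10) * 10


# Hypothetical raw values, all distinct.
ages = [23, 27, 31, 35, 42, 48, 55, 61]
decades = [to_decade(a) for a in ages]

mi_identity = mutual_information(ages, ages)    # representation reveals everything
mi_decades = mutual_information(ages, decades)  # coarser representation reveals less

print(mi_identity, mi_decades)
```

On this toy data the identity representation retains 3 bits about the raw values, while the decade buckets retain 2.25 bits. A coarser representation leaks less about the raw data, usually at the cost of some utility.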

Conclusion
Data preprocessing is a vital step in the data analysis pipeline, and its importance cannot be overstated. It enhances the quality of the data, reduces computational requirements, and facilitates interpretation. In the context of PETs, data preprocessing becomes even more important, as it can assist in enhancing privacy, reducing execution time, and improving accuracy. Therefore, it is essential to pay close attention to data preprocessing when designing and implementing PETs. During the ENCRYPT project, different preprocessing steps will be followed to ensure the proper application of the different PETs.