Pseudonymisation of sensitive data in the cloud – Privacy by design
Sensitive data, for example in the health sector, must be specially protected in the cloud. You can do this by pseudonymising or anonymising the data – let me show you how.
by Lukas Fuchs
The fact that the public cloud has many advantages is no longer in question. However, especially in areas with sensitive data, you might still feel uneasy when your data treasures leave the hermetically sealed network boundaries of the traditional on-premises world for the cloud. You might ask yourself:
- Who has access to which sensitive data in the cloud and from where?
- What information could be read out and misused to what extent if someone gains access to the data in the cloud?
In this article, I will not only answer these questions but also show which procedures can be used to address these concerns. But let's take it one step at a time. To make things easier to understand, let us imagine the following analogy:
I am a shepherd (the responsible party) who produces wool (product). I know that the best wool (added value) is produced when I bring the flock of sheep (data) from the village (on-premises) to the alp (cloud), because there the meadows are lush and green (As a Service) and regrow quickly (scaling, costs). But I am aware: on the road (data in motion) and on site in the big enclosure (big data, data at rest), wolves (hackers) threaten my sheep. How can I protect my sheep on the road and on site? Correct: with a hunter (security expert). However, he is neither awake around the clock nor always 100% accurate. I have an idea: I dye the wool of my sheep green like blueberry bushes, so that they become uninteresting for the wild animals (anonymisation) but can still profit from the lush pasture, and I can safely increase my wool production. There are, however, special sheep that produce merino wool. As this merino wool has to be traceable back to the sheep through my transparent supply chain for my customers, I give these sheep unique call names, which only they respond to (pseudonymisation) and which the wolves do not understand. I keep the list of names, which I need to resolve them for my customers, securely in the barn (key vault), where the wolves have no access.
This analogy shows different use cases with regard to data protection which must be conceptually differentiated depending on the scenario. In the following graphic, the difference between pseudonymisation and anonymisation is again clarified using the example of the health sector:
Usually, a lot of the data in operational systems is personal data. For a smooth and efficient process flow, the required information must be freely available to the authorised persons (need-to-know principle). For example, at shift handover a doctor must be able to see in the patient documentation all the required information on the patient's condition and development during the last shift.
If this data is now exported from the operational core systems to the cloud, for example for analytical purposes, it is still personal data and must be protected. Depending on the use case, the data engineers or data scientists working with the data do not need any personally identifying information for their work. They focus on the algorithms and data relationships and try to generate new insights from them. This is where anonymisation and pseudonymisation come into play. If the person-identifying data is obscured when it leaves the operational systems, the data can be worked with at lower risk, and the questions about possible misuse and attack vectors are reduced to the usual security measures. In the case of anonymisation, all identifying data in the data records (including combinations of fields that would identify a person together) is made unrecognisable or removed. The remaining user data (i.e. the data set without personally identifying data) can still be used, for example, to create overarching trend analyses.
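The anonymisation step described above can be sketched roughly as follows. This is a minimal illustration, not the procedure used in any particular project: the record layout and the field names (`patient_name`, `insurance_number`, `address`, `birth_date`) are assumptions made for the example. Direct identifiers are dropped, and the exact birth date, which could identify a person in combination with other fields, is coarsened into an age band.

```python
from datetime import date

# Hypothetical field names; a real export would define its own list.
DIRECT_IDENTIFIERS = {"patient_name", "insurance_number", "address"}


def age_band(birth_date: date, today: date, width: int = 5) -> str:
    """Coarsen an exact birth date into an age band such as '10-14'."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    age = today.year - birth_date.year - (0 if had_birthday else 1)
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def anonymise(record: dict, today: date) -> dict:
    """Return a copy without direct identifiers and with a coarsened age."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    birth_date = cleaned.pop("birth_date", None)
    if birth_date is not None:
        cleaned["age_band"] = age_band(birth_date, today)
    return cleaned


record = {
    "patient_name": "Jane Example",
    "insurance_number": "000-000",
    "birth_date": date(2012, 6, 1),
    "respiratory_rate": 22,
}
print(anonymise(record, today=date(2024, 1, 15)))
```

The measurement itself survives unchanged, so trend analyses across many records remain possible, but nothing in the result points back to a single person.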
Use cases for working with anonymised data in the cloud are, in addition to healthcare, for example:
- Pattern recognition
- Trend detection
- Anomaly detection
- Data marketplace (internal, B2B)
- Statistical analyses
Depending on the use case, however, data must be protected during transmission and storage and still be resolvable to an individual for person-specific analyses, because patients start from different initial situations. A doctor, for example, must be sure that the data enriched with new AI-generated insights actually belongs to patient X and not just to any patient. The pseudonymisation concept is inherently somewhat more complex than anonymisation, since the identifying information must be safely recoverable at a later point in time and cannot simply be removed.
From the field: The AtemReich children's home is a home for children who depend on the help of machines to breathe. To support the medical specialists, the data streams generated by the machines are recorded and stored in the cloud. An AI identifies anomalies there or creates long-term analyses which would not be possible without the storage in the cloud. However, the children all have different initial conditions, medical diagnoses and treatments. In addition, the age of the young patients, ranging from a few days to 18 years, plays an important role. For this reason, specialists must be able to assign the data sets to a specific patient and identify their origin without any doubt. Especially when, for example, ventilation specialists or cardiologists want to carry out analyses on the condition and development of a specific patient, these data must not be "confused".
Possible areas of application for work with pseudonymised data in the cloud are typically individual-based analyses in the field of:
- Patient data (healthcare industry, e.g. treatment optimisation)
- Customer data (service industry, e.g. next best action, fraud detection)
- Citizen data (public sector, e.g. detecting tax evasion)
As shown in the example of the AtemReich children's home, Trivadis – Part of Accenture uses pseudonymised patient data to identify anomalies in ventilation based on patient-specific models. To ensure that the data is secured according to the "privacy by design" principle, the link between the health data and the patient is disguised "in motion" (i.e. when the data is transmitted) and "at rest" (i.e. when it is stored). In the analysis process, the data can be clearly assigned to one individual whose identity remains unknown, and in the event of a data compromise no conclusion can be drawn about the person behind it. However, since the data and knowledge consumers need to see recognisable names (patient names) in the reports at the end of the process in order to avoid mix-ups, the identifier-to-name assignments are stored securely in a specially protected, separate area in the cloud so that they can be resolved again. No personal data is stored in the reports themselves. The secure mapping table, which holds this data for resolution in the report, is accessed at report runtime via a configured access authorisation.
Below, I first give an introductory overview and then explain the process step by step.
Overview of the main components and actors in the end-to-end process:
- The protected mapping information is located in the "pseudonymisation zone" (cloud).
- In the analytical zone (cloud), user data is analysed and new algorithms are tested to generate new insights and forecasts.
- In the usage zone (cloud / on-prem), the results are displayed with the pseudonyms resolved.
Step-by-step procedure of pseudonymisation and its resolution
Steps 1 and 2: Create pseudonym
Identifiers to be obfuscated (such as personal names) are registered in the pseudonymisation mapping table (e.g. in a key vault or a database table), and the obfuscated value assigned to each of them is returned. This replacement could be built directly into the source system in a dedicated outbound zone, but usually the source systems are left untouched as far as possible. Instead, a small automated "helper" process is used (see the following detailed implementation example). This could be a PowerShell script, for example, which goes through locally cached, freshly exported raw data and replaces the sensitive values accordingly before they leave the "operative zone".
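Such a helper process might look roughly like the following sketch. It is written in Python rather than PowerShell and rests on several assumptions: the `MappingTable` class is a hypothetical in-memory stand-in for the protected mapping table (in practice a key vault or database table), the `P-` prefix and the field name `patient_name` are invented for the example, and pseudonyms are generated as random tokens.

```python
import secrets


class MappingTable:
    """In-memory stand-in for the protected mapping table (hypothetical)."""

    def __init__(self) -> None:
        self._by_identifier: dict[str, str] = {}
        self._by_pseudonym: dict[str, str] = {}

    def pseudonym_for(self, identifier: str) -> str:
        """Return the existing pseudonym or create a new random one."""
        if identifier not in self._by_identifier:
            pseudonym = "P-" + secrets.token_hex(8)
            while pseudonym in self._by_pseudonym:  # guard against a rare collision
                pseudonym = "P-" + secrets.token_hex(8)
            self._by_identifier[identifier] = pseudonym
            self._by_pseudonym[pseudonym] = identifier
        return self._by_identifier[identifier]

    def resolve(self, pseudonym: str) -> str:
        """Map a pseudonym back to the original identifier."""
        return self._by_pseudonym[pseudonym]


def pseudonymise_rows(rows, table, field="patient_name"):
    """Replace the sensitive field in each exported row; the input stays unchanged."""
    return [{**row, field: table.pseudonym_for(row[field])} for row in rows]


table = MappingTable()
export = [
    {"patient_name": "Jane Example", "respiratory_rate": 22},
    {"patient_name": "John Sample", "respiratory_rate": 31},
    {"patient_name": "Jane Example", "respiratory_rate": 24},
]
outbound = pseudonymise_rows(export, table)
print(outbound)
```

Random tokens are used here instead of a hash of the name because a plain hash of a low-entropy value such as a name can be reversed by simply hashing candidate names. The same person always receives the same pseudonym, so patient-specific analyses remain possible downstream.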
Detailed implementation example
In this detailed example, a person master data system is used as the system responsible for the personal information.
Step 3: Transfer pseudonymised user data
The pseudonymised identifiers and the user data are transferred to the cloud via a secure channel (e.g. HTTPS or a VPN).
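As a rough illustration of the "secure channel" requirement, the following sketch prepares an upload with Python's standard library and refuses any target that is not HTTPS. The endpoint URL, the function name `build_upload_request` and the payload layout are assumptions for the example; the actual sending is left commented out because the endpoint does not exist.

```python
import json
import ssl
import urllib.request
from urllib.parse import urlparse


def build_upload_request(url: str, rows: list) -> urllib.request.Request:
    """Prepare a JSON POST request and refuse any non-HTTPS target."""
    if urlparse(url).scheme != "https":
        raise ValueError("pseudonymised data may only be sent over HTTPS")
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Default context: certificate and host name verification are enabled.
tls_context = ssl.create_default_context()

request = build_upload_request(
    "https://ingest.example.invalid/ventilation",  # hypothetical endpoint
    [{"patient_name": "P-3f9a1c", "respiratory_rate": 22}],
)

# Sending is left out here because the endpoint above does not exist:
# with urllib.request.urlopen(request, context=tls_context) as response:
#     print(response.status)
```

Only pseudonyms and measurements appear in the request body. The mapping back to real names never travels over this channel.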
Steps 4 and 5: Data analysis and knowledge enrichment
The transmitted and stored data can be analysed by specialists or processes without it being possible to draw conclusions about individuals.
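To illustrate how an analysis can work on pseudonyms alone, here is a deliberately simple sketch of a patient-specific anomaly check. It is not the model used in the project described above: the z-score rule, the threshold of 2.0 and the sample readings are assumptions for the example. Each reading is compared only with the baseline of the same pseudonym.

```python
from collections import defaultdict
from statistics import mean, pstdev


def find_anomalies(readings, threshold=2.0):
    """Flag readings that deviate strongly from the same pseudonym's own baseline."""
    by_patient = defaultdict(list)
    for pseudonym, value in readings:
        by_patient[pseudonym].append(value)

    anomalies = []
    for pseudonym, values in by_patient.items():
        centre, spread = mean(values), pstdev(values)
        if spread == 0:
            continue  # constant series: nothing to flag
        for value in values:
            if abs(value - centre) / spread > threshold:
                anomalies.append((pseudonym, value))
    return anomalies


# Made-up respiratory rates, keyed by pseudonym only.
readings = [
    ("P-01", 20), ("P-01", 21), ("P-01", 19), ("P-01", 20),
    ("P-01", 21), ("P-01", 20), ("P-01", 35),
    ("P-02", 40), ("P-02", 41), ("P-02", 39), ("P-02", 40),
]
print(find_anomalies(readings))
```

A value of 35 is flagged for `P-01` although it would be unremarkable for `P-02`, which is why the records must stay attributable to one specific (if unnamed) patient.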
Step 6 onwards: Result (= processed data and findings) and use
At the end of the process, a user should be able to consume the data in undisguised form. For this purpose, the user data is read when the data is called up and, in parallel, the corresponding pseudonyms are resolved from the mapping table and displayed at runtime, using a key stored on the service.
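The resolution at report runtime could be sketched as follows. The role names, the `MAPPING` dictionary (a stand-in for the protected mapping table) and the function `render_report` are hypothetical, and the access authorisation is reduced to a simple role check for the sake of the example.

```python
ALLOWED_ROLES = {"physician", "ventilation_specialist"}  # hypothetical roles

# Stand-in for the protected mapping table in the pseudonymisation zone.
MAPPING = {"P-01": "Jane Example", "P-02": "John Sample"}


def render_report(findings, role, mapping=MAPPING):
    """Build report lines; names are resolved only for authorised roles."""
    may_resolve = role in ALLOWED_ROLES
    lines = []
    for pseudonym, finding in findings:
        # Unknown pseudonyms and unauthorised roles fall back to the pseudonym.
        label = mapping.get(pseudonym, pseudonym) if may_resolve else pseudonym
        lines.append(f"{label}: {finding}")
    return lines


findings = [("P-01", "respiratory rate anomaly"), ("P-02", "no anomalies")]

print(render_report(findings, role="physician"))
print(render_report(findings, role="data_scientist"))
```

The stored findings contain pseudonyms only. Names appear solely in the rendered output, and only for a caller whose role permits resolution.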
What happens in the event of an attack?
If the cloud data storage itself (data at rest) or the transmission to the data storage (data in motion) is attacked, the sensitive data is, together with the other usual security measures, protected in the best possible way: without the separately stored mapping table, it cannot be traced back to an individual person.
This pseudonymisation concept shows in practice how the "privacy by design" principle ensures that sensitive information cannot be traced back once it leaves its source. In addition, the applied "need to know" principle per role ensures that, from a "person identification perspective", each user group along the workflow only has access to exactly the information it needs for its job – and without any disadvantages for the work.