Recommendations for secure anonymisation
Should data be anonymised, pseudonymised or left identifiable?
At an early stage, the data controller must decide whether the personal data to be processed should be anonymised, de-identified or left identifiable. This choice will affect what the organisation must do in relation to the Personal Data Act when it processes the data concerned. If anonymisation is chosen, any further processing will, as previously mentioned, fall outside the scope of the Personal Data Act.
Describe the purpose of anonymisation
It is important to be clear about the purpose of anonymisation. The way in which the data set will be used plays a crucial role in determining the risk of re-identification. If the data are to be published online, this represents a bigger risk than restricted release of the data for research purposes or for use in one’s own organisation, for example. Even though restricted release is easier to control and assess, this too is not without risk.
Several factors must be taken into account when data are to be anonymised. For example:
- What type of data is to be anonymised?
- What control mechanisms are linked to the data set? What security precautions will limit access to the data?
- How big is the data set? (What quantitative properties does it have?)
- Which other publicly accessible information resources exist, which could potentially be used to deduce information about individuals in a data set that is to be anonymised?
- Will the data set be released to third parties? (If so, will access be limited, or will it be made publicly accessible online, for example?)
- Might unauthorised third parties wish to stage a targeted attack on the data in an attempt to identify individual people? (In this kind of risk assessment, the sensitivity and type of the data are particularly important.)
The AOL case is a typical example of mistaken pseudonymisation.
In 2006, AOL published a database containing 20 million search queries from over 650,000 users. The only thing AOL had done to protect the privacy of its users was to replace each AOL username with a serial number. As a result, some of the people were identified and located.
Pseudonymised search strings for search engines have an extremely high identification rate, particularly if they are linked to other attributes like IP addresses or other client configuration parameters.
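The weakness of this kind of pseudonymisation can be illustrated with a minimal sketch. The log entries, usernames and function names below are invented for illustration and are not taken from the AOL data. The point is only that a stable serial number keeps all of one person's queries together, so the content of the queries can still point to that person.

```python
from collections import defaultdict
from itertools import count


def pseudonymise(log):
    """Replace each username with a serial number (one stable number per user)."""
    serials = {}
    counter = count(1)
    published = []
    for user, query in log:
        if user not in serials:
            serials[user] = next(counter)
        published.append((serials[user], query))
    return published


def profiles(published_log):
    """Group queries by serial number, as anyone holding the published log could."""
    grouped = defaultdict(list)
    for serial, query in published_log:
        grouped[serial].append(query)
    return dict(grouped)


# Entirely made-up log entries, for illustration only.
log = [
    ("alice01", "dentist in smallville"),
    ("bob77", "weather oslo"),
    ("alice01", "homes for sale elm street smallville"),
    ("alice01", "jane doe family tree"),
]

published = pseudonymise(log)
by_serial = profiles(published)

# The usernames are gone, but serial 1 still carries a profile that
# combines a town, a street and a family name.
for serial, queries in by_serial.items():
    print(serial, queries)
```

The usernames no longer appear in the published log, yet the three queries grouped under one serial number together narrow the search to a very small number of people.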
Perform a risk assessment both before and after publication
It is almost impossible to assess the risk of re-identification with absolute certainty. Obtaining a complete overview of all the other information that is “out there”, who has access to it and how it might be used in a re-identification attempt is extremely challenging. Such “other information” could be information that is accessible to another organisation or particular person, or it may be generally available online.
Since you can never be entirely certain about which data are available at any given time, or which data will be made accessible at some point in the future, it is necessary to perform as thorough a risk assessment as possible, early in the anonymisation process.
Furthermore, it is important that the data controller also remembers to consider the risks surrounding a data set after it has been published. Due to the risk of re-identification, the data controller should regularly investigate whether new risk factors have cropped up, and re-evaluate the risk factors already identified. It is also important to assess whether control of the identified risk factors is adequate or needs to be adjusted.
If someone should succeed in re-identifying the data, and this results in personal data being processed, the organisation responsible for the data must assume the role of data controller for them, in accordance with the Personal Data Act.
Use of the re-identification test
One useful test for examining the risk of re-identification is known as the “motivated intruder test”. It involves testing whether the data can be re-identified if an intruder attempts to do so.
The motivated intruder should be regarded as a person or organisation that, with no prior knowledge, attempts to identify individuals in an anonymised data set. Even though the intruder has no prior knowledge, they should be considered reasonably competent. In other words, they have access to resources such as the internet, public databases and libraries. It must also be presumed that they will be able to use various investigative techniques to communicate with people who know the identities of individuals in the data set. However, the motivated intruder should not be deemed to have specialist knowledge, such as expertise in data hacking, or access to specialist equipment to force access to data that is securely stored.
Some types of data are obviously more attractive for a motivated intruder than others. There may be many reasons why someone would attempt to re-identify individuals in an anonymised data set. These include:
- Disclosure of newsworthy information about public figures.
- Political or social activism, e.g. as part of a campaign against a particular organisation or person.
- Inquisitiveness – a desire to find out who is involved in a local planning application, for example, or simply to see whether it is possible to deduce information linked to an individual from the data set.
However, this does not mean that data which, at first glance, appear to be “ordinary”, “innocuous” or without value can be published without a careful assessment of the threat associated with re-identification.
The motivated intruder test is useful because it sets a threshold for the risk of re-identification that is higher than what an ordinary member of the public without special expertise could manage, but lower than what an expert with access to specialist competence, analytical resources and prior knowledge could achieve. It is good practice to perform a motivated intruder test as part of any risk assessment. In practice, a motivated intruder test could, for example, comprise:
- Performing an online search to find out whether a combination of birthdate and postcode could be used to reveal an individual’s identity.
- Searching the archives of a national or local newspaper to see whether it is possible to link a victim’s name to data about where crimes have been committed.
- Using social media communities to see whether it is possible to link anonymised data to a user’s profile.
- Using data from various public databases to try to link anonymised data to someone’s identity.
Since access to available information and computing power is increasing all the time, it is important to carry out motivated intruder tests regularly to re-evaluate the risk of re-identification.
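The first check in the list above (whether a combination of birthdate and postcode singles someone out) can be approximated before publication by counting how many records share each combination. The following is a minimal sketch under stated assumptions: the field names `birthdate`, `postcode` and `diagnosis`, the sample records and the function name are all hypothetical.

```python
from collections import Counter


def unique_combinations(records, keys):
    """Return the records whose combination of `keys` occurs exactly once."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return [r for r in records if counts[tuple(r[k] for k in keys)] == 1]


# Made-up records, for illustration only.
records = [
    {"birthdate": "1980-02-14", "postcode": "0150", "diagnosis": "A"},
    {"birthdate": "1980-02-14", "postcode": "0150", "diagnosis": "B"},
    {"birthdate": "1975-09-30", "postcode": "5003", "diagnosis": "C"},
]

at_risk = unique_combinations(records, ["birthdate", "postcode"])
share_unique = len(at_risk) / len(records)

print(f"{len(at_risk)} of {len(records)} records are unique on birthdate and postcode")
```

A record that is unique on such a combination is a natural starting point for an intruder, because a single match in an external source is enough to attach a name to it. A low share of unique records does not by itself show that the data set is safe.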
Motivated intruder test – a checklist
- What is the risk of a so-called “jigsaw attack”, i.e. that various bits of information are put together to form a complete picture of one person? Are the data of such a nature as to enable them to be linked together? For example, is the same code used to refer to the same individual in several different data sets?
- What other type of “linkable” information exists that is easily or publicly accessible?
- What technical measures could an intruder use to re-identify the data?
- How much weight should be given to the personal knowledge that individuals may have about a data subject in an anonymised data set?
- If an intruder test has been performed, what weaknesses did it reveal?
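The first checklist question, about the same code being used for the same individual in several data sets, can be tested with a simple join. The sketch below is illustrative only: the two data sets, the field names and the `link_on_code` helper are assumptions, not part of any real release.

```python
def link_on_code(dataset_a, dataset_b, key="code"):
    """Join two data sets on a shared per-person code (a simple jigsaw link)."""
    index_b = {row[key]: row for row in dataset_b}
    linked = []
    for row in dataset_a:
        match = index_b.get(row[key])
        if match is not None:
            # Merge the attributes from both releases into one richer record.
            linked.append({**row, **match})
    return linked


# Two made-up releases that reuse the same code for the same person.
health = [
    {"code": "X17", "diagnosis": "asthma"},
    {"code": "Q02", "diagnosis": "diabetes"},
]
travel = [
    {"code": "X17", "postcode": "0150", "birth_year": 1980},
]

linked = link_on_code(health, travel)
print(linked)
```

Neither release contains a name, but the merged record combines a diagnosis with a postcode and a birth year. Using a different code for the same person in each release would prevent this particular join, although other shared attributes might still allow linking.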
Sources of information include public libraries, public databases (tax lists, the Brønnøysund Register Centre, the vehicle licensing database), the electronic public records system (OEP) in which correspondence to/from the public administration is listed, parish records, genealogy websites, social media, search engines, local and national media archives, and anonymised data published by other organisations.
Anonymisation techniques must be determined on a case-by-case basis
There are various anonymisation techniques, and they vary in robustness. The data controller must consider what guarantee against re-identification can be achieved through application of a specific technique, in light of three key risks associated with anonymisation:
- Is it still possible to distinguish one individual person in a data set?
- Is it possible to link together various data sets associated with one and the same person?
- Is it possible to deduce information associated with an individual person from the data set?
No single anonymisation technique fully meets the requirements for effective anonymisation. How a data set can best be anonymised must therefore be determined on a case-by-case basis, taking account of the purpose of anonymisation and the context involved. It is necessary to work deliberately and meticulously to determine which technique best suits a particular situation, and how several techniques might be combined to make the outcome even more robust.
|Technique|Is it still possible to distinguish one individual person in a data set?|Is it still possible to link together various data sets associated with one and the same person?|Is it still possible to deduce information associated with an individual person?|
|Addition of noise|Yes|Probably not|Probably not|
|Aggregation or K-anonymity|No|Yes|Yes|
|Differential privacy|Perhaps/probably not|Probably not|Probably not|
|Hashing (tokenisation)|Yes|Yes|Probably not|
Knowing the strengths and weaknesses of the various techniques makes it easier to decide how the data can be anonymised in the most secure way possible. These strengths and weaknesses are examined in detail in sections 3 and 4 of the Article 29 Working Party's opinion on anonymisation techniques.
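One weakness of hashing can be shown with a small sketch. When the original identifiers come from a small, predictable set, an unsalted hash can be reversed simply by hashing every candidate and comparing. The identifier format `customer-NNNN` and the function names are hypothetical. SHA-256 is used here only as an example of a standard hash function.

```python
import hashlib


def hash_id(identifier):
    """Return the unsalted SHA-256 hex digest of an identifier."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def dictionary_attack(hashed_value, candidates):
    """Hash each candidate and return the one that matches, if any."""
    for candidate in candidates:
        if hash_id(candidate) == hashed_value:
            return candidate
    return None


# A made-up identifier scheme with only 10,000 possible values.
published_token = hash_id("customer-0042")
candidates = (f"customer-{n:04d}" for n in range(10_000))

recovered = dictionary_attack(published_token, candidates)
print(recovered)
```

Because the same input always gives the same token, hashed identifiers also remain linkable across data sets. Adding a secret key or salt makes enumeration harder, but the result is still a pseudonym rather than anonymous data.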
The table above provides an overview of the strengths and weaknesses of the various models, as they relate to the fundamental criteria mentioned above. A solution that meets all three criteria will be resilient against identification. If a proposed solution does not meet one of the criteria, a thorough assessment of the identification risks relating to the data set should be carried out.
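As an illustration of the K-anonymity row, the sketch below measures k, the size of the smallest group of records sharing the same quasi-identifiers, before and after generalisation. It assumes hypothetical fields `age`, `postcode` and `diagnosis`. The generalisation rule (ten-year age bands, two-digit postcode prefix) is chosen purely for the example.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifiers."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def generalise(record):
    """Coarsen age to a ten-year band and postcode to its first two digits."""
    decade = record["age"] // 10 * 10
    return {
        **record,
        "age": f"{decade}-{decade + 9}",
        "postcode": record["postcode"][:2] + "**",
    }


# Made-up records, for illustration only.
records = [
    {"age": 34, "postcode": "0150", "diagnosis": "asthma"},
    {"age": 37, "postcode": "0182", "diagnosis": "asthma"},
    {"age": 52, "postcode": "5003", "diagnosis": "diabetes"},
    {"age": 58, "postcode": "5018", "diagnosis": "asthma"},
]

k_before = k_anonymity(records, ["age", "postcode"])
k_after = k_anonymity([generalise(r) for r in records], ["age", "postcode"])
print(k_before, k_after)
```

After generalisation no record is unique on age and postcode. Both records in the first group share the same diagnosis, however, so anyone who knows a person belongs to that group still learns the diagnosis. This is consistent with the "Yes" for deduction in the table.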
- Choose one or more anonymisation techniques on the basis of the context and purpose concerned.
- The optimal anonymisation solution must be chosen on a case-by-case basis.
- All techniques have their strengths and weaknesses, so use a combination to achieve the best possible degree of anonymisation.
- Risk assessments must be performed not only before the data are published, but also afterwards. New risks may emerge.
- Use tests to assess the probability of re-identification.
- When publishing anonymised data sets, the data controller should disclose which anonymisation technique or combination of techniques has been used.
- Easily identifiable (i.e. rare) attributes should be removed from the data set.
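The last recommendation can be sketched as a simple suppression step that replaces values occurring fewer than a chosen number of times. The threshold `min_count`, the placeholder `"other"`, the field name and the sample data are all assumptions made for illustration. A suitable threshold depends on the data set and its context.

```python
from collections import Counter


def suppress_rare(records, attribute, min_count, placeholder="other"):
    """Replace values of `attribute` seen fewer than `min_count` times."""
    counts = Counter(r[attribute] for r in records)
    return [
        {
            **r,
            attribute: r[attribute] if counts[r[attribute]] >= min_count else placeholder,
        }
        for r in records
    ]


# Made-up records, for illustration only.
records = [
    {"occupation": "teacher"},
    {"occupation": "teacher"},
    {"occupation": "nurse"},
    {"occupation": "nurse"},
    {"occupation": "lighthouse keeper"},
]

cleaned = suppress_rare(records, "occupation", min_count=2)
print(cleaned)
```

Note that the placeholder group can itself be small. If only one record ends up as "other", that record is still distinguishable, so the result should be checked again after suppression.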