Access to a vast pool of publicly accessible data, combined with the availability of ever more powerful analysis technology, has increased the risk of re-identification. Re-identification means that it is possible to identify specific individuals from what, at the outset, is presumed to be an anonymous data set. Studies have shown that by collating data from several sources, it is possible to re-identify a person from only two known attributes in an anonymised data set, such as postcode and birthdate.
Re-identification can occur when someone takes personal data they already have on an individual, and searches for hits in an anonymised data set, or when a hit in an anonymised data set is used to search for hits in publicly accessible data. Examples of such publicly accessible data include data from public records (eg tax lists, the Brønnøysund Register Centre, the vehicle registration database, the electronic public records system (OEP)), social media, local and national media archives and genealogy websites.
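The first of these two routes can be sketched in a few lines of Python. This is a minimal illustration, not a description of any real attack: the records, the field names and the `link` helper are all invented, and the sketch assumes the attacker holds a public register that contains names together with the same two attributes (postcode and birthdate) that were left in the "anonymised" data set.

```python
# Hypothetical "anonymised" data set: names removed, other attributes kept.
anonymised = [
    {"postcode": "0150", "birthdate": "1980-02-14", "diagnosis": "asthma"},
    {"postcode": "0150", "birthdate": "1975-07-01", "diagnosis": "diabetes"},
    {"postcode": "5003", "birthdate": "1980-02-14", "diagnosis": "migraine"},
]

# Hypothetical publicly accessible register that contains names.
public_register = [
    {"name": "Kari Nordmann", "postcode": "0150", "birthdate": "1980-02-14"},
    {"name": "Ola Nordmann", "postcode": "5003", "birthdate": "1962-11-30"},
]


def link(anonymised_rows, register_rows, keys=("postcode", "birthdate")):
    """Pair each registered person with an anonymised row when exactly one row matches."""
    matches = []
    for person in register_rows:
        hits = [row for row in anonymised_rows
                if all(row[k] == person[k] for k in keys)]
        if len(hits) == 1:  # a single hit re-identifies that row
            matches.append((person["name"], hits[0]))
    return matches


reidentified = link(anonymised, public_register)
for name, row in reidentified:
    print(name, "->", row["diagnosis"])
```

In this invented example one person is linked to a diagnosis even though the data set contains no names. The second route works the same way with the roles of the two data sets reversed.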
There have been several known cases of re-identification, the majority performed by researchers on real data sets. In most of these cases the data had been poorly anonymised to begin with. For example, the organisations had retained too many identifying elements in the data set or, when publishing the data, had not analysed which other accessible data sets existed “out there” that could be used to deduce information from their own data set.
Certain types of data are more difficult to render anonymous than others. This applies to location data and genetic data, for example. People’s patterns of movement are so unique that the semantic part of the location data – the places where the data subject has been at a certain point in time – can reveal a lot about the data subject, even without other known attribute values. This has been demonstrated in several academic studies (eg de Montjoye et al: «Unique in the Crowd: The privacy bounds of human mobility», Scientific Reports, 3, 1376 (2013)).
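To make the idea of uniqueness concrete, the toy sketch below counts how often a handful of known place-and-time points single out one trace. It is only loosely in the spirit of the study cited above and does not reproduce its method or data: the traces, the user labels and the `unique_fraction` helper are hypothetical.

```python
from itertools import combinations

# Hypothetical traces: each user is a set of (place, hour) points.
traces = {
    "u1": {("A", 8), ("B", 12), ("C", 18)},
    "u2": {("A", 8), ("B", 12), ("D", 18)},
    "u3": {("A", 8), ("E", 12), ("C", 18)},
}


def unique_fraction(traces, p):
    """Fraction of all p-point subsets of a trace that match that trace and no other."""
    total = 0
    unique = 0
    for user, points in traces.items():
        for subset in combinations(sorted(points), p):
            total += 1
            matching = [u for u, t in traces.items() if set(subset) <= t]
            if matching == [user]:
                unique += 1
    return unique / total


for p in (1, 2, 3):
    print(p, "known points:", round(unique_fraction(traces, p), 2))
```

Even in this tiny invented data set, the share of uniquely identifying combinations rises quickly as more points are known, which is why removing names alone does little to protect this kind of data.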
Genetic data profiles are another example of personal data that risks being re-identified if the only anonymisation technique used is to remove the data subject’s identity. This is because the genes are inherently unique. Studies have shown that the combination of publicly available genetic resources (such as genealogy databases, obituaries, search engine results) and metadata about DNA donors (donation time, age, address) can reveal certain people’s identity, even though they have provided the DNA sample “anonymously”.
Pseudonymised data is not synonymous with anonymised data, as pointed out in Chapter 2. Pseudonymisation does allow individuals to be identified. Data controllers who choose to pseudonymise data, rather than anonymise them, must be aware that the data will still be defined as personal data, and must therefore be processed in accordance with the Personal Data Act’s provisions.
There are several examples of data controllers believing that they have anonymised data while, in reality, they have merely pseudonymised them by, for example, replacing the subject’s name with a serial number.
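A minimal sketch of the serial-number approach shows why it is only pseudonymisation. The `pseudonymise` helper, the records and the field names are hypothetical; the point is simply that whoever holds the lookup table can reverse the replacement at any time.

```python
import itertools


def pseudonymise(records):
    """Replace each name with a serial number; return the new rows and the lookup table."""
    counter = itertools.count(1)
    lookup = {}  # serial number -> name, retained by the data controller
    rows = []
    for record in records:
        serial = next(counter)
        lookup[serial] = record["name"]
        row = {k: v for k, v in record.items() if k != "name"}
        row["id"] = serial
        rows.append(row)
    return rows, lookup


# Hypothetical source records.
records = [
    {"name": "Kari Nordmann", "postcode": "0150", "diagnosis": "asthma"},
    {"name": "Ola Nordmann", "postcode": "5003", "diagnosis": "migraine"},
]

rows, lookup = pseudonymise(records)
print(rows[1])                 # no name, only a serial number
print(lookup[rows[1]["id"]])   # the table gives the name back
```

The rows also keep attributes such as postcode, so even without the table they may still be open to the kind of linkage described earlier in this chapter.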
Encryption vs anonymisation
Encryption is used to protect confidentiality in a communication channel, and has an entirely different purpose to anonymisation. The purpose of anonymisation is to avoid individuals being identified.
Another widely held misconception is that encrypted or key-coded (hashed) data is the same as anonymised data. This misconception rests on the following two assumptions: a) that when an element in a database (eg name, address, birthdate) has been encrypted or replaced by a randomised code with the help of encryption technology (eg a hash function with a key), this element has been anonymised; and b) that anonymisation is more effective if the key’s length is correct and an advanced encryption algorithm has been used.
It is important to understand that the objectives of encryption and anonymisation are different. Encryption is a security method intended to protect confidentiality in a communication channel between two identified parties (people, entities or software programs) to prevent eavesdropping or unintended publication. The purpose of anonymisation, on the other hand, is to avoid individuals being identified by preventing hidden connections between attributes linked to the subject of the data.
Neither encryption nor key coding as such helps to make the data subject unidentifiable, since the original data are still accessible or can, at the very least, be deduced by the data controller. Being able to perform a semantic translation of the personal data, as in the case of key coding, does not eliminate the possibility of recreating the data in their original structure.
Advanced encryption can help to protect the data by making them incomprehensible to parties that do not have access to the encryption key, but this does not necessarily render the data anonymous. As long as the key to the original data is accessible (even if it is kept by a trusted third party), the possibility of the data subject being identified is not eliminated. Encrypted and key-coded data must therefore be deemed to constitute personal data, and must be treated as such.
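The point about accessible keys can be illustrated with a short sketch. It assumes, purely for illustration, that birthdates have been key-coded with HMAC-SHA256 from the Python standard library; the key, the `code` and `recover_birthdate` helpers and the date range are hypothetical. Because there are only a few tens of thousands of plausible birthdates, anyone holding the key can recompute the code for each candidate and find the original value.

```python
import hashlib
import hmac
from datetime import date, timedelta


def code(value: str, key: bytes) -> str:
    """Key-code a value with HMAC-SHA256 (hypothetical coding scheme)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def recover_birthdate(coded: str, key: bytes, first_year=1900, last_year=2025):
    """Try every date in the range and return the one whose code matches, if any."""
    day = date(first_year, 1, 1)
    end = date(last_year, 12, 31)
    while day <= end:
        if hmac.compare_digest(code(day.isoformat(), key), coded):
            return day.isoformat()
        day += timedelta(days=1)
    return None


key = b"held-by-the-data-controller"  # hypothetical key
coded = code("1980-02-14", key)

print(coded[:16], "...")              # looks random to an outsider
print(recover_birthdate(coded, key))  # the key holder recovers the birthdate
```

A longer key or a stronger algorithm does not change this outcome for whoever holds the key, which is why the second assumption described above is mistaken.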
Challenges in brief
- The danger of re-identification: When data from several sources are compared against each other, there is a risk that individuals in what are nominally anonymous data sets may be identified.
- Two known data points, eg postcode and birthdate, could be enough to re-identify individuals from a data set.
- Pseudonymous data must not be confused with anonymous data. Pseudonymous data are personal data, and their processing falls within the scope of the Personal Data Act.
- Encryption is not the same as anonymisation. The objective of encryption is to protect the data, not make them unidentifiable.
This chapter is based partly on the Article 29 Working Party's opinion on anonymisation techniques.