Finterai, exit report: Machine learning without data sharing

How can you learn from data you do not have? Can federated learning be the solution when data sharing is difficult? The Norwegian Data Protection Authority's sandbox has explored the challenges and benefits of federated learning, a machine learning methodology that is presumed to be privacy friendly, which the start-up enterprise Finterai wishes to use in the fight against money laundering and the financing of terrorism.

Summary

Finterai is a Norwegian start-up that wants to tackle a societal problem that has previously caused far larger organisations many a sleepless night, namely money laundering and the financing of terrorism. Banks are required to do what they can to prevent it, but struggle to do so in an effective manner.

The crux of the problem is that each individual bank has “too few” criminal transactions to provide a sufficiently reliable indication of what actually distinguishes a suspicious transaction from a run-of-the-mill one. As a result, the banks’ electronic surveillance systems flag far too many transactions (false positives), which trigger a time-consuming and costly manual follow-up investigation. The problem could potentially be resolved by building systems based on more data than currently exists. The challenge is that banks cannot share the necessary data among themselves since transactions contain personal data.

Can federated learning solve the conundrum?

Finterai wants to resolve this data-sharing problem by applying a relatively new method in the field of machine learning, called “federated learning”. Federated learning is a decentralised methodology within artificial intelligence (AI), which is deemed more privacy friendly than many other forms of machine learning. By using this method, banks can learn from each other without actually sharing data about their customers.

In the sandbox project, we have explored three questions that federated learning raises in relation to the data protection regulations. This has led to the three conclusions presented below.

Conclusions

  1. Processing responsibility: The banks themselves will always have a decisive influence over both the purpose of and the means for the processing activities discussed in this report and will therefore be the data controller. Finterai will probably not act as data controller for the activities, although a more detailed assessment of the legal basis and all the factual circumstances will be required before this can be conclusively determined. Finterai’s task will be to monitor the models’ vulnerabilities and ensure that they do not contain personal data. The company will therefore probably act as the banks’ data processor.
  2. Data minimisation: Each bank’s customer risk profile affects the requirements it must meet under the anti-money laundering regulations. This includes how much data must be collected on the bank’s customers. It can therefore be challenging to standardise the data categories that every bank must always have access to in order to participate in the federated learning process, while also upholding the principle of data minimisation. Nevertheless, we do not rule out the possibility of identifying some data categories that the banks may always be required to have access to. However, to meet the requirement for data minimisation, the system should be rigged in such a way that the banks can hold off from obtaining personal data until they know for sure that they will have a use for them.
  3. Security challenges: Use of federated learning has both strengths and weaknesses when it comes to information security and the protection of personal data. Federated learning reduces the need for data sharing. At the same time, it is a relatively new method. The solution makes extensive use of cloud services, which requires data security competence, but also ensures that the participating organisations can largely use their own capabilities and resources to secure their part of the solution. A potential attack vector related to federated learning is the model inversion attack, the purpose of which is to reconstruct (personal) data, based on access to trained models. The risk of this is deemed to be low but also difficult to evaluate.

What is the sandbox?

In the sandbox, participants and the Norwegian Data Protection Authority jointly explore issues relating to the protection of personal data in order to help ensure the service or product in question complies with the regulations and effectively safeguards individuals’ data privacy.

The Norwegian Data Protection Authority offers guidance in dialogue with the participants. The conclusions drawn from the projects do not constitute binding decisions or prior approval. Participants are at liberty to decide whether to follow the advice they are given.

The sandbox is a useful method for exploring issues where there are few legal precedents, and we hope the conclusions and assessments in this report can be of assistance for others addressing similar issues.

About money laundering

Finterai is a start-up enterprise, established in Oslo in 2021, which aims to supply financial technology (fintech) to banks and regulatory authorities. The company’s services address the global challenges posed by money laundering and the financing of terrorism.

What problem does Finterai seek to resolve?

At its root, money laundering is a way of securing the proceeds of crime. The purpose of money laundering is therefore to make these proceeds appear to have been acquired in a lawful manner by disguising/concealing the funds’ illegal origins before they are integrated into the legal economy.

Money laundering in the Financial Supervisory Authority of Norway's sandbox

Combating money laundering has also been the topic of a project in the Financial Supervisory Authority of Norway’s sandbox. In their project, the Financial Supervisory Authority of Norway (FSAN) and Quesnay assessed opportunities and constraints afforded under the Norwegian Anti-Money Laundering Act for a technical solution for the sharing of information between reporting entities, which could make endeavours to combat money laundering and the financing of terrorism more effective.

Read the project’s final report (finanstilsynet.no).

The financing of terrorism means receiving, sending or collecting funds with the intention or knowledge that the money will be used to finance an act of terrorism, or will be used by a terrorist group or by a person acting on behalf of a terrorist or terrorist group. Terrorism may be financed by the proceeds of crime, by legally acquired funds, or by a combination of the two.

The UN estimates that laundered money accounts for two to five per cent of the global economy – approx. twice the size of Norway’s sovereign wealth fund, the Government Pension Fund Global (GPFG).

Unfortunately, international research also indicates that the public authorities succeed in reclaiming as little as 0.1 per cent of the illegal financial gain. This therefore constitutes a substantial loss for the victims of crimes for profit and for society at large. The EU estimates that it loses up to EUR 1 trillion in money laundering-related tax evasion every year.

Read more about money laundering on the UN’s website

Read the research article “Anti-money laundering: The world's least effective policy experiment? Together, we can fix it”

Read more about the EU’s estimate of how much is lost in money laundering-related tax evasion every year

According to the Norwegian Anti-Money Laundering Act, financial institutions must strive to prevent money laundering and the financing of terrorism. In practice, this means, for example, that they have a responsibility to ensure they are not misused to conceal the origins of ill-gotten gains. As a result of this responsibility, financial institutions must understand their customers’ transactions and assess the risk of money laundering.

A lot of wasted effort

Finterai believes that the main problem in the fight against money laundering and the financing of terrorism is the overwhelming amount of futile investigative work. Banks are compelled to use electronic surveillance systems, but Finterai considers these systems to be extremely imprecise. This leads to many “false positive” transactions being investigated. A large number of false positives means that transactions are presumed to be suspicious when they are not actually criminal.

Under the Anti-Money Laundering Act, banks are obliged to investigate all suspicious transactions. A large number of false positive transactions therefore generates a great deal of investigative work for the banks. In light of this, Finterai is therefore offering a machine learning tool to improve the banks’ electronic surveillance systems.

The challenge with using machine learning for this purpose is that money laundering and the financing of terrorism account for a tiny fraction of the total number of financial transactions in most banks. This means that the strength of the “criminality signal” in the data is weak. Most machine learning models must therefore have vast quantities of data for them to work well – more than each bank has on its own.

Impact of a stronger “criminality signal”

If the banks were able to share data among themselves, the “criminality signal” would be strong enough for machine learning to work well. It would improve the banks’ electronic surveillance systems and increase the likelihood of the investigative effort being targeted more effectively:

  1. The total investigative load would be reduced.
  2. It would be possible to take practical steps to counteract real money laundering attempts at an early stage.

The problem with sharing data is that transactions contain personal data. But could this problem be resolved by means of federated learning?

What services does Finterai offer?

Finterai develops machine learning technology based on federated learning, a service which helps banks work together to combat financial crime. The model is trained to identify suspicious transactions partly on the basis of transaction history. Finterai's concept is for the banks to learn from each other’s datasets. However, this is made difficult by legal restrictions on the type of data the banks can share with one another.

In order for the banks to benefit from each other’s datasets without also sharing personal data, Finterai makes use of federated learning. Finterai’s ambition is to make it easy for a bank to develop robust investigative systems based on machine learning, in conjunction with other banks.

About federated learning

Federated learning is a method of machine learning. Machine learning is a branch of artificial intelligence (AI) that involves the development of algorithms which “learn” by identifying patterns and connections in large datasets.

In principle, a machine learning system requires data to learn and solve problems. And, in general, the more data it is fed, the better it becomes at solving problems. However, developers can find it difficult to access sufficient data to develop good algorithms. This is especially true when the data contains personal data, the processing of which is strictly regulated.

If an organisation sees that it needs more data, it can join forces with other organisations. In general, the organisations upload their data to a shared central server or machine, which all the partner organisations can use to train machine learning models. However, it is not possible to share data if there is no legal basis for processing the underlying personal data. There is, therefore, great potential for artificial intelligence that can use large volumes of data to learn from but that does not require the sharing of personal data. And this is precisely the main purpose of federated learning – to achieve big data machine learning without data sharing.

History

Federated learning was developed by Google in 2016. Google used the method to train a machine learning model using data located on mobile phones without uploading the data to a centralised network. The objective was to build machine learning models that were updated on the basis of data stored on the users’ mobile phones. The technology was, for example, used in the keyboard application Gboard to predict which word was being typed in. Since then, the technology has been shared and used in other contexts.

Read Google’s own blog post about federated learning

Read the article “Federated Learning for Mobile Keyboard Prediction” on Google Research

In the past few years, various organisations have carried out research on federated learning, which has generated several types of alternative setups for the method. Nevertheless, federated learning remains a new tool and there are still few commercial or public applications involving large volumes of data.

How does federated learning take place?

Different models for federated learning

The most common architecture for federated learning is called “horizontal federated learning”. A less frequently used alternative is “vertical federated learning”, which is more relevant when entities hold different columns/categories of data about the same individuals – in that situation, data standardisation is not required. Other architectures include “federated transfer learning”, “cross-silo federated learning” and “cross-device federated learning”.

Federated learning can be used in different ways to train artificial intelligence. Below, we have described the step-by-step processes in a commonly used model for federated learning (based on the model Google developed in 2016).

  1. Each participant receives a machine learning algorithm.
  2. Each participant uses the local dataset to train the machine learning algorithm.
  3. Each participant encrypts its local “learning package” and sends it to an external, central server. The learning packages do not contain personal data.
  4. The server performs a secure aggregation of the packages.
  5. The aggregation of the packages is used to update the machine learning models, which are stored centrally, with learning from the participants. The machine learning model that is stored centrally is the same as the original model sent out to the participants for local training.
  6. Steps 1 to 5 are repeated until the machine learning model is fully trained.
  7. Each participant receives the fully trained machine learning model and is now able to make better local predictions.

Because only the model parameters are exchanged, local data – which often comprises personal data – does not in theory have to be transferred between participants or between participants and the central server. The built-in limitation on the sharing of local data means that federated learning is considered to be a more privacy-friendly approach to artificial intelligence.
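
To make the process above more concrete, the following is a minimal, illustrative sketch in Python of one possible horizontal federated learning loop of this kind. The linear model, the toy datasets and the plain averaging used for aggregation are simplifying assumptions for illustration only, not a description of Google’s or Finterai’s actual implementations, and a real deployment would also encrypt the learning packages before they are sent.

```python
import numpy as np

def local_training(weights, X, y, lr=0.1, epochs=5):
    """Step 2: each participant trains the shared model on its local data only."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
        w -= lr * grad
    return w                                 # only the learned weights leave the participant

def secure_aggregation(updates):
    """Steps 4-5: the central server combines the participants' 'learning packages'.
    Here this is simply the average of the weights (federated averaging)."""
    return np.mean(updates, axis=0)

# --- toy setup: three participants with local datasets that never leave them ---
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
local_data = []
for _ in range(3):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    local_data.append((X, y))

global_weights = np.zeros(2)                 # step 1: the model is distributed to everyone
for round_no in range(10):                   # step 6: repeat until the model has converged
    updates = [local_training(global_weights, X, y) for X, y in local_data]
    global_weights = secure_aggregation(updates)

print("aggregated model weights:", global_weights)   # step 7: the shared trained model
```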

About the project

A key question regarding the use of federated learning – including in this sandbox project – is whether the machine learning models that are exchanged between participants contain personal data from local data. The answer is important from a regulatory point of view, since the data protection regulations only apply to the processing of personal data.

When the federated learning method is applied to personal data, it is in principle only the learning, or “model parameters”, that is shared between participants. The model parameters are the weights that represent the model’s learning. Although personal data are neither sent nor stored externally, there is a hypothetical possibility that personal data may be deduced from these weights if the model has vulnerabilities: if the model has learned personal data, the weights could reveal that information to ill-intentioned participants who actively attack the model.

However, if the local data does not leave the banks’ local datasets, what is then shared? The answer is the model parameters and hyperparameters.

Model parameters and hyperparameters

Hyperparameters establish the framework for how machine learning is to take place. In other words, they define what the learning is to be based on and how datapoints are related to one another. The model parameters, on the other hand, are the specific weights (contents) that the model learns during training.

A backpropagation algorithm is used to train the model parameters. This algorithm identifies how weights should be changed to make the machine learning solution’s predictions more precise. The learning solution's predictions are what finally result in the identification of the money laundering risk. The processes that make this possible consist of several steps, but the use of federated learning is crucial.
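
As a simplified illustration of the distinction, the sketch below separates the hyperparameters that frame the training from the model parameters (weights) that the training produces. For the single-layer model used here, the backpropagation step reduces to computing the gradient directly; the dataset and hyperparameter values are illustrative assumptions only.

```python
import numpy as np

# Hyperparameters: chosen before training, they frame how learning takes place
hyperparameters = {"learning_rate": 0.05, "epochs": 200, "n_features": 3}

# Model parameters: the weights that the training process itself produces
weights = np.zeros(hyperparameters["n_features"])

def train(X, y, weights, hp):
    for _ in range(hp["epochs"]):
        predictions = X @ weights
        error = predictions - y
        gradient = X.T @ error / len(y)           # how each weight should be changed
        weights = weights - hp["learning_rate"] * gradient
    return weights                                 # these weights are what a federated process shares

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 0.5, -2.0])
weights = train(X, y, weights, hyperparameters)
print("trained model parameters:", weights)
```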

The extent to which a machine learning process based on federated learning permits the re-identification of data used to teach it depends on the design of the specific model and the training process. Challenges must therefore be assessed on the basis of the specific choice of solutions architecture and machine learning model. This is addressed in the chapter on security challenges.

How does Finterai intend to use federated learning?

To realise its ambition, Finterai will use federated learning in a slightly different way to Google’s version of the technology. The biggest difference comes right at the beginning, with a step before the model is distributed to the participants. In this step, the participants themselves decide what kind of model is to be trained, with each participant defining the model’s hyperparameters. In other words, it is Finterai’s customers and not Finterai itself who define which machine learning models are to be trained federally.

This leads to a new difference in the system. Finterai’s federated learning process is serial rather than parallel, in that a machine learning model is trained first by one participant before being sent to the next one. This contrasts with Google’s approach, in which machine learning models are sent out simultaneously to all participants, who then update the central model continuously. There is also another important difference: Google gets only a model update (gradients) back from its participants, while Finterai receives the entire machine learning model.

Model updates are smaller than the entire machine learning models, which thereby reduces network traffic. However, Finterai chooses to transfer entire machine learning models for technical, security and commercial reasons. Finterai also implements “secure aggregation” differently to Google. The difference is partly a function of the fact that the enterprises address different “use cases”.

Finterai aims to perform explicit tests of security threats, distortion issues and data leakage threats that may arise during the federated learning process. This provides a more robust degree of personal data protection and system protection than Google’s original model afforded. It is worth noting that such problems will arise in any situation in which machine learning models are shared or made available. In other words, these are not threats unique to federated learning.

Simplified, step-by-step presentation of Finterai’s federated learning process:

  1. A participating bank asks Finterai to build a machine learning model. The participant supplies Finterai with its own hyperparameters and other training instructions.
  2. Finterai builds a model based on the instructions received.
  3. Finterai sends this model, with the defined hyperparameters, to the first participating bank for training on its local dataset.
  4. The first participating bank receives the model, with the hyperparameters that describe the training. This training takes place locally, using standardised transaction data and other data (KYC and third-party data).
  5. Once the training has been completed locally at the participating bank, the model and hyperparameters are returned to Finterai. The model is then stored in Finterai’s database.
  6. Finterai checks the quality of the model, looking for data leakages and distortions.
  7. Finterai sends the updated model and relevant hyperparameters to the next participating bank.
  8. Participant 2 receives the model and the hyperparameters. The model is trained locally by Participant 2 on the same type of data as in step 4.
  9. Steps 5 to 9 are repeated until the model is fully trained – i.e. it has converged.
  10. Finterai stores the fully trained model on a server. All the participants in the federated learning process have access to the models, which can be downloaded from the server and used immediately on the banks’ local datasets to identify suspicious transactions.

In this model, all data storage related to these processes (including transaction monitoring) takes place at the banks. Finterai will not need access to the banks’ local data containing transaction details to develop or operate the service.
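
The following is a minimal, hypothetical sketch of how the serial process described above could be orchestrated. The participant and coordinator objects, the placeholder training step and the leakage check are assumptions made for illustration; they are not Finterai’s actual software.

```python
from dataclasses import dataclass, field

@dataclass
class Model:
    hyperparameters: dict
    weights: list = field(default_factory=list)    # starts untrained

class Bank:
    """A participant: trains the model locally; local data never leaves the bank."""
    def __init__(self, name, local_data):
        self.name = name
        self.local_data = local_data

    def train_locally(self, model):
        # placeholder for training on standardised transaction, KYC and third-party data
        model.weights.append(f"update-from-{self.name}")
        return model

class Coordinator:
    """Finterai's role: build, check and pass the model on; no access to local data."""
    def build_model(self, hyperparameters):
        return Model(hyperparameters=hyperparameters)           # step 2

    def passes_leakage_check(self, model):
        return True                                              # step 6: placeholder quality check

    def run_round(self, model, banks):
        for bank in banks:                                       # serial, not parallel
            model = bank.train_locally(model)                    # steps 4 and 8
            if not self.passes_leakage_check(model):             # step 6
                raise ValueError(f"model from {bank.name} rejected and deleted")
        return model                                             # stored and shared in step 10

banks = [Bank("Bank A", local_data=[]), Bank("Bank B", local_data=[])]
coordinator = Coordinator()
model = coordinator.build_model({"layers": 2})                   # step 1: bank-defined hyperparameters
trained = coordinator.run_round(model, banks)
print(trained.weights)
```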

Discussions between the Norwegian Data Protection Authority and Finterai

The Norwegian Data Protection Authority and Finterai have held five workshops to discuss the technology Finterai is planning to use and challenges relating to the data protection regulations. At the first workshop, Finterai was still at the concept development stage with regard to its solution. Much of the discussion therefore concerned how Finterai could design the solution in a way that best protected data privacy. Although the Norwegian Data Protection Authority has not attempted to influence Finterai’s method, the discussions helped to highlight the consequences of the choices it makes when designing the solution.

One tangible learning outcome from the workshops is that developers can design the federated learning method in many different ways. Different designs address important data protection considerations to varying degrees and introduce different vulnerabilities. Choices that would result in the collection and centralisation of the banks’ transaction data on a central server could potentially create a large attack surface and trigger extensive demands for technical and organisational measures.

At the time of writing this final report, Finterai has elected to pursue a more decentralised solution, which minimises the system’s attack surface, since different data storage systems can rarely be attacked through the same vulnerability. This will also have an impact on security threats, discussed in the chapter on security challenges. Information concerning the requirements for, and consequences of, the different systems architectures has been extremely important for Finterai, since it has helped the company to make good choices at an early stage that was otherwise characterised by much uncertainty.

The Financial Supervisory Authority of Norway's involvement in the project

This project touches on the relationship between the anti-money laundering regulations and the data protection regulations, which both safeguard important social considerations. The Financial Supervisory Authority of Norway (FSAN) is responsible for verifying that financial institutions comply with the anti-money laundering regulations. However, the FSAN has played no formal role in the Norwegian Data Protection Authority’s Finterai sandbox project.

These two regulatory frameworks are intended to safeguard principles that are to some extent contradictory, with certain unclear boundaries between customer-related measures and the data minimisation principle. This project has shown the Norwegian Data Protection Authority that it can be challenging for us as a supervisory authority to provide clear recommendations and guidance on good data protection in the effort to combat money laundering without involving the FSAN. It has therefore been natural to consult the FSAN about relevant matters relating to the interpretation and practising of the anti-money laundering regulations as the project has progressed.

The FSAN also participated in one of the sandbox project’s workshops as an observer. However, it is important to make clear that this report expresses the Norwegian Data Protection Authority’s assessments and views. The FSAN has reviewed whether the report’s representations of the Anti-Money Laundering Act contain errors, but has taken no position on the factual descriptions or on Finterai’s assessments in relation to the regulations. Beyond this, the FSAN has not been involved in the preparation of this report.

Finterai and the FSAN have, in parallel with the sandbox project, also engaged in a dialogue on a number of issues relating to the interpretation of specific provisions in the Anti-Money Laundering Act. This dialogue has primarily aimed to clarify whether the Anti-Money Laundering Act places restrictions on the types of information that may be shared between reporting entities. The FSAN has replied to these questions in a letter sent directly to Finterai.

Objective of the sandbox process

Based on Finterai's use of artificial intelligence (AI), the Norwegian Data Protection Authority and Finterai have jointly identified three problematic areas, where federated learning in this project challenges the data protection regulations and the data subjects’ privacy. These issues may also be relevant for other developers and enterprises wishing to use federated learning:

  1. What roles and responsibilities does Finterai assume in the use of the federated learning model?

    What personal data processing activities take place in connection with federated learning, including standardisation as a preparatory step? What responsibility on the part of Finterai does this trigger?
  2. How can Finterai facilitate data minimisation relating to the processing of transaction data and third-party data?

    When Finterai sets limits for the data categories all the participants must have access to, what impact does this have on data privacy?
  3. Can Finterai safeguard data privacy when sharing machine learning algorithms?

    How can Finterai evaluate any breaches or vulnerabilities in the solution? How can an acceptable risk level be identified?

Legal basis for processing personal data

All processing of personal data requires a legal basis to be lawful.

Article 6(1) (a–f) of the General Data Protection Regulation (GDPR) contains an exhaustive list of six legal bases for the lawful processing of personal data.

In this sandbox project, we have taken no position on whether banks have a legal basis for processing personal data in the artificial intelligence (AI) tools that Finterai is offering. This applies both to the use of AI tools as part of the banks’ anti-money laundering activities and any use of personal data to train the algorithms. Nor have we taken any position on whether Finterai has a legal basis for processing personal data, should this become relevant.

The discussions in this sandbox project presume that the data controllers, whether Finterai itself or the banks, find a legal basis for processing personal data when using and further developing this service. If not, it will not be possible to use the service legally.

We assume that the banks’ obligations and potential leeway under the anti-money laundering regulations will be a natural starting point for an assessment of the legal basis for processing personal data for the purpose of uncovering suspicious transactions (Article 6(1)(c) of the GDPR). In some of our discussions, therefore, we have referred explicitly to this regulation, without this implying we have taken a position on whether the banks’ use of the service may be authorised pursuant to the anti-money laundering regulations.

If the processing cannot be authorised under the anti-money laundering regulations, the banks must themselves identify another legal basis for doing so. Legitimate interests (Article 6(1)(f) of the GDPR) will probably be the most relevant alternative, although we have taken no position on that here.

Which personal data are processed?

SWIFT messages

SWIFT messages are made up of transaction data. SWIFT is an international payment network for the transfer of funds between banks not located in the same country. SWIFT messages may contain personal data if an individual is the sender or recipient in the transaction.

“Know your customer” (KYC)

Financial institutions are obligated to collect information about their customers (including their identities), the purpose of the customer relationship and its intended nature, which of the reporting entity's products and services they use and the source of the funds, etc. This is known as the “know your customer” principle and is used to classify customers in various risk categories and to verify that transactions are performed in accordance with the information collected.

Not all categories of KYC data necessarily contain personal data, but some categories do. KYC data is obtained both from the customers themselves and from publicly available websites and third-party suppliers offering this type of data as a paid service. Data obtained from publicly available websites and third-party suppliers is referred to as third-party data.

Third-party data

Third-party data is used to add new information or verify information provided by customers themselves. This could, for example, be information that someone is a politically exposed person (PEP), that they are on a sanctions list, that they have been the subject of negative media reports, or information from other public sources relating to criminal offences, litigation, etc. Third-party data may include datasets that are compiled from a variety of data sources. Third-party data is often collected through different platforms and websites, which are then aggregated by a data supplier. Third-party data does not always contain personal data.

The assessments in this report relate solely to the processing of data considered to constitute personal data.

Roles: What responsibility does Finterai have?

A key question in the sandbox project related to the distribution of responsibility between Finterai and its customers.

The background to this question was twofold. First, federated learning requires all those involved to have the same datapoints in the same format. Finterai wanted to discuss how much leeway it had to facilitate such standardisation without assuming the responsibilities of data controller.

Second, there is a slight possibility that the models contain personal data. Finterai therefore wanted to discuss what consequences this might have for the company when they check the quality of the models.

In the sandbox, we jointly identified various personal data processing activities that occur in, or are presumed by, Finterai’s solution. We selected three of these processing activities to examine in more detail:

  1. The standardising of transaction data
  2. The collection of third-party data
  3. The checking of vulnerabilities in the model

We elected to examine these processing activities in further detail because they are key processes in, or are presumed by, Finterai’s solution, and because they are processes that are also relevant for comparable enterprises or those with an obligation under the Anti-Money Laundering Act.

The Norwegian Data Protection Authority has not come to a conclusion on the roles Finterai and the banks play respectively with regard to the three selected processing activities. This is because we have not taken a position on what could potentially be the legal basis for Finterai or the banks’ processing of personal data. The discussions in the sandbox have therefore primarily sought to identify relevant factors in the assessment of roles, based on specific processing activities, and determine the direction in which the various factors lean. We emphasise that the Data Protection Authority’s role assessments are merely advisory. Finterai and the banks must themselves come to a conclusion about their own roles on the basis of an overall assessment of the actual circumstances.

1.   Standardising transaction data

To make the banks’ local data compatible with Finterai’s federated learning solution, the data must be standardised and structured before they can be used. Finterai provides a software solution that uses AI-driven natural language processing. The objective of the software is to standardise data, including SWIFT messages, so that they have the same format as equivalent data from other banks that also want to participate in the federated learning process.
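
Purely as an illustration of what standardisation of format means in practice, the sketch below maps a simplified, hypothetical payment record onto a shared schema. The field names and the schema are assumptions made for illustration; Finterai’s actual software uses AI-driven natural language processing rather than a fixed field mapping, and it runs entirely within the bank’s own IT environment.

```python
# Hypothetical shared schema that all participating banks would map their data onto
STANDARD_FIELDS = ["amount", "currency", "sender_country", "receiver_country", "message_text"]

def standardise(raw_message: dict) -> dict:
    """Convert one bank's internal representation of a payment message to the shared format.
    Runs inside the bank's own IT environment; the result is never shared with Finterai."""
    return {
        "amount": float(raw_message.get("amt", 0.0)),
        "currency": raw_message.get("ccy", "").upper(),
        "sender_country": raw_message.get("ordering_institution_country", ""),
        "receiver_country": raw_message.get("beneficiary_country", ""),
        "message_text": raw_message.get("remittance_info", "").strip(),
    }

example = {"amt": "1250.00", "ccy": "nok", "ordering_institution_country": "NO",
           "beneficiary_country": "GB", "remittance_info": " invoice 4421 "}
print(standardise(example))
```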

In the sandbox, we discussed roles relating to the standardising of SWIFT messages. However, the factors to be considered are also relevant for an assessment of roles relating to the standardisation of other categories of data, such as KYC information and third-party data.

The most relevant purpose for standardising data in this context will be to adapt the data’s format so they can be used in a federated learning process across the participating banks. Each bank decides for itself whether it wants to use Finterai’s federated learning service. Furthermore, each bank decides for itself which method it wants to use to convert the data to the format required for participation. It is thus not a requirement that they exclusively use Finterai's standardisation software.

Furthermore, the software is installed and run within the bank’s own IT environment, without Finterai or the other participating banks having access to the data being standardised. The software’s algorithms are also trained internally at the individual bank, and it is the bank itself which is responsible for this process. The result of the training will not be shared with Finterai or the other banks which use the same software.

The factors highlighted above indicate that each bank has a decisive influence over both the decision to employ federated learning as part of their anti-money laundering endeavours and the means it will use to achieve this end.

In our sandbox discussions, we have not identified any purpose related to the banks’ use of the standardisation software where Finterai has a decisive influence. As previously mentioned, Finterai will not have access to the personal data processed by the banks or the results of the processing (standardised data or learning from AI-driven natural language processing).

On this basis, the Norwegian Data Protection Authority deems it highly unlikely that Finterai has a decisive influence over the purpose for or means by which the data is processed. In that case, Finterai will not be accorded the status of data controller. Since Finterai does not process personal data on behalf of the banks, it will not have the status of data processor either.

2.   Collecting third-party data

It is not merely the data’s format that must be the same across the entities wanting to use federated learning. The data categories used must also be the same. This is to enable the banks to access the same type of data and to allow input data to be interpreted by the model.

To ensure that all the banks using the service have access to the same data categories, Finterai must establish a framework for the types of data that may be used in the federated learning system. In addition to data from SWIFT messages and KYC data, which are obtained from the customers or, potentially, the transaction counterparties, Finterai requires the banks to collect what it calls third-party data. This will often contain personal data.

In the sandbox, we discussed what consequences it may have for Finterai’s role that it defines the data categories which participants in the federated learning process must have access to. We have presumed that the banks are already in possession of the SWIFT messages and KYC data. The discussions have therefore primarily related to the collection of third-party data.

The purpose of standardisation is to ensure that the participating banks have access to the same type of data. However, there is another purpose which is crucial for the choice of data categories, namely that the banks should be able to build up and train models that are well suited to uncovering suspicious transactions.

It is the banks that are responsible for complying with the anti-money laundering regulations. Finterai has no independent responsibility under this regulatory framework. As discussed in the paragraphs concerning the standardising of transaction data, it is the banks themselves that have decisive influence over whether they use Finterai's federated learning solution as part of their anti-money laundering endeavours. There is no requirement for the banks to use Finterai’s software for the collection of third-party data.

Each bank decides for itself whether it wants to use Finterai’s service and therefore which categories of third-party data it must subsequently collect, and by what means. If a bank disagrees with Finterai about the data categories that need to be collected for this purpose, it can choose not to use Finterai’s solution as part of its anti-money laundering endeavours. In other words, it can be argued that it is the individual bank which has decisive influence over the means to be used to achieve the intended purpose.

Data categories may be said to be essential means (which and whose personal data are to be collected), or the “core” of how the purpose is to be achieved. Furthermore, the data will not be shared with Finterai. Who has decisive influence over which categories of third-party data must be collected (means) in order to train models that are well suited to uncovering suspicious transactions (purpose) is an important factor in the assessment of roles.

The factors that have been highlighted above indicate that it is the individual bank itself that has decisive influence over the purpose and means of third-party data collection. We have, however, also looked at any self-interest that Finterai may have in the decision about which data categories are to be used in the solution.

In the assessment of roles relating to the standardising of transaction data, we emphasised that Finterai will have access neither to the personal data being processed nor the results of that processing. With regard to the collection of third-party data, Finterai will have no access to this type of personal data, although it will have access to the models that have been developed with the help of that personal data.

These models will also be available to all the banks participating in the federated learning process. It could be argued that the better these models are, the more attractive Finterai’s service will probably become. The banks’ access to models that are adequate and effective in their anti-money laundering endeavours could therefore be perceived as promoting the sale of Finterai's service, thereby giving the company a commercial advantage. The purpose of choosing which data categories to include in the service is, however, to enable the banks to build effective AI models with which to fulfil their obligations under the anti-money laundering regulations.

Any purely commercial advantage that Finterai may be imagined to derive here is, pursuant to the European Data Protection Board's guidelines, not in itself sufficient to qualify as a purpose with respect to processing. This would imply that Finterai does not have decisive influence over the purposes mentioned above, with the result being that data controller responsibility is not triggered on Finterai’s part.

Read the European Data Protection Board’s guidelines relating to the roles of data controller and data processor (07/2020)

It is also important to consider whether there is a legal basis for the collection and use of third-party data in this way before the data are collected and used. Although the matter of legal basis is not addressed as a separate topic in this report, we would nevertheless like to remind readers of it. The data minimisation principle must also be considered with respect to the collection and use of third-party data. This will be discussed below in the chapter on data minimisation.

3.   Checking for vulnerabilities in the model

The final processing activity we discussed relates to the checks that Finterai must perform in connection with the execution of federated learning.

The models shared between the banks and Finterai are, in principle, intended to be devoid of personal data. There is nevertheless a certain risk that personal data on which a model has been trained could be extracted if the model has vulnerabilities. This is referred to as data leakage and means that it is possible to re-identify individuals. Such vulnerabilities may arise if an error has occurred during the training process.

Finterai plans to perform a variety of checks on the models to be trained using federated learning. One check that all models must undergo after training, and before they are shared with other participants, is intended to uncover any risk of data leakage from the model. If Finterai’s check uncovers vulnerabilities that could lead to data leakage, processing of the model in question could be deemed processing of personal data.

There is reason to assume that such errors will arise from time to time. The question raised in the sandbox was Finterai's role in the event that it, in practice, ends up processing personal data due to errors made by the banks.

The purpose of vulnerability checks is to prevent models containing personal data from slipping into the federated learning system. It might be thought that each bank could perform this check itself before sending the model to Finterai. However, there are good reasons for this checking process to be performed by Finterai. This includes avoiding quality control being dependent on the competence of the individual bank.

Finterai has disclosed that any models with vulnerabilities will be deleted and the bank from which the model came notified. By checking the models for weaknesses, and excluding models containing errors, Finterai is helping the banks to prevent personal data from going astray.

On the basis of the above, we consider it likely that the checks must be regarded as performed on behalf of the bank that last trained the model, and that Finterai acts as data processor for the banks to the extent that personal data are processed as part of the checks. It would be natural to include guidelines in the data processor agreement setting out the measures Finterai and the banks must implement if an error occurs and Finterai gains access to personal data.

For models that do not contain personal data, neither the checks performed, nor the further sharing of the models will constitute processing in the meaning of the GDPR, since personal data are not processed. The GDPR will therefore not apply in such cases. If there is any doubt about whether or not the model contains personal data, it should be treated as though it did contain personal data and the GDPR were applicable.

Data minimisation

The discussions in this chapter relate to how Finterai can facilitate data minimisation in its service.

The development of artificial intelligence (AI) often depends on vast quantities of personal data. However, the data minimisation principle requires that the data to be used are adequate, relevant and limited to what is necessary to achieve the purpose for which they are being processed. This means that a data controller cannot use more personal data than are actually necessary to achieve the purpose, and that the data must be deleted when they are no longer needed. Furthermore, the data minimisation principle means that the data selected must be relevant for the purpose.

Read more about data minimisation

In the Norwegian legal commentary to the GDPR (Skullerud et al.), it is pointed out that the requirements for adequacy and relevance mean that the personal data processed must “have a close and natural link to the purpose for processing and be suitable for that purpose”. The data minimisation assessment is irrevocably tied to the purpose for processing.

Read the legal commentary on the GDPR (juridika.no)

The data controller is responsible for upholding the data minimisation principle. A software supplier which, after a specific assessment, is not deemed to be the data controller will in principle not have any direct responsibility for upholding the data minimisation principle. Nevertheless, it is important that the software delivered makes it possible for the data controller to comply with the regulations in practice. Otherwise, the supplier’s customers will not be able to use the software legally for the processing of personal data. It is therefore important that Finterai takes a conscious approach to data minimisation in the development of its service, regardless of whether or not the company is deemed to be the data controller.

In the sandbox, we discussed how Finterai’s service could affect the volume of personal data that the banks use in their effort to uncover suspicious transactions, and what Finterai could potentially do to facilitate data minimisation. The discussions focused primarily on the collection of third-party data that do not come directly from the transactions but that are collected from parties other than the customers themselves. Such data could, for example, have been published in the media. Third-party data does not always contain personal data. The assessments in this report relate solely to the processing of third-party data considered to constitute personal data.

Relationship with the anti-money laundering regulations – challenges for standardisation

We have not assessed legal basis in this sandbox project. The banks' obligations to assist in the fight against money laundering are, however, set out in the Anti-Money Laundering Act and associated statutory regulations. It would therefore be understandable to assume that the banks, if they want to process third-party data for the purpose of uncovering money laundering, must find a legal basis for such processing in the anti-money laundering regulations.

The anti-money laundering regulations are risk based. This means that each bank’s obligation to collect information pursuant to this regulatory framework depends on the risk the individual customer represents within the bank in question. The same customer may constitute a different risk for one or more different banks. In addition, each bank’s customer base will be made up of customers to which different levels of risk are attached.

This risk-based approach in the anti-money laundering regulations could create problems for Finterai's desire to standardise the data categories that the banks must use in the federated learning process. The data minimisation principle means that the banks cannot process more personal data than are necessary to fulfil the purpose. One question that emerged in the sandbox was whether it is possible, within the limits of the prevailing anti-money laundering regulations, to find a minimum level of data that may always be collected, irrespective of risk, and that may therefore be included in a standardisation process.

It is the Financial Supervisory Authority of Norway (FSAN) which monitors the reporting entities’ compliance with the anti-money laundering regulations, and an interpretation of this regulatory framework falls outside the remit of the Norwegian Data Protection Authority’s sandbox process. We are therefore unable to answer this question. However, all processing of personal data requires a legal basis, and the discussion below therefore presumes that it is possible to establish a minimum level of data that may always be used in connection with anti-money laundering activities, regardless of risk.

Data minimisation and federated learning – the need for predefined data categories

Some banks already collect third-party data in connection with their anti-money laundering endeavours. However, the banks differ in the kind of data they collect. In order for federated learning to work as intended, it is necessary to coordinate which data categories the banks process. This is to enable a model developed in Bank A to be trained in Bank B and Bank C. These subsequent banks must have access to the same data categories as Bank A used when the model was developed.

The banks which participate in federated learning must therefore have access to the same categories of personal data. However, the need for each category of personal data only arises when a bank builds a model that uses the personal data concerned. Nevertheless, it may be assumed that some types of data, for example the data contained in SWIFT messages, will always be relevant. Other categories of personal data are used more rarely or potentially not at all. At this point, the issue of data minimisation arises. Is it in line with the data minimisation principle to collect personal data without knowing, at the time they are collected, whether they will be needed? This issue will probably also be relevant to a greater or lesser degree for other entities that use federated learning on the basis of personal data.

In the sandbox, we discussed different ways of adapting Finterai’s service to enable the banks to avoid collecting the various personal data categories until there is an actual need for them.

The most realistic alternative discussed was for the banks to collect necessary third-party data only once they have decided to develop a model that includes those data, or when they receive a model for training that requires the particular data concerned. Finterai has pointed out that such a solution could be technically challenging, and that it will lead to delays in the training process.

If it turns out that the banks have collected personal data that there is never any need for, the data concerned cannot be said to be necessary for the specific purpose. The Norwegian Data Protection Authority therefore proposes that the system be rigged in such a way that the banks can hold off from obtaining personal data until they know for sure that they will have a use for them. In this context, it is important to underline that the Norwegian Data Protection Authority’s contributions must be considered as guidance and do not constitute an assessment of the legality of Finterai’s planned service.
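
As a hedged sketch of what such an approach could look like in practice, the example below only fetches the third-party data categories that a specific model declares that it needs, rather than collecting every standardised category up front. The category names and the fetch function are hypothetical and for illustration only.

```python
AVAILABLE_THIRD_PARTY_CATEGORIES = {"pep_status", "sanctions_list", "adverse_media"}

def fetch_third_party_data(customer_id: str, category: str) -> dict:
    # placeholder for a call to a third-party data supplier
    return {"customer_id": customer_id, "category": category, "value": "..."}

def collect_for_model(required_categories: set, customer_ids: list) -> list:
    """Collect only the categories a concrete model declares that it needs (data minimisation),
    instead of collecting every standardised category up front."""
    unknown = required_categories - AVAILABLE_THIRD_PARTY_CATEGORIES
    if unknown:
        raise ValueError(f"model requires unsupported categories: {unknown}")
    return [fetch_third_party_data(cid, cat)
            for cid in customer_ids
            for cat in required_categories]

# A model arriving for training declares that it only uses PEP status:
records = collect_for_model({"pep_status"}, ["cust-001", "cust-002"])
print(len(records), "records collected")
```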

Data minimisation in artificial intelligence (AI)

According to Finterai, the models the banks use today to uncover suspicious transactions are too weak. The company has a theory that the banks need more datapoints in their models to do a satisfactory job of uncovering suspicious transactions, something it wants to facilitate in its service.

Use of artificial intelligence (AI) enables systems to be built that can learn, find connections, conduct probability analyses and draw conclusions far beyond the capacity of both humans and systems that do not use AI. This means that AI-based systems could increase the quality of the banks’ anti-money laundering efforts. These systems will probably find connections in data that have not traditionally been used in anti-money laundering endeavours and that are not initially considered to have a close and natural link to the fight against money laundering.

However, the banks do not always know the extent to which various third-party data will help to uncover attempts to launder money (the purpose of the processing) until they have tested the data over time. This could prove a challenge. If, after testing, it should prove that one or more categories of personal data are of little or no significance for the achievement of the purpose, those data will not meet the requirement for relevance. In that case, continued processing of those items of personal data would quickly contravene the data minimisation principle.

But what about that processing of the personal data concerned which took place up until the point at which the bank (or Finterai) discovers that they are not sufficiently relevant to achieve the purpose? Would that also contravene the data minimisation principle? These are questions we have discussed in the sandbox but have not found clear answers to. As with much else, the answers will depend on a specific assessment.

However, there are certainly no grounds to say that the processing of personal data which subsequently prove insufficiently relevant to achieve the purpose is always in breach of the data minimisation principle. In any assessment of this, it is relevant to look at the reason why such items of personal data were selected in the first place. For example, were the items of personal data selected at random, or was the selection based on relevant and legitimate assumptions?

Furthermore, it is important to be aware of the risk that an assumption may be wrong and to have effective measures in place to verify the relevance of the personal data being used. The longer it takes before picking up on and halting the processing of personal data that prove to be insufficiently relevant, the greater the risk that the processing contravenes the data minimisation principle. These issues are not unique to Finterai. They are something that everyone using AI-based tools to process personal data should pay particular attention to.

Security challenges

No specific security-related assessments were made in the sandbox of Finterai's solution. However, we have identified what we consider to be the most important overarching threats and opportunities for Finterai's solution.

Use of federated learning has both strengths and weaknesses when it comes to information security and the protection of personal data. One of the most important strengths of this technology is that federated learning does not require the sharing or aggregation of data, including personal data, across multiple entities.

At the same time, it requires that the outcome of the training, i.e. the actual machine learning model that includes the parameter sets, be shared between the entities to create a joint model. Both the sub-models and the joint model could hypothetically be subjected to “model inversion attacks”, in which the original data – including personal data – could potentially be reconstructed.

Solutions architecture

Irrespective of the properties of federated learning, specific choices relating to solutions architecture and solutions design will naturally have an impact on its vulnerability surface. No specific assessments of Finterai’s choices with respect to its solution have been made. We merely present an overarching description of the issues discussed in the project.

Machine learning often presumes large volumes of data, generally combined with specialised software and hardware, and is frequently achieved through the use of cloud services. The use of cloud services in general has not been assessed in the project. The use of such a relatively new method and technology as federated learning creates both challenges and opportunities.

Challenges precisely because the method is new and all the potential vulnerabilities of the algorithms, procedures, tools and services may not yet have been adequately identified. Opportunities because federated learning is well suited to mitigating the classic security challenges, especially by reducing the need to transfer, share and aggregate large volumes of data. Where cloud services used for AI often rely on the uploading and aggregation of data for central processing, federated learning enables decentralised and localised data processing.

Bad actors

Finterai is a start-up company with limited resources to deal with external cybersecurity threats from bad actors, despite the solution and its tools being realised in a modern cloud solution. Cloud solutions offer many potential security functions, but this requires competence and resources for their day-to-day operation, in addition to the operation of the core solution. One important measure for reducing general cybersecurity threats that we discussed in the sandbox was to minimise the volume of assets that need to be protected.

In Finterai’s case, this is achieved by not uploading and aggregating data. Each participating bank processes its own data in its own systems and protects them through its own requirements, resources and capabilities. Only the fully trained sub-model packages are uploaded to the central part of the solution for centralised processing, coordination and verification.

The sub-models are protected via encryption (confidentiality) and signatures (integrity) during transfer and storage. The practical execution and efficacy of encryption depends on the measures and architecture that Finterai chooses to apply. During the actual training process, the sub-models will be decrypted. At this point, participants are able to use their own requirements, resources and capabilities to achieve a desired and suitable level of security. Specific methods and tools for this have not been assessed.
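
As a minimal sketch of how a sub-model package could be protected in transit and at rest, the example below combines symmetric encryption (confidentiality) with an HMAC signature (integrity), using the Python cryptography library and the standard library. The key handling and package format shown are illustrative assumptions, not a description of the measures Finterai has chosen.

```python
import hmac, hashlib, json
from cryptography.fernet import Fernet   # pip install cryptography

encryption_key = Fernet.generate_key()   # in practice: managed per participant, not generated ad hoc
signing_key = b"illustrative-signing-key"

def protect_model_package(model_weights: list) -> dict:
    payload = json.dumps(model_weights).encode()
    ciphertext = Fernet(encryption_key).encrypt(payload)                         # confidentiality
    signature = hmac.new(signing_key, ciphertext, hashlib.sha256).hexdigest()    # integrity
    return {"ciphertext": ciphertext, "signature": signature}

def unprotect_model_package(package: dict) -> list:
    expected = hmac.new(signing_key, package["ciphertext"], hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, package["signature"]):
        raise ValueError("signature mismatch: package integrity cannot be verified")
    return json.loads(Fernet(encryption_key).decrypt(package["ciphertext"]))

package = protect_model_package([0.12, -0.4, 1.7])
print(unprotect_model_package(package))
```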

Internal bad actors generally take the form of disloyal employees, who may have privileged access to internal processing systems. In this case, Finterai’s solution has the same challenges as other information systems, with requirements for access control and authorisation of users. Because of federated learning's decentralised nature, disloyal employees of one participant are restricted to accessing their own data and not other participants’ datasets. Disloyal employees of Finterai itself do not, in principle, have access to customers’ datasets.

The Norwegian Data Protection Authority’s guide Software development with built-in data protection contains general guidelines for risk assessment relating to information security as part of the solution's design.

Availability

The solution's availability is safeguarded partly by its inherently decentralised nature. Local model training may, in principle, be carried out independently of the central service. The same applies to the production phase, in which the banks use the solution for its primary purpose – to identify potential money laundering transactions. All these activities use Finterai's tools but run on the individual client's own platform and infrastructure.

Further development, follow-up learning and the sharing of learning outcomes presume access to the central services. These central services are, however, of limited importance for the system's day-to-day operation and operational availability, because they are not part of the primary service production.

Attacks on machine learning models

All machine learning models that are trained on a dataset, including those which do not use federated learning, may be subjected to attack. One objective for such an attack may be to reconstruct data used for the model's training, including personal data. According to academic literature about federated learning and security challenges, model inversion is considered to be a particularly relevant risk, due to the systematic sharing of machine learning models between multiple entities. Federated learning may be particularly vulnerable to attacks that threaten the model's robustness, or the privacy of those whose personal data is stored by the banks.

Read the research article "Privacy considerations in machine learning"

Attacks on the model may occur during two phases: the training phase or the operational phase.

  • Training phase: Attacks in this phase may manipulate the model's training, or influence or corrupt the model itself. The attacker may also attempt to compromise the integrity of the data used to train the model. It is during this phase that an attacker can most easily reconstruct the local data held by the various participants.
  • Operational phase: Attacks in this phase do not aim to alter the model itself, but to alter the model's predictions/analyses or to gather information about the model's weights. If an attacker learns the model's weights, these could hypothetically be used to reconstruct, in whole or in part, the local data (which may contain personal data) on which the model is based.

The research literature describes several measures to prevent such attacks. The most important is to use models and algorithms that are presumed to be robust against attack. Other potential measures include methods such as Differential Privacy, homomorphic encryption and Secure Multiparty Computation. No specific assessment of individual measures was undertaken as part of this sandbox project. Some of the existing security measures have the disadvantage of introducing noise into the model or have not been extensively tested in practice; they can therefore reduce the model's precision or burden the system with high computation costs.

Read the research article “A Survey on Differentially Private Machine Learning”
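As a simple illustration of one such measure, the sketch below clips a local model update and adds Gaussian noise before it is shared, in the spirit of differential privacy. The clipping norm and noise scale are arbitrary example values of our own; the sandbox has not assessed whether parameters of this kind would suit Finterai's models, and the trade-off mentioned above applies directly: the larger the noise, the stronger the privacy protection but the lower the model's precision.

    import numpy as np

    def privatise_update(update, clip_norm=1.0, noise_multiplier=0.5, rng=None):
        """Clip a local model update and add Gaussian noise before sharing it,
        so that no single customer's data can dominate what leaves the bank."""
        rng = rng or np.random.default_rng()
        norm = np.linalg.norm(update)
        clipped = update * min(1.0, clip_norm / (norm + 1e-12))   # bound the influence
        noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
        return clipped + noise    # more noise: more privacy, less precision

    # Illustrative use on a raw local update
    raw_update = np.array([0.8, -2.3, 0.1])
    shared_update = privatise_update(raw_update)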

A model inversion attack depends on a number of conditions being in place. Firstly, the attacker (especially an external actor) needs access to trained model packages, which are, in principle, protected through encryption, access control and similar measures. Secondly, actually executing a model inversion attack requires both extensive and specialised skills. Even then, the reconstructed data may comprise only part of the original dataset and may be of variable quality.

In addition, this type of attack depends to a large extent on the target algorithms being vulnerable to it. The algorithms Finterai has so far envisaged using in its federated learning system are not considered particularly vulnerable to model inversion, since the basic mathematical principles underpinning the models are presumed not to allow such attacks. Nor will the algorithms used be openly accessible to external actors. Internal actors (typically participating banks) that systematically gain access to each other's unencrypted training models are also, in principle, bound by reciprocal contractual obligations.

All in all, therefore, this type of attack entails significant obstacles and costs for external bad actors, including in the context of Finterai's solution. For its part, Finterai asserts that its federated learning system is no more vulnerable to attacks on personal data than machine learning models in general. Federated learning is nevertheless a young technology, and there may be further vulnerabilities that have not yet come to light. This lack of knowledge makes an accurate risk assessment challenging.

Going forward

In the sandbox, Finterai and the Norwegian Data Protection Authority have explored data protection issues relating to the development of an anti-money laundering solution based on federated learning. This report is not an exhaustive discussion of the questions federated learning raises with respect to the data protection regulations, and the Norwegian Data Protection Authority would like to highlight Privacy by Design as an area for further deliberation.

Companies which take data protection seriously build trust. Article 25 of the GDPR requires that enterprises take account of the fundamental principles for the processing of personal data in all phases of the lifecycle of software that processes personal data, so that the data subjects' rights and freedoms are upheld. Data protection shall be integrated into the technology, be part of the planning phase of the solution's development and be the default setting. Safeguarding data privacy shall also be a natural part of the development process, not something added on at the last minute.

The sandbox project has not had the capacity to explore in depth what Privacy by Design means in the context of machine learning based on federated learning principles, nor has any conclusion been reached as to whether Finterai meets the requirements in Article 25 of the GDPR. Finterai’s federated learning solution may be an inherently more privacy-friendly technology compared to more “traditional machine learning models”, because the method allows participants in the federated learning system to learn from each other's data without actually sharing data. It is precisely this built-in restriction on the further sharing of local data that makes the technology more privacy friendly.

Nevertheless, Finterai must meet the requirements in Article 25 of the GDPR to be relevant for customers who are obligated to choose solutions with Privacy by Design. It would be very useful to explore other technical and organisational initiatives which could effectively build in data protection during the development of the solution.

The Norwegian Data Protection Authority considers that the interface between the data protection regulations and the anti-money laundering regulations should be subject to further examination. At present, it is uncertain how the relationship between the data protection and anti-money laundering regulations affects which data enterprises may collect and use in their anti-money laundering endeavours.

Federated learning in other fields

Going forward, it will be relevant to monitor new areas of application for federated learning. The method is generally useful when:

  1. There are few examples of at least one class of data.
  2. Opportunities for data sharing are limited.
  3. Cooperation is necessary.
  4. There is little relevant data.

The battle waged by insurance companies against insurance fraud has many similarities with the banks’ anti-money laundering endeavours and is therefore an obvious field for the application of federated learning.

Solutions that learn from official register data may also be relevant for the method. In Norway, we gather vast amounts of information about private individuals in a variety of official registers. For example, Norway's health data is considered to be among the best in the world (www.ehelse.no). This provides unique opportunities to develop accurate and effective solutions, as well as to study connections in areas as varied as the high school drop-out rate, pension schemes and public health. This wealth of information also comes with privacy dilemmas, because those who process register data could potentially re-identify individual people. With federated learning, however, different entities can train the same algorithm on their internal datasets without the data ever leaving its original source.

In Norway, we need a better understanding of privacy-friendly technology. That companies like Finterai wish to take the lead, and openly explore their solution in the sandbox, helps to lower the risk threshold associated with developing new AI-based solutions and also provides experience of how such technology works in practice. We hope the sandbox project’s assessments will contribute to innovation through the secure sharing of data and make it easier for developers to comply with the GDPR's requirements.