This article explains what data sovereignty is, the technical challenges it brings, and how Airwallex built an intelligent routing solution on Apache APISIX to address them.
Airwallex is a global fintech company that provides digital fintech products for enterprises. Its financial platform and payment network cover more than 50 currencies in over 130 countries and regions. Serving users in so many regions naturally raises the problem of data sovereignty.
What is Data Sovereignty?
Data sovereignty refers to state sovereignty in cyberspace, reflecting the dominant position of a state in controlling data rights.
Before elaborating on data sovereignty, let’s talk about GDPR first. GDPR (General Data Protection Regulation) is a regulation formulated by the European Union for the privacy and protection of personal data. Its basic requirements are that any collection of user data must be based on the user’s consent, and that users must be able to have their data erased.
It is worth noting that when Airwallex intends to move data from Europe to another region, it has to ensure that the destination country’s data sovereignty requirements are in line with those of the EU.
Here is another example. The USA PATRIOT Act, passed after the September 11 attacks, requires that all data stored within the United States or by U.S. companies be subject to U.S. supervision. The U.S. Department of Justice and the CIA have the right to ask these companies to provide the data.
In 2013, the U.S. Department of Justice asked Microsoft to hand over email data stored on servers in Ireland. Microsoft declined on the grounds that doing so would violate EU regulatory requirements. The Department of Justice then took Microsoft to court, but Microsoft won the case in the end.
Later, many U.S. companies located their data centers directly in Europe to avoid data sovereignty problems, thinking that would be safe. But in some recent cases, judges have ruled that the U.S. still has the authority to request data held in Europe by U.S. companies.
Data sovereignty poses a significant challenge for the global business of Airwallex.
Data Sovereignty Business Challenges
The figure above shows a single data stream from a multinational company. Without the requirement of data sovereignty, data can be placed in Europe and synchronized to Asia or any data center in the world. When it is necessary to interface with VCR, putting the data in a certain region allows you to encapsulate all the business into one service system.
But in an age where data sovereignty is valued, this no longer works. Many data flows are restricted, so the previous architecture cannot be adopted. In Europe, only European data can be processed, and in Asia, only Asian data.
An obvious solution is to store user data only in the user’s home country and never replicate it to other regions. The service can then no longer be completely stateless, because data can only be processed where it resides. This architecture is simple, but it does not fit the vast majority of real business scenarios, where interaction between clusters is unavoidable. Even though the data in each region is completely independent, some business operations still need to access the data of another cluster.
For example, Amazon has a U.S. site, a German site, a China site, and so on, but the data of these sites is not interconnected. If you buy books in the U.S. Kindle store and try to sync them to your Chinese account, this is not supported, because Amazon’s data is completely segregated between regions. If a user only uses a Chinese account, none of that user’s requests ever leave the Chinese data center.
This approach is very efficient at the beginning because Amazon leaves the problem to the users, who have to manage it themselves. However, a user with accounts in several countries cannot synchronize them with each other, which is quite inconvenient. Amazon would not necessarily have adopted this solution if it had thought the architecture through from the start.
As another example, a multinational group has many subsidiaries in various countries, and each of them stores its data in a different country to meet data sovereignty requirements. When the group’s finance staff manage the financial data of the whole group, they have to switch between the data of each region, such as Europe and Asia. This is very inconvenient for customers with global business.
Intelligent Routing Solution of Apache APISIX
For these complex business scenarios, Airwallex uses Apache APISIX to build an intelligent routing solution, in which the Apache APISIX gateway decides where the data should be processed.
The gateway is composed of two layers. The first layer is responsible for routing requests, determining which data center the request should reach based on the conditions. The second layer takes charge of traffic forwarding.
The gateway therefore has to answer two questions: 1) To which data center is each request routed? 2) How is the traffic forwarded?
Request traffic is divided into two categories.
Request without the identity of the user
- Registration: When a user registers for the first time, the information is incomplete and we don’t know which data center this user’s registration data should be placed in.
- Static resources: The static resources of all the data centers must be identical.
Request with the identity of the user
- Logging in: When a user logs in, it indicates that the registration has been completed, and the data center is known at this point.
- Password resetting: The location of the data can be looked up from the username, mobile number, email, city, etc., and the request is then forwarded to that data center.
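The two categories above can be sketched as a small routing decision. This is only an illustration and not Airwallex's implementation: the path prefixes, the `directory` mapping, and the rule that requests without a user identity are served by the data center where they entered are all assumptions made for the example.

```python
from typing import Mapping, Optional

# Hypothetical path prefixes, used only for illustration.
ANONYMOUS_PREFIXES = ("/register", "/static/")
IDENTIFIED_PREFIXES = ("/login", "/password-reset")


def choose_data_center(
    path: str,
    identifier: Optional[str],
    directory: Mapping[str, str],
    entry_data_center: str,
) -> str:
    """Return the data center that should handle the request."""
    if path.startswith(ANONYMOUS_PREFIXES):
        # No user identity yet: registration and static resources are
        # handled where the request entered (an assumption of this sketch).
        return entry_data_center
    if path.startswith(IDENTIFIED_PREFIXES):
        # The user is already registered, so the data center can be looked up.
        if identifier is None or identifier not in directory:
            raise LookupError("no known data center for this user")
        return directory[identifier]
    raise ValueError(f"unclassified path: {path}")
```

For instance, with `directory = {"alice": "EU"}`, a login request for `alice` that enters in Asia is routed to `EU`, while a request for `/static/app.js` stays in the data center where it arrived.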
Dynamic routing of various data centers
Next, we’ll talk about how Airwallex goes about dynamic routing on various data centers.
Login and password reset
When a user logs in, we receive the username and password. The password cannot be used as identifying information, so we can only query by username to determine which region the user belongs to. This requires a data store that can be synchronized globally. For example, if a user registers an account in China, the change is turned into a Kafka message through CDC (Change Data Capture) and picked up by a dedicated listener. The message is then transformed further, for instance by removing personal information such as the username and email, which cannot be stored across the border. The Apache APISIX-based gateway layer then handles user requests by matching salted hashes instead of the raw identifiers.
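A minimal sketch of such a lookup, assuming the globally synchronized store only holds a salted SHA-256 hash of the username together with the region. The `RegionDirectory` class, the `SALT` value, and the normalization rule are hypothetical and exist only for this example. The salt has to be the same at every gateway, otherwise the same username would not produce the same key.

```python
import hashlib
from typing import Dict, Optional

# Hypothetical shared salt; in practice it would come from secret management.
SALT = b"example-shared-salt"


def hashed_key(username: str) -> str:
    """Salted SHA-256 of the normalized username, safe to replicate globally."""
    normalized = username.strip().lower().encode("utf-8")
    return hashlib.sha256(SALT + normalized).hexdigest()


class RegionDirectory:
    """Globally replicated map from hashed username to home region."""

    def __init__(self) -> None:
        self._regions: Dict[str, str] = {}

    def record(self, username: str, region: str) -> None:
        # Only the hash crosses the border, never the username itself.
        self._regions[hashed_key(username)] = region

    def lookup(self, username: str) -> Optional[str]:
        return self._regions.get(hashed_key(username))
```

At login, the gateway would hash the submitted username in the same way and forward the request to the region returned by `lookup`.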
Business operations in complex scenarios
Business operations are complex. When manipulating a piece of data, how can you decide where to perform the operation? Let’s start with the regular business operation, such as checking accounts or history. There are two modes in this case: a stateful mode and a stateless mode.
The stateful mode uses sessions. After a user logs in, the server sends the user a cookie containing the session ID. On each request, the Apache APISIX-based gateway layer looks up the user’s region from the information in the cookie. Even if the user reaches a different server, he or she stays logged in and the gateway knows where to get the data. For example, a traveling user first logs in in Europe and then flies to Asia. When the user accesses the service from Asia, the gateway determines the data center from the session, sends the request there, and the subsequent business operations are performed in that data center.
The stateless mode covers cases such as API access, where passing a session ID through a cookie is not appropriate. Instead, we use a special token that contains the location of the data center, and Apache APISIX decides which data center to access based on the token. This keeps routing dynamic as the business expands. If the initial design were static, with the data center fixed by the information given at registration, cross-data-center scenarios would be very difficult to solve later, because users could not access a data center based on their business location.
Registration is also complicated, because the data center for the registration data can only be chosen from the information the user fills in. If a user emigrates or a company relocates, the data has to be erased in one data center and the user’s transaction data, username, password, etc. migrated to another, which is quite costly. We now support switching data centers in such complex scenarios, but we also have to consider how to minimize the impact of the switch on the overall architecture.
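The session lookup can be illustrated with the following sketch. It assumes a session store that maps a session ID to a region and is readable from every gateway. The `SessionStore` class and the `session_id` cookie name are hypothetical, and the article does not describe how the real store is built.

```python
import secrets
from http.cookies import SimpleCookie
from typing import Dict, Optional


class SessionStore:
    """Hypothetical store mapping session IDs to the user's home region."""

    def __init__(self) -> None:
        self._regions: Dict[str, str] = {}

    def create(self, region: str) -> str:
        # Random, unguessable session ID issued after a successful login.
        session_id = secrets.token_urlsafe(32)
        self._regions[session_id] = region
        return session_id

    def region_for(self, session_id: str) -> Optional[str]:
        return self._regions.get(session_id)


def route_by_cookie(cookie_header: str, store: SessionStore) -> Optional[str]:
    """Return the data center for a request, based on its Cookie header."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("session_id")
    if morsel is None:
        return None  # no session: treat as a request without user identity
    return store.region_for(morsel.value)
```

A session created in Europe keeps resolving to the European data center regardless of which gateway receives the cookie.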
When we do business BI or big data analysis, we cannot use the user data directly. Sensitive information has to be filtered out first. The data also has to be aggregated: the data from each region is combined to get the overall results.
Meanwhile, user information should also be abstracted. To meet the regulatory requirements, it must be ensured that users cannot be fully identified before the data is used for analysis.
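The filter-then-aggregate idea can be sketched as follows. The field names, the list of sensitive fields, and the choice to aggregate inside each region before combining the results are assumptions for the example, not a description of Airwallex's pipeline. Amounts are kept as integers in minor units to avoid floating-point rounding.

```python
from typing import Dict, Iterable, Mapping

# Hypothetical set of fields that must not leave the region.
SENSITIVE_FIELDS = {"username", "email", "phone"}


def strip_sensitive(record: Mapping[str, object]) -> Dict[str, object]:
    """Drop fields that could identify the user."""
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}


def regional_summary(records: Iterable[Mapping[str, object]]) -> Dict[str, int]:
    """Total transaction amount per currency within one region."""
    totals: Dict[str, int] = {}
    for record in records:
        clean = strip_sensitive(record)
        currency = str(clean["currency"])
        totals[currency] = totals.get(currency, 0) + int(clean["amount_minor"])
    return totals


def global_summary(summaries: Iterable[Mapping[str, int]]) -> Dict[str, int]:
    """Combine per-region summaries into one overall result."""
    combined: Dict[str, int] = {}
    for summary in summaries:
        for currency, amount in summary.items():
            combined[currency] = combined.get(currency, 0) + amount
    return combined
```

Only the per-currency totals are combined across regions in this sketch, so no individual user record takes part in the global result.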
Yang Li, PhD, Committer of Apache APISIX and Technical Platform Lead of Airwallex, is responsible for the evolution of the company’s technology platform. Prior to joining Airwallex, he worked at Wanxiang Blockchain and was responsible for the blockchain alliance. Before that, he took charge of the risk control platform of OTC derivatives at Citigroup.