Snowball Finance Transforms Its Active-Active Architecture with APISIX
Wenjie Shi
April 28, 2023
Wenjie Shi, Senior Development Engineer on the Infrastructure Team at Snowball Finance, presented Snowball Finance's experience with APISIX at the Apache APISIX Summit ASIA 2022. This article is based on that talk and describes how Snowball Finance uses Apache APISIX to evolve its internal active-active architecture and manage its services more flexibly.
Challenges
- Complex SDK authentication modules increase system complexity and security risks when the user center is accessed across regions, which happens because the active-active architecture is only available in the market service module
- OpenResty lacks a robust monitoring system for observability and needs customized scripts to achieve scalability, resulting in higher development and operation costs
- An incomplete NGINX registry center with no heartbeat mechanism lowers availability and stability, making it unable to promptly handle system failures
Goals
- Minimize changes without introducing too many variables while maintaining transparency for business groups
- Handle problems uniformly at the infrastructure level and strive to complete authentication services within the local data center
Results
- Implemented the unified authentication, circuit breaking, and rate limiting at the gateway layer, reducing system coupling and improving service quality in dual data center scenarios
- Established a unified monitoring solution from the gateway to the service layer leveraging the APISIX monitoring system and provided excellent support for global troubleshooting
- Provided an elegant implementation approach for gRPC protocol conversion and service management
Background
Founded in 2010, Snowball Finance started as an investment community and has now become a leading online finance management platform in China, offering various services including high-quality content, real-time market service, trading tools, and wealth management to investors.
Among these services, the real-time market service is connected to multiple upstream data sources and provides stable data services to investors through data streaming, storage, and distribution. Therefore, real-time market services have always been a major resource consumer in Snowball Finance's system.
Consequently, one important task within Snowball Finance is to improve stability continuously, including performance optimization of market services. Even so, during traffic peak periods, some systems may still experience slow response or even become unavailable due to a surge of data, affecting the user experience.
Against this background, Snowball Finance has launched a service active-active transformation plan to provide stable and high-quality services to investors, where Apache APISIX dramatically simplifies the implementation. In addition, as a cloud-native API gateway, APISIX has an active community and rich ecosystem, supporting multiple plugins. These advantages also provide a good foundation for the evolution of Snowball Finance's cloud-native architecture.
Pain Points of Active-Active Transformation
When Snowball Finance ran a standalone architecture, user traffic entered through the Server Load Balancing (SLB), passed through the gateway for simple common logic processing, and was then forwarded to the backend service. The backend service, through an integrated authentication module, used an SDK to authenticate the user with the Snowball User Center and continued with subsequent processing once authentication succeeded.
In practical business scenarios, some pain points have gradually emerged.
1. Complex SDK Authentication Modules
During the active-active transformation, the providers and consumers of microservices could not be deployed and launched at the same time. When Snowball Finance's market service first launched on the cloud, the user center was not yet available there, so cross-data-center calls occurred. According to user center statistics, its RPC calls reached around tens of billions per day, with a peak volume of up to 50,000 QPS, which can result in higher latency when the calls cross data centers.
At the same time, Snowball Finance's authentication logic was complex. In addition to the OAuth 2.0 and JWT protocols, many factors had to be taken into account, such as client versions and the multiple apps under Snowball Finance. The authentication module was also embedded in each service, which made upgrades more difficult.
2. Limited Functionality of OpenResty
Snowball Finance previously used OpenResty as its gateway, but OpenResty was weak in some areas. Developers therefore needed to do more customization when integrating OpenResty with the existing monitoring system, and DevOps engineers needed to add custom scripts for secondary development.
3. Dependency on Self-Developed Registration Center
Currently, Snowball Finance registers HTTP services through its registration center: when a backend service starts, it asks the registration center to register it with the gateway, and when the service stops, it asks for the service node to be removed. The registration center periodically polls the service nodes for health checks. However, compared with open-source projects, the maintenance cost of self-developed services is higher.
API Gateway Technology Selection
Based on the pain points gradually revealed in business practice scenarios, the Snowball Infrastructure Team has started research on gateway products.
The team hopes to keep the change transparent to business groups and to minimize changes without introducing too many variables. The team also wants to solve problems uniformly at the infrastructure level and complete authentication within the local data center. Considering the above, Snowball Finance decided to move the authentication service to the API gateway.
Snowball Finance hopes the new API gateway can meet the following requirements:
- High performance
- Strong scalability
- Support for multiple protocols
- Low cost for gateway authentication
Below is a detailed API gateway technology selection list among OpenResty, Spring Gateway, Kong, and APISIX.
Gateway | Advantages | Disadvantages | O&M Cost | Availability |
---|---|---|---|---|
OpenResty | Highly customizable and stable | Poor observability | High | High |
Spring Gateway | Friendly to Java development | Performance does not meet the requirements | Medium | Medium |
Kong | Powerful and rich in functions | PostgreSQL single-point database | Low | Medium |
APISIX | Cloud-native, supports multiple programming languages, and has strong scalability | / | Low | High |
After considering internal demands and comparing popular gateway products in the market, Snowball Finance ultimately chose Apache APISIX for subsequent architecture.
Practice Based on Apache APISIX
Adjusted Architecture
As shown in the figure above, the current active-active architecture of the Snowball market services is displayed on the left, which corresponds to the architecture in the original data center with few modifications. The right side shows the active-active architecture based on a multi-region design after migration to the cloud.
Snowball Finance mainly made the following adjustments based on APISIX:
- Moved the authentication module to the proxy layer so that APISIX performs unified authentication. JWT tokens can be authenticated locally with APISIX's jwt-auth plugin.
- Kept compatibility with OAuth 2.0 by having APISIX call the Snowball Finance User Center for processing.
- Connected APISIX to the Snowball Finance backend RPC service registry center so that backend services can authenticate a request when JWT authentication fails.
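The flow described by these adjustments (local JWT verification first, the user center only as a fallback) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not Snowball Finance's gateway code: HS256 signing, the `sub` claim, and the `user_center` callable are assumptions made for the example, and expiry checks are omitted for brevity.

```python
import base64
import hashlib
import hmac
import json
from typing import Callable, Optional


def _b64url_decode(segment: str) -> bytes:
    # JWT segments are base64url without padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Build an HS256 JWT (used here only to produce a token for the demo)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    signature = hmac.new(secret, f"{header}.{payload}".encode("ascii"),
                         hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(signature)}"


def verify_hs256(token: str, secret: bytes) -> Optional[dict]:
    """Return the claims if the signature is valid, else None (no expiry check)."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        if header.get("alg") != "HS256":
            return None
        expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode("ascii"),
                            hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return None
        return json.loads(_b64url_decode(payload_b64))
    except (ValueError, AttributeError):
        # Malformed token: wrong segment count, bad base64, or bad JSON.
        return None


def authenticate(token: str, secret: bytes,
                 user_center: Callable[[str], Optional[dict]]) -> Optional[dict]:
    """Try local JWT verification first; fall back to the (hypothetical) user center."""
    claims = verify_hs256(token, secret)
    if claims is not None:
        return {"source": "local-jwt", "user": claims.get("sub")}
    remote = user_center(token)  # stands in for the remote user center call
    if remote is not None:
        return {"source": "user-center", "user": remote.get("user")}
    return None
```

Only tokens that fail local verification reach `user_center`, which is what keeps most authentication traffic inside the local data center.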
Application Scenarios
After the backend services were connected to APISIX, Snowball Finance's practices focused mainly on gateway authentication and observability.
Scenario 1: Gateway Authentication
As mentioned earlier, Snowball Finance's previous architecture had no unified authentication method. One method relied on the internal application side, which used an SDK to call the user center for authentication, while the other used JWT authentication. When these two authentication methods coexisted, it caused issues with scalability and maintainability.
- After integrating APISIX as a gateway, Snowball Finance used the APISIX gateway layer to manage authentication uniformly. The original JWT authentication method was replaced with the official jwt-auth plugin. Configuring the plugin is relatively simple: once the relevant information such as routes and upstreams is configured in the Dashboard, the plugin is enabled.
- Snowball Finance used the grpc-transcode plugin to proxy the authentication service calls to handle the previous OAuth 2.0-related authentication method.
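The article describes configuring jwt-auth through the Dashboard; the same objects can also be expressed as Admin API payloads. The sketch below only builds the requests and does not send them. The admin address, admin key, consumer name, route URI, and upstream node are all placeholders chosen for illustration, not values from Snowball Finance's setup.

```python
import json
import urllib.request

# Assumptions for illustration: adjust to your own deployment.
ADMIN_BASE = "http://127.0.0.1:9180/apisix/admin"
ADMIN_KEY = "replace-with-your-admin-key"


def consumer_payload(username: str, key: str, secret: str) -> dict:
    """A consumer that owns a jwt-auth credential."""
    return {
        "username": username,
        "plugins": {"jwt-auth": {"key": key, "secret": secret}},
    }


def route_payload(uri: str, upstream_node: str) -> dict:
    """A route that requires a valid JWT before proxying to the upstream."""
    return {
        "uri": uri,
        "plugins": {"jwt-auth": {}},
        "upstream": {"type": "roundrobin", "nodes": {upstream_node: 1}},
    }


def admin_put(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT request against the Admin API."""
    return urllib.request.Request(
        url=f"{ADMIN_BASE}{path}",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"X-API-KEY": ADMIN_KEY, "Content-Type": "application/json"},
    )


consumer_req = admin_put("/consumers",
                         consumer_payload("demo_app", "demo-key", "demo-secret"))
route_req = admin_put("/routes/1", route_payload("/quote/*", "127.0.0.1:8080"))
```

Passing either request to `urllib.request.urlopen` would apply it to a running APISIX instance.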
Snowball Finance internally considered the following three solutions to call gRPC to implement authentication:
- Solution 1: Use Lua to call gRPC directly. Since this solution requires considering related implementations such as load balancing and dynamic upstream, the process will be more troublesome, so it was discarded.
- Solution 2: Use Lua coroutine to call back Golang logic. Snowball Finance abandoned this way because it lacked corresponding practical experience.
- Solution 3: Use Lua to make HTTP calls, and the gRPC interface is implemented using APISIX's grpc-transcode plugin. Thanks to the APISIX community’s fast plugin optimization iterations, Snowball Finance finally chose this option to implement gRPC calls.
Currently, protocol buffer files must be synchronized manually. If the user center modifies a protocol buffer file so that it no longer matches the version saved in APISIX, authentication issues can result.
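To make the synchronization concern concrete, here is a rough sketch of the two objects grpc-transcode relies on (a stored proto and a route that references it by `proto_id`), plus a simple digest comparison that could flag drift between the user center's proto file and the copy saved in APISIX. The proto content, the service and method names, the route URI, and the node address are hypothetical, and the drift check is an illustrative idea rather than something the article says Snowball Finance runs.

```python
import hashlib

# Hypothetical proto definition standing in for the user center's real file.
USER_CENTER_PROTO = """syntax = "proto3";
package usercenter;
service AuthService {
  rpc Verify (VerifyRequest) returns (VerifyReply) {}
}
message VerifyRequest { string token = 1; }
message VerifyReply { bool valid = 1; string user = 2; }
"""


def proto_object(content: str) -> dict:
    """Body for storing a proto definition in APISIX (Admin API /protos/{id})."""
    return {"content": content}


def transcode_route(uri: str, proto_id: str, service: str, method: str,
                    grpc_node: str) -> dict:
    """A route that accepts HTTP and forwards the call as gRPC via grpc-transcode."""
    return {
        "uri": uri,
        "plugins": {
            "grpc-transcode": {
                "proto_id": proto_id,
                "service": service,
                "method": method,
            }
        },
        "upstream": {"scheme": "grpc", "type": "roundrobin",
                     "nodes": {grpc_node: 1}},
    }


def proto_digest(content: str) -> str:
    # Ignore leading and trailing whitespace so a stray newline is not reported as drift.
    return hashlib.sha256(content.strip().encode("utf-8")).hexdigest()


def proto_in_sync(source_content: str, stored_content: str) -> bool:
    """True when the copy saved in APISIX matches the user center's current file."""
    return proto_digest(source_content) == proto_digest(stored_content)


route = transcode_route("/auth/verify", "1", "usercenter.AuthService",
                        "Verify", "127.0.0.1:50051")
```

A check like `proto_in_sync` could run in a deployment pipeline so that a mismatch is caught before it shows up as failed authentication.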
Scenario 2: Multi-Dimensional Monitoring under Observability
After a site goes live, Snowball Finance needs to monitor many metrics, focusing mainly on the following three areas:
- NGINX connection status and inbound/outbound traffic
- HTTP error status code rate (used for troubleshooting service or upstream/downstream issues)
- APISIX request latency (the time consumed by the logic execution when APISIX forwards the request)
For example, the APISIX latency sometimes appears high (as shown in the figure below). This follows from how the metric is calculated: it is the total time NGINX spends on a single HTTP request minus the time that request spends on the upstream. The difference between the two durations is the APISIX latency metric.
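The calculation above can be written out in a few lines. The numbers are made up for illustration; the clamp at zero is a defensive choice in this sketch.

```python
def apisix_latency_ms(request_time_ms: float, upstream_latency_ms: float) -> float:
    """Gateway-side latency: total request time minus time spent on the upstream."""
    # Clamp at zero in case timer granularity makes the upstream figure
    # slightly exceed the total.
    return max(request_time_ms - upstream_latency_ms, 0.0)


# A request that spends 95 ms on the upstream out of 120 ms in total.
plain = apisix_latency_ms(request_time_ms=120.0, upstream_latency_ms=95.0)

# The same request when gateway-side logic (for example an authentication
# call made from a plugin) adds 40 ms before the request is proxied.
with_plugin_work = apisix_latency_ms(request_time_ms=160.0, upstream_latency_ms=95.0)

print(plain)             # 25.0
print(with_plugin_work)  # 65.0
```

Because everything that is not upstream time lands in this metric, any work done inside the gateway shows up as APISIX latency, which is why the value can look high.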
After adopting APISIX, adding or modifying plugins changes the logic that runs inside the gateway, which can skew latency-related data. To keep the data trustworthy, Snowball Finance added a latency monitoring plugin. This keeps each monitored metric accurate and makes it easier to locate problems early during troubleshooting.
APISIX's observability capabilities can also be used to collect access logs and deliver them to the traffic dashboard in a formatted, standardized manner for viewing and summarization. This gives a proactive view of overall trends from multiple perspectives, so potential issues can be identified and addressed promptly.
Scenario 3: Scaling the ZooKeeper Registry
In Snowball Finance, gRPC calls are registered and discovered based on the ZooKeeper registry. In the process of implementing authentication, when the local JWT verification fails, the API gateway needs to access the gRPC service in the Snowball Finance User Center for authentication, which requires the API gateway to obtain the backend gRPC service address list from the registration center.
The official APISIX plugin apisix-seed can integrate ZooKeeper for service discovery. Snowball Finance has made some customizations that are more specific to its own business.
The implementation lives mainly on a content node of APISIX. When a Worker process starts, it begins polling the ZK-Rest cluster (like the one in the figure below), periodically pulling the metadata of all services and writing it into the Worker process's local cache to keep the service list up to date.
As the figure above also shows, the ZK-Rest cluster exposes ZooKeeper data through a REST interface. High availability can be achieved simply by adding more ZK-Rest instances, which avoids more complicated operations. The approach has one clear drawback: because the ZK-Rest cluster is polled periodically, updates to the service list may be delayed.
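The polling pattern and its staleness trade-off can be illustrated with a small sketch. It is written in Python as an analogy only: APISIX workers run Lua inside NGINX, and the class name, the `fetch` callable (standing in for an HTTP call to ZK-Rest), and the interval are assumptions made for this example.

```python
import threading
from typing import Callable, Dict, List


class ServiceListCache:
    """Per-worker cache of service nodes, refreshed by periodic polling."""

    def __init__(self, fetch: Callable[[], Dict[str, List[str]]],
                 interval_s: float = 5.0) -> None:
        self._fetch = fetch            # hypothetical call to the ZK-Rest cluster
        self._interval_s = interval_s
        self._lock = threading.Lock()
        self._services: Dict[str, List[str]] = {}
        self._stop = threading.Event()

    def refresh(self) -> bool:
        """Pull a full snapshot; keep the last known good list on failure."""
        try:
            snapshot = self._fetch()
        except Exception:
            return False
        with self._lock:
            self._services = {name: list(nodes) for name, nodes in snapshot.items()}
        return True

    def nodes(self, service: str) -> List[str]:
        """Answer lookups from the local cache only, never from the registry."""
        with self._lock:
            return list(self._services.get(service, []))

    def run(self) -> None:
        """Poll until stop() is called; meant to run in a background thread."""
        while not self._stop.is_set():
            self.refresh()
            self._stop.wait(self._interval_s)

    def stop(self) -> None:
        self._stop.set()


# Demo with an in-memory stand-in for the registry.
registry = {"user-center": ["10.0.0.1:9000"]}
cache = ServiceListCache(fetch=lambda: dict(registry))
cache.refresh()
registry["user-center"] = ["10.0.0.1:9000", "10.0.0.2:9000"]
stale = cache.nodes("user-center")   # still the old list until the next poll
cache.refresh()
fresh = cache.nodes("user-center")   # now includes the new node
```

Between two polls the cache keeps serving the previous list, so a node change can go unnoticed for up to one polling interval. That is the update delay mentioned above, and a shorter interval trades it for more load on ZK-Rest.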
Summary and Outlook
Currently, Apache APISIX is running well as the gateway layer within Snowball Finance. Specifically:
- Unified authentication, circuit breaking, and rate limiting functions are implemented at the gateway layer, reducing the coupling of the overall system and improving the service quality under dual data centers.
- With the help of APISIX's monitoring, a unified monitoring solution from the gateway to the service layer is established, providing good support for global troubleshooting.
- An elegant implementation approach is provided for the conversion and service management of the gRPC protocol.
Going forward, Snowball Finance also plans to:
- Apply the APISIX Ingress Controller to the K8s cluster.
- Use the grpc-transcode plugin for HTTP/gRPC protocol conversion to achieve a unified backend interface.
- Use the traffic-split plugin for traffic labeling, connect to the Nacos registry center, and implement canary release and other service governance.
In the subsequent plans, Apache APISIX will be used to replace the existing OpenResty and ultimately achieve the management of full-lifecycle north-south traffic.