iQIYI Unlocks API Gateway Innovation with Apache APISIX
- Heavy traffic triggers numerous low CPU idle alerts every day.
- The system architecture depends on too many components.
- High O&M (Operation and Maintenance) cost.
- Choose a high-performance API gateway that meets iQIYI's requirements.
- Minimize the migration cost.
- Find an API gateway with an active community and a healthy ecosystem.
- Build a brand-new API gateway to fit the cloud-native trend.
- Performance improved 10x, with the gateway handling millions of QPS daily.
- Easily supports over 9,000 online API business projects.
- Achieved multi-site, multi-level disaster recovery across China.
What happened behind iQIYI?
Cong He, a senior R&D engineer at iQIYI, recently gave a talk at the Apache APISIX Meetup in Shanghai. He works on the computing cloud team in the IIG infrastructure department and is in charge of gateway development and operations at iQIYI. Let's walk through his talk and iQIYI's API gateway story to get a better understanding of Apache APISIX.
Founded on April 22, 2010, by Baidu, the company behind China's largest online search engine, iQIYI is currently one of the largest online video sites in the world, with nearly 6 billion hours spent on its service each month and over 500 million monthly active users.
iQIYI previously developed its own API gateway, Skywalker, a custom build based on Kong. Skywalker now handles massive traffic: at its daily peak, the gateway service can reach a million QPS across thousands of API routes. However, its shortcomings have gradually become apparent.
- Gateway performance no longer satisfies iQIYI's requirements, and massive traffic triggers numerous low CPU idle alerts every day.
- The system architecture depends on too many components.
- O&M (Operation and Maintenance) cost is too high.
After taking over this project this year, Cong started investigating similar gateway products to resolve the above issues and difficulties and finally found Apache APISIX.
How Apache APISIX Outpaces Kong
Before choosing Apache APISIX, iQIYI had already started using Kong, but they abandoned it later.
Why abandon Kong?
Drawing on hands-on experience with Kong, Cong explained why his team abandoned it. Below are the core disadvantages he identified.
- Kong's routes use traversal queries, which are not fast.
- Kong's dependency on a Postgres database results in bloated deployment, inefficient synchronization, and low availability.
- The code is tightly coupled.
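To make the first point concrete, here is a minimal sketch contrasting a traversal-style route lookup with a tree-based one. APISIX's router is built on a radix tree; the character-level trie below is a simplified stand-in, not the actual code of either gateway, and the route prefixes and upstream names are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple


def match_by_traversal(routes: List[Tuple[str, str]], path: str) -> Optional[str]:
    """Check every route and keep the longest matching prefix.

    Cost grows with the number of routes.
    """
    best: Optional[Tuple[str, str]] = None
    for prefix, upstream in routes:
        if path.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, upstream)
    return best[1] if best else None


class PrefixTrie:
    """Character-level trie: lookup cost depends on path length, not route count."""

    def __init__(self) -> None:
        self.children: Dict[str, "PrefixTrie"] = {}
        self.upstream: Optional[str] = None

    def insert(self, prefix: str, upstream: str) -> None:
        node = self
        for ch in prefix:
            node = node.children.setdefault(ch, PrefixTrie())
        node.upstream = upstream

    def match(self, path: str) -> Optional[str]:
        node: Optional["PrefixTrie"] = self
        best = self.upstream
        for ch in path:
            node = node.children.get(ch)
            if node is None:
                break  # no longer prefix exists; keep the best match so far
            if node.upstream is not None:
                best = node.upstream
        return best


# Hypothetical routes, for illustration only.
ROUTES = [("/api/", "svc-api"), ("/api/video/", "svc-video"), ("/static/", "cdn")]
TRIE = PrefixTrie()
for route_prefix, route_upstream in ROUTES:
    TRIE.insert(route_prefix, route_upstream)
```

Both functions return the same upstream for a given path; the difference is that the traversal version touches every route on every request, which matters once a gateway holds thousands of routes.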
Why choose APISIX?
"We compared the performance between Apache APISIX and Kong during the investigation, and surprisingly found Apache APISIX was 10 times better than Kong in terms of performance optimization. We also compared Apache APISIX to some other major gateway products, and Apache APISIX's response latency is always more than 50% lower than other products. Furthermore, Apache APISIX can still run stably even if the CPU usage reaches more than 70%. APISIX is really amazing!" Cong said.
Both Apache APISIX and Kong are built on OpenResty, which keeps the migration cost relatively low. In addition, Apache APISIX is highly adaptable and can be deployed easily in many different environments, including cloud computing platforms.
"Meanwhile, we also found Apache APISIX is a highly active open-source project that resolves issues very quickly. Its cloud-native framework also aligns with our company's follow-up plans. Thus, we chose Apache APISIX as our API gateway."
iQIYI's API Gateway Architecture Innovation after Using APISIX
After choosing Apache APISIX, iQIYI began building its new API gateway architecture, which is shown below and comprises the domain name, gateway, service instances, and monitoring and alerting.
DPVS is an open-source project that iQIYI developed on top of LVS. The Hubble monitoring and alerting system is likewise a heavily customized build on an open-source project, and iQIYI also optimized Consul for performance and high availability.
Achievement 1: Improved the data and control planes for cluster and service management
The data plane mainly serves frontend users. The entire path from the load balancer to the gateway is deployed across multiple sites and links for disaster recovery, so that users can access their nearest data center.
On the control plane, a microservice platform manages multiple clusters and services. It gives users a one-stop service without submitting tickets, which saves a significant amount of time. At the backend, the gateway controller manages the configuration of all APIs, such as API creation and plugins, while the service controller handles service registration, deregistration, and health checks.
Achievement 2: Added more features: security control, rate-limiting, and monitoring
After adopting Apache APISIX, iQIYI implemented basic functionality in its API architecture such as rate-limiting, authentication, alerting, and monitoring.
The first part concerns HTTPS. For security reasons, iQIYI does not store certificates or keys on the gateway servers but on a dedicated remote server. This was difficult to achieve with Kong, so iQIYI placed an NGINX tier in front of the gateway to offload HTTPS. After migrating, iQIYI implemented this feature directly on Apache APISIX, which removes one hop from the request path.
For rate-limiting, beyond the basic functionality, iQIYI also implemented precise rate-limiting and per-user rate-limiting. For authentication, dedicated services for passport authentication are provided. Moreover, the gateway can connect to the company's WAF security cloud to filter out malicious traffic from the underground industry.
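As a rough illustration of per-user rate-limiting, the sketch below keeps a fixed-window counter for each user ID. It is a minimal in-process example written for this article, not iQIYI's implementation or an APISIX plugin; the class name, parameters, and the choice of a fixed window are all assumptions.

```python
import time
from typing import Callable, Dict, Tuple


class PerUserRateLimiter:
    """Fixed-window limiter: at most `limit` requests per user per window."""

    def __init__(
        self,
        limit: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,  # injectable for testing
    ) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        # user_id -> (window start time, requests counted in that window)
        self._counters: Dict[str, Tuple[float, int]] = {}

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        start, count = self._counters.get(user_id, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # the previous window expired; start a new one
        if count >= self.limit:
            self._counters[user_id] = (start, count)
            return False
        self._counters[user_id] = (start, count + 1)
        return True
```

Because counters are keyed by user ID, one heavy user exhausting their quota does not affect anyone else. A gateway running on many nodes would need shared storage for the counters, which this sketch leaves out.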
Monitoring and alerting are implemented with Apache APISIX's built-in Prometheus plugin, and the metrics are sent directly to the company's monitoring system. Apache APISIX also supports iQIYI's logging and trace analysis services.
Achievement 3: Established dynamic service discovery updating process
The service discovery mentioned above works by registering services with Consul clusters through the service center and then using DNS service discovery to update them dynamically. QAE in the graph is a microservice platform used internally at iQIYI. The following example briefly shows how instances are updated.
When instances are updated, the corresponding nodes are first deregistered from Consul, and the API gateway controller sends the gateway a request to refresh its DNS cache. Once the cache has been updated successfully, the controller notifies the QAE platform to stop all associated backend application nodes, so that traffic is never forwarded to nodes that are already offline.
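The ordering of these steps is what matters, and it can be sketched as follows. The `Registry`, `Gateway`, and `Platform` interfaces and their method names are hypothetical stand-ins for Consul, the gateway, and the QAE platform; they do not correspond to any real client API.

```python
from typing import List, Protocol


class Registry(Protocol):  # stand-in for Consul (hypothetical interface)
    def deregister(self, node: str) -> None: ...


class Gateway(Protocol):  # stand-in for the gateway (hypothetical interface)
    def refresh_dns_cache(self, service: str) -> bool: ...


class Platform(Protocol):  # stand-in for the QAE platform (hypothetical interface)
    def stop_nodes(self, nodes: List[str]) -> None: ...


def update_instances(
    service: str,
    nodes: List[str],
    registry: Registry,
    gateway: Gateway,
    platform: Platform,
) -> bool:
    """Take nodes out of rotation before stopping them."""
    # 1. Deregister the nodes so discovery stops returning them.
    for node in nodes:
        registry.deregister(node)
    # 2. Ask the gateway to refresh its DNS cache.
    if not gateway.refresh_dns_cache(service):
        # The gateway may still route to these nodes, so leave them running.
        return False
    # 3. Only now is it safe to stop the backend nodes.
    platform.stop_nodes(nodes)
    return True


class InMemoryFake:
    """Records calls so the ordering can be inspected; plays all three roles."""

    def __init__(self, refresh_ok: bool = True) -> None:
        self.refresh_ok = refresh_ok
        self.calls: List[str] = []

    def deregister(self, node: str) -> None:
        self.calls.append(f"deregister:{node}")

    def refresh_dns_cache(self, service: str) -> bool:
        self.calls.append(f"refresh:{service}")
        return self.refresh_ok

    def stop_nodes(self, nodes: List[str]) -> None:
        self.calls.append("stop:" + ",".join(nodes))
```

If the cache refresh fails, the function returns without stopping anything, which mirrors the goal stated above: a node is shut down only after the gateway can no longer send traffic to it.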
Achievement 4: Improved directional routes capability
The gateway is deployed across multiple sites, and a full set of multi-site backup links was built in advance. Cong also recommends that users deploy their backend services across multiple sites with nearest access. Users create an API service on the Skywalker gateway platform, and the controller deploys the API routes to the gateway clusters in all data centers. Meanwhile, the service domain's default CNAME points to a unified gateway domain name.
Apache APISIX can directly serve nearest-access multi-site deployments with failover for disaster recovery, and it supports user-defined route resolution. Users who need failover, blue-green deployment, or canary release can customize route resolution through the UUID domain name. Apache APISIX also supports custom scheduling of backend service recovery.
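Nearest access with failover boils down to choosing the closest data center that is still healthy. The sketch below shows that selection rule only; the data center names and latency figures are invented, and how iQIYI actually measures distance or health is not described in the talk.

```python
from typing import Dict, Optional, Set


def pick_datacenter(latency_ms: Dict[str, float], healthy: Set[str]) -> Optional[str]:
    """Return the lowest-latency healthy data center, or None if all are down."""
    candidates = [dc for dc in latency_ms if dc in healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda dc: latency_ms[dc])


# Hypothetical measurements from one client's point of view.
LATENCY_MS = {"dc-north": 8.0, "dc-east": 21.0, "dc-south": 35.0}
```

With all three sites healthy, the client is sent to `dc-north`; if that site is removed from the healthy set, the same call falls back to `dc-east` without any change to the client.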
Achievement 5: Improved multi-site & multi-level disaster tolerance
To handle heavy traffic, a large number of clusters, and a broad client base, iQIYI needs nearest-service access and disaster recovery at the operational level.
In addition to multi-site and multi-link backups for disaster recovery, iQIYI also has to consider failures at multiple levels and across multiple nodes, and APISIX helps here. The closer a failed node is to the client, the greater the impact on business and traffic.
- If a backend service node, the layer farthest from the client, goes down, iQIYI can cut off that single node or fail over from the failed DC through the service center and the gateway's health check mechanism. The impact is limited to specific services and does not affect users.
- If a gateway node goes down, iQIYI can use the health check mechanism of the L4 DPVS layer to cut off the failed gateway nodes. The impact is relatively small and does not affect users.
- If the circuit-breaking measures above cannot isolate the failed node, iQIYI can fail over automatically in DNS, based on multi-node availability dial tests run per domain name from the external network. However, this method is relatively slow, can affect many other services, and may be noticeable to users.
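The first two levels rely on health checks to take failed nodes out of rotation. A minimal sketch of that idea, assuming consecutive-failure and consecutive-success thresholds, is shown below. The class, thresholds, and node names are illustrative and are not taken from DPVS, Consul, or APISIX.

```python
from typing import Dict, Iterable, List


class HealthTracker:
    """Mark a node unhealthy after N consecutive failed probes, healthy after M successes."""

    def __init__(self, unhealthy_after: int = 3, healthy_after: int = 2) -> None:
        self.unhealthy_after = unhealthy_after
        self.healthy_after = healthy_after
        self._healthy: Dict[str, bool] = {}
        # Positive values count consecutive successes, negative values failures.
        self._streak: Dict[str, int] = {}

    def report(self, node: str, ok: bool) -> None:
        streak = self._streak.get(node, 0)
        if ok:
            streak = streak + 1 if streak > 0 else 1
        else:
            streak = streak - 1 if streak < 0 else -1
        self._streak[node] = streak

        healthy = self._healthy.get(node, True)  # unknown nodes start healthy
        if healthy and streak <= -self.unhealthy_after:
            healthy = False
        elif not healthy and streak >= self.healthy_after:
            healthy = True
        self._healthy[node] = healthy

    def healthy_nodes(self, nodes: Iterable[str]) -> List[str]:
        """Filter a node list down to those currently considered healthy."""
        return [n for n in nodes if self._healthy.get(n, True)]
```

Requiring several consecutive results before changing state keeps a single dropped probe from removing a node, and keeps a flapping node from being readmitted too early.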
iQIYI's Future Plan
As it integrates APISIX, iQIYI is working through some issues to make the gateway fit its business better.
Considering the possible bottlenecks in some dependent components, such as etcd, Prometheus monitoring, and the logging service, iQIYI plans to use a hybrid deployment method: large clusters share these components, while small clusters stay independent. For example, vital services will be deployed on the small clusters.
iQIYI also plans to further trim and optimize its Prometheus monitoring metrics, especially those for DNS service discovery.
Cong said, "We hope Apache APISIX can support more features and maintain excellent performance efficiency as well as stability in future developments and updates."
Looking for APISIX support?
Do you want to accelerate your development with confidence like iQIYI? To maximize APISIX support, you need API7. We provide in-depth support for APISIX and API management solutions based on your needs!
Contact us whenever you want: https://api7.ai/contact.