From Traefik to APISIX: Horizon Robotics' Exploration of Ingress Controllers
Xin Zhang
October 10, 2022
In the automotive industry, most companies are shifting toward autonomous driving and new energy vehicles. For autonomous driving in particular, each company has invested heavily in developing and training autonomous driving models.
In this process, how do you keep the business stable and efficient while the product iterates rapidly?
This article uses Horizon Robotics' AI development platform as an example to show how the Apache APISIX API gateway and its Ingress Controller helped the R&D team solve this pain point.
Gateway Comparison
Traefik’s Limitations
Before adopting APISIX Ingress Controller, the business system used Traefik 1.x as its Ingress Controller, which had several problems:
- Traefik 1.x configures routing rules through Ingress, and some plugins have to be enabled by adding Annotations. An Annotation, however, applies to all rules under that Ingress, so more granular configuration is not possible.
- Traefik 1.x does not support visual configuration of specific rules, so you cannot directly locate the service behind a given request URL from a browser.
- Traefik's default configuration file (`ConfigMap`) exposes only a few attributes; many defaults have to be looked up in the official documentation, and some parameters are inconsistent with NGINX's defaults, which makes maintenance troublesome.
In response to these problems, Horizon Robotics' technical team decided to replace the Ingress Controller. At the beginning of the selection process, the team considered upgrading Traefik to 2.0, but since the upgrade would also require adopting new CRDs and the migration cost would be high, we evaluated other Ingress Controller solutions as well.
Advantages of APISIX Ingress Controller
In the early stage of selection, we mainly compared Apache APISIX, Kong, and Envoy. Apart from APISIX Ingress, however, the other solutions could not fully meet the needs of our existing scenarios in terms of functionality or performance, so we finally chose APISIX Ingress. Beyond the general features, we were most interested in the following points.
- Rich Plugins: The plugin ecosystem is mature. All plugins supported by APISIX can be configured declaratively through `apisix-ingress-controller`, and a plugin can be customized for a single backend under `ApisixRoute` (see the sketch after this list).
- Visual Configuration: With APISIX Dashboard, you can see every APISIX route. If the same domain is configured in multiple `namespaces` or YAML files, you can search by path prefix in the Dashboard to quickly locate the conflict.
- Fine-grained Verification: APISIX Ingress Controller validates the resources declared in the CRDs it manages. If a CRD references a non-existent service, the error message is recorded in the `event` of the `ApisixRoute` and the change does not take effect, which reduces problems caused by misconfiguration.
- Rich Features: APISIX supports hot updates and hot plugins, proxy request rewriting, multiple authentication methods, multi-language plugin development, and many other features. Please refer to APISIX features for more information.
- Active Community: Compared with other open source solutions' communities, APISIX has many active maintainers and contributors on Slack, GitHub, and the mailing list.
- High Performance: As the chart below shows, in stress tests against Envoy, APISIX delivers roughly 120% of Envoy's performance, and the more cores there are, the bigger the QPS gap becomes.
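As an illustration of the per-backend plugin point above, here is a minimal sketch of an `ApisixRoute` that enables the `limit-count` rate-limiting plugin for a single rule. The host, service name, and plugin parameters are hypothetical, not our production values, and the sketch assumes the `apisix.apache.org/v2` CRD version.

```yaml
# Hypothetical ApisixRoute: the limit-count plugin applies only to this rule,
# something Traefik 1.x Annotations could not express per rule.
apiVersion: apisix.apache.org/v2
kind: ApisixRoute
metadata:
  name: demo-route
  namespace: demo
spec:
  http:
    - name: demo-rule
      match:
        hosts:
          - demo.example.com        # illustrative host
        paths:
          - /api/*
      backends:
        - serviceName: demo-service # illustrative backend Service
          servicePort: 80
      plugins:
        - name: limit-count         # rate limiting for this rule only
          enable: true
          config:
            count: 100              # allow 100 requests ...
            time_window: 60         # ... per 60-second window
            rejected_code: 429
```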
Overall Architecture
As the architecture diagram below shows, APISIX Ingress serves as the entry point for all traffic. Whether it comes from command-line tools, the Web, SaaS platforms, or OpenAPI, all traffic reaches the upstream (business services) through APISIX Ingress. As for authentication, since the company already has a dedicated authentication service, we use APISIX's `forward-auth` plugin directly for external authentication.
At the gateway layer, all traffic enters through the domain name and first passes through LVS, which forwards it to the back-end APISIX nodes; APISIX then distributes the traffic to the corresponding Pods according to the routing rules. On LVS, we also changed the default port of APISIX Ingress from `9180` to `80` so that LVS can point directly at APISIX Ingress, which makes traffic forwarding easier.
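For illustration, here is a minimal sketch of the kind of Service change this implies, assuming the APISIX data-plane Pods carry an `app.kubernetes.io/name: apisix` label. The names, namespace, and label are hypothetical, and 9080 is APISIX's default HTTP proxy port inside the container.

```yaml
# Hypothetical Service exposing the APISIX data plane on port 80,
# so LVS can forward domain traffic to it directly.
apiVersion: v1
kind: Service
metadata:
  name: apisix-gateway
  namespace: ingress-apisix
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: apisix  # illustrative Pod label
  ports:
    - name: http
      port: 80          # port LVS points at
      targetPort: 9080  # APISIX's default HTTP proxy port in the container
      protocol: TCP
```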
Scenarios
Now that we have covered the overall architecture, let's look at a few scenarios our company is currently implementing with APISIX Ingress.
Oversized File Upload
The first scenario is large file uploads, which may be uncommon at most companies but is common at companies that train AI models. It mainly arises in Horizon Robotics' model training system: the data collected by R&D is uploaded to the system over the network, and a single data set is usually several hundred GB. Without adjusting any APISIX parameters, uploading that much data causes OOM.
Because the default `client_body_buffer_size` is 1 MB, once the buffer is full the request body is written to temporary files on disk, causing high disk I/O. If the directory for temporary files is instead pointed at shared memory (`/dev/shm`), that in turn drives up APISIX's memory (cache) usage.
After continuous debugging, we found the cause: APISIX had not enabled streaming upload. For this scenario, we upgraded APISIX from 2.11 to 2.13 and adjusted its parameters. First, we set `proxy_request_buffering` to `off` in the APISIX `ConfigMap` to enable streaming upload. Second, we extracted the reusable configuration into the `ApisixPluginConfig` CRD provided by APISIX Ingress Controller and dynamically set `client_max_body_size` as a namespace-level configuration for the routes that need it. A sketch of both changes follows.
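Here is a minimal sketch of both changes, assuming the `proxy_request_buffering` directive is injected through the server configuration snippet in `config.yaml` and that the `client-control` plugin is what adjusts `client_max_body_size` per route. All names, namespaces, and values are hypothetical, not our production settings.

```yaml
# 1) Fragment of APISIX's config.yaml (delivered via the ConfigMap):
#    stream request bodies to the upstream instead of buffering them.
nginx_config:
  http_server_configuration_snippet: |
    proxy_request_buffering off;
---
# 2) Hypothetical reusable plugin configuration for upload routes in one
#    namespace, lifting the body-size limit via the client-control plugin.
apiVersion: apisix.apache.org/v2
kind: ApisixPluginConfig
metadata:
  name: large-upload
  namespace: model-training
spec:
  plugins:
    - name: client-control
      enable: true
      config:
        max_body_size: 0      # 0 removes the body-size limit
---
# 3) A route opts in by referencing the plugin configuration above.
apiVersion: apisix.apache.org/v2
kind: ApisixRoute
metadata:
  name: dataset-upload
  namespace: model-training
spec:
  http:
    - name: upload
      match:
        paths:
          - /upload/*
      backends:
        - serviceName: upload-service  # illustrative backend
          servicePort: 80
      plugin_config_name: large-upload
```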
Service Calls in Multi-cloud Environments
For service calls in multi-cloud environments, part of the business traffic first arrives at the local IDC and then goes through APISIX Ingress to reach the Pods. Some services in the Pods access AliCloud services through their domain names. In addition, there are scenarios where one service invokes another, mainly for multi-cloud training: users take the IDC as the entry point, select a cluster, and submit the task to the corresponding cloud cluster.
External Authentication with forward-auth
When we first started using APISIX Ingress, APISIX did not yet support the `forward-auth` plugin, so we built a custom plugin based on apisix-go-plugin-runner. That added an extra layer of gRPC calls, which made debugging difficult and left the logs invisible. After APISIX added support for the `forward-auth` plugin at the beginning of this year, we replaced the custom plugin with the official one, removing that gRPC layer and making monitoring more convenient. A sketch of the plugin configuration follows.
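A minimal sketch of wiring the official plugin, assuming it is shared through an `ApisixPluginConfig`; the auth service URL and header names are hypothetical stand-ins for the company's internal authentication service.

```yaml
# Hypothetical plugin config that delegates authentication to an internal
# auth service; every route that references it gets external authentication.
apiVersion: apisix.apache.org/v2
kind: ApisixPluginConfig
metadata:
  name: external-auth
  namespace: demo
spec:
  plugins:
    - name: forward-auth
      enable: true
      config:
        uri: "http://auth-service.auth.svc.cluster.local/verify"  # illustrative auth endpoint
        request_headers: ["Authorization", "Cookie"]  # sent on to the auth service
        upstream_headers: ["X-User-ID"]               # copied from the auth response to the upstream
        client_headers: ["Location"]                  # returned to the client when rejected
```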
Application Monitoring
For application monitoring, we enabled the APISIX `prometheus` plugin globally and did some debugging and tuning for our own business, such as adding real-time concurrency, QPS, real-time API success rate, and real-time bandwidth for more fine-grained monitoring of APISIX.
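Enabling the plugin globally can also be expressed declaratively. Below is a minimal sketch using the `ApisixClusterConfig` CRD; the resource name `default` follows the controller's convention for the default cluster, and the custom panels mentioned above are built on the metrics this exposes.

```yaml
# Hypothetical cluster-level config: turns on the prometheus plugin for
# every route managed by this APISIX cluster.
apiVersion: apisix.apache.org/v2
kind: ApisixClusterConfig
metadata:
  name: default
spec:
  monitoring:
    prometheus:
      enable: true
```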
Summary
We currently use Apache APISIX Ingress Controller as the traffic gateway for only some of our business lines, and we will bring other businesses online to contribute richer application scenarios to the community. If you are also comparing Ingress Controller solutions, we hope this article gives you some useful pointers. More and more users are running Apache APISIX Ingress in production; if you are one of them, please share your use cases with the community.