From Traefik to APISIX, Horizon Robotics's Exploration in Ingress Controller

October 10, 2022

Case Study

In the automotive industry, most companies are transitioning to autonomous driving and new energy. In autonomous driving in particular, each company has invested heavily in developing and training its driving models.

In this process, how can a team keep the business stable and efficient while the product iterates rapidly?

This article uses Horizon Robotics's AI development platform as an example to show how the API gateway Apache APISIX and its Ingress Controller helped Horizon Robotics's R&D team solve this pain point.

Gateway Comparison

Traefik’s Limitations

Before adopting APISIX Ingress Controller, the business system used Traefik 1.x as its Ingress Controller, which caused several problems.

  • Traefik 1.x configures routing rules through Ingress, and some plugins have to be configured through annotations. A plugin added this way applies to every rule under that Ingress, so finer-grained configuration is not possible.
  • Traefik 1.x does not support visual configuration of specific rules, so you cannot locate the service behind a given request URL directly from a browser.
  • Traefik's default configuration file (ConfigMap) exposes only a few attributes, so many default settings have to be looked up in the official documentation. Some parameters also differ from the NGINX defaults, which makes maintenance more troublesome.

In response to the above problems, Horizon Robotics's technical team decided to replace the Ingress Controller. At the beginning of the selection process, the team considered upgrading Traefik to 2.0 to solve these problems. However, the upgrade also required moving to new CRDs and the migration cost was high, so we evaluated other Ingress Controller solutions as well.

Advantages of APISIX Ingress Controller

In the early stage of selection, we mainly compared Apache APISIX, Kong, and Envoy. Apart from APISIX Ingress, each of the other solutions fell short of our existing scenarios in either functionality or performance, so we finally chose APISIX Ingress. Beyond the general features, the following points mattered most to us.

  • Rich Plugins: The plugin ecosystem is mature. All the plugins supported by APISIX can be configured declaratively through apisix-ingress-controller, and plugins can be customized for a single backend under ApisixRoute.
  • Visual Configuration: With APISIX Dashboard, you can see every APISIX route. If the same domain is configured in multiple namespaces or YAML files, you can search for the path prefix in APISIX Dashboard to quickly locate the conflicting route.
  • Fine-grained Verification: APISIX Ingress Controller verifies the resources declared in the CRDs it manages. If a CRD declares a non-existent service, the error message is stored in the events of the ApisixRoute and the change does not take effect, which reduces some of the problems caused by misuse.
  • Rich Features: APISIX supports hot updates and hot-reloaded plugins, proxy request rewriting, multiple authentication methods, multi-language plugin development, and many other features. Please refer to APISIX features for more information.
  • Active Community: Compared with the communities of other open source solutions, APISIX has many active maintainers and contributors on Slack, GitHub, and the mailing list.
  • High Performance: As you can see from the chart below, APISIX delivers about 120% of Envoy's performance in the pressure test, and the QPS gap widens as the number of cores increases.

QPS
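To make the per-route plugin point above concrete, here is a minimal sketch that builds an ApisixRoute manifest in which only one rule carries a plugin. The host, namespace, service names, and limit values are hypothetical, and the `apiVersion` is an assumption that should be adjusted to the CRD version installed in your cluster.

```python
import json


def apisix_route(name, namespace, host, rules):
    """Build an ApisixRoute manifest; each rule carries its own plugin list."""
    http = []
    for rule in rules:
        http.append({
            "name": rule["name"],
            "match": {"hosts": [host], "paths": rule["paths"]},
            "backends": [
                {"serviceName": rule["service"], "servicePort": rule["port"]}
            ],
            # Plugins are attached to this rule only, not to the whole resource.
            "plugins": [
                {"name": plugin, "enable": True, "config": config}
                for plugin, config in rule.get("plugins", {}).items()
            ],
        })
    return {
        "apiVersion": "apisix.apache.org/v2beta3",  # assumption: match your CRDs
        "kind": "ApisixRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"http": http},
    }


# Hypothetical names: only the "upload" rule is rate limited.
manifest = apisix_route(
    name="training-platform",
    namespace="ai-platform",
    host="ai.example.com",
    rules=[
        {
            "name": "upload",
            "paths": ["/upload/*"],
            "service": "upload-svc",
            "port": 8080,
            "plugins": {
                "limit-count": {
                    "count": 100,
                    "time_window": 60,
                    "rejected_code": 429,
                }
            },
        },
        {"name": "web", "paths": ["/*"], "service": "web-svc", "port": 80},
    ],
)

# kubectl accepts JSON as well as YAML, so the output can be applied directly.
print(json.dumps(manifest, indent=2))
```

With Traefik 1.x annotations, the equivalent setting would have applied to both rules, because the annotation lives on the Ingress rather than on an individual rule.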

Overall Architecture

As you can see from the architecture diagram below, APISIX Ingress serves as the entry point for all traffic. All incoming traffic reaches the upstream (business services) through APISIX Ingress, whether it comes from command line tools, the Web, SaaS platforms, or OpenAPI. As for authentication, since the company already has a dedicated authentication service, it uses APISIX's forward-auth plugin directly to implement external authentication.

Architecture

At the gateway layer, all traffic enters through the domain name. It first passes through LVS, which forwards it to a back-end APISIX node, and APISIX then distributes it to the corresponding Pod according to the routing rules. We also changed the default port of APISIX Ingress from 9180 to 80 so that LVS can point directly at APISIX Ingress, which makes the traffic easier to forward.

Flow Chart
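The last hop in this flow, where APISIX picks an upstream according to the routing rules, can be illustrated with a deliberately simplified sketch. It matches on host and longest path prefix only. APISIX's actual router is radix-tree based and also considers priorities and other match conditions, so treat this as an illustration of the idea, not of the implementation. All hosts and service names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Route:
    host: str
    path_prefix: str
    upstream: str  # hypothetical "namespace/service" label


def match_route(routes: List[Route], host: str, path: str) -> Optional[Route]:
    """Return the route for this host whose path prefix is the longest match."""
    candidates = [
        r for r in routes
        if r.host == host and path.startswith(r.path_prefix)
    ]
    return max(candidates, key=lambda r: len(r.path_prefix), default=None)


routes = [
    Route("ai.example.com", "/", "web/frontend"),
    Route("ai.example.com", "/api/train", "training/train-api"),
    Route("ai.example.com", "/api", "platform/api-server"),
]

hit = match_route(routes, "ai.example.com", "/api/train/jobs/42")
print(hit.upstream if hit else "no route")  # prints "training/train-api"
```

The same lookup is what makes a path-prefix search useful for conflict hunting: two namespaces that declare the same host and prefix compete for the same requests.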

Scenarios

After understanding the overall architecture, we will share a few scenarios that our company is currently implementing with APISIX Ingress.

Oversized File Upload

First is the large file upload scenario, which may be rare in most companies but is common in companies that train AI models. It occurs mainly in the Horizon Robotics model training system, where the data collected by R&D is uploaded to the system over the network. The data is usually several hundred GB or more, and with APISIX's default parameters an upload of that size causes OOM.

iTerm

Because the default client_body_buffer_size is 1MB, once the buffer is full the rest of the request body is written to temporary files on disk, which causes high disk I/O.

If the directory for the temporary files is pointed at shared memory (/dev/shm) instead, this in turn leads to high memory (cache) usage by APISIX.

Monitor
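The buffering behaviour described above can be modelled with a few lines of arithmetic. This is a simplified model and not NGINX's actual code path: it assumes the body fills the in-memory buffer first and that every byte beyond the buffer is spilled to temporary storage. The chunk sizes are hypothetical.

```python
MIB = 1024 * 1024


def buffer_request_body(chunk_sizes, buffer_size):
    """Return (bytes_in_memory, bytes_spilled) for a fully buffered upload.

    Simplified model: the buffer fills first, everything else is spilled
    to temporary files (on disk, or in /dev/shm if pointed there).
    """
    in_memory = 0
    spilled = 0
    for size in chunk_sizes:
        room = max(buffer_size - in_memory, 0)
        keep = min(size, room)
        in_memory += keep
        spilled += size - keep
    return in_memory, spilled


# A 4 MiB upload arriving in 512 KiB chunks against a 1 MiB buffer.
in_memory, spilled = buffer_request_body([512 * 1024] * 8, buffer_size=1 * MIB)
print(in_memory // MIB, "MiB in memory,", spilled // MIB, "MiB spilled")
# prints "1 MiB in memory, 3 MiB spilled"
```

For an upload of several hundred GB the spilled share is effectively the whole body, which is why the cost shows up as disk I/O or, with /dev/shm, as memory.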

After continuous debugging, we found that the cause was that streaming upload was not enabled in APISIX. To address this, we upgraded APISIX from 2.11 to 2.13 and adjusted its parameters. First, we set proxy_request_buffering to off in the APISIX ConfigMap to enable streaming upload. Second, we extracted the reusable configuration into the ApisixPluginConfig CRD provided by APISIX Ingress Controller and used it to set client_max_body_size dynamically, as a namespace-level configuration, for the routes that need it.

Debug
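As a rough sketch of the second step, the snippet below generates an ApisixPluginConfig and a route that references it. The article does not say which plugin was used to set client_max_body_size. This sketch assumes APISIX's client-control plugin and its `max_body_size` attribute, and it assumes the `plugin_config_name` field of ApisixRoute. The resource names, host, size limit, and `apiVersion` are all hypothetical.

```python
import json

GIB = 1024 ** 3
API_VERSION = "apisix.apache.org/v2beta3"  # assumption: match your CRDs


def plugin_config(name, namespace, max_body_bytes):
    """Reusable, namespace-scoped plugin set that raises the body size limit."""
    return {
        "apiVersion": API_VERSION,
        "kind": "ApisixPluginConfig",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "plugins": [
                {
                    # Assumption: client-control is the plugin used here.
                    "name": "client-control",
                    "enable": True,
                    "config": {"max_body_size": max_body_bytes},
                }
            ]
        },
    }


def upload_route(name, namespace, host, service, port, plugin_config_name):
    """Route that opts in to the shared plugin set by name."""
    return {
        "apiVersion": API_VERSION,
        "kind": "ApisixRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "http": [
                {
                    "name": "upload",
                    "match": {"hosts": [host], "paths": ["/upload/*"]},
                    "backends": [{"serviceName": service, "servicePort": port}],
                    "plugin_config_name": plugin_config_name,
                }
            ]
        },
    }


shared = plugin_config("large-upload", "training", max_body_bytes=500 * GIB)
route = upload_route(
    "dataset-upload", "training", "ai.example.com",
    service="upload-svc", port=8080, plugin_config_name="large-upload",
)

for manifest in (shared, route):
    print(json.dumps(manifest, indent=2))
```

Keeping the limit in one ApisixPluginConfig per namespace means that only the routes that reference it accept oversized bodies, while every other route keeps the default limit.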

Service Calls in Multi-cloud Environments

For service calls in multi-cloud environments, part of the business traffic first arrives at the local IDC and then goes through APISIX Ingress to reach the Pod. Some services in the Pod access AliCloud's services through the domain name. There are also scenarios in which one service invokes other services, mainly for multi-cloud training: users take the IDC as the entry point and select a cluster, and the task is submitted to the corresponding cloud cluster.

Multi-cloud Architecture

External Authentication with forward-auth

When we first started using APISIX Ingress, APISIX did not support the forward-auth plugin, so we wrote a custom plugin based on apisix-go-plugin-runner. This added an extra layer of gRPC calls, which made debugging difficult and hid the logs. Since APISIX added the forward-auth plugin at the beginning of this year, we have replaced the custom plugin with the official one, which removes that layer of gRPC calls and makes monitoring more convenient.

Authentication Architecture
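The forward-auth pattern itself is simple to sketch. The simulation below is not the plugin's code: it only models the decision a gateway makes, with the authentication service stubbed out as a hypothetical Python function. The parameter names `request_headers` and `upstream_headers` are chosen to mirror the plugin attributes of the same name, while the token and header values are made up.

```python
def forward_auth(headers, auth_service, request_headers, upstream_headers):
    """Simulate external authentication in front of an upstream.

    Returns ("upstream", headers_to_forward) when the auth service answers
    with a 2xx status, otherwise ("reject", status).
    """
    # Only the selected client headers are sent to the auth service.
    auth_request = {k: v for k, v in headers.items() if k in request_headers}
    status, auth_response_headers = auth_service(auth_request)

    if 200 <= status < 300:
        forwarded = dict(headers)
        # Selected auth response headers are added for the upstream.
        for name in upstream_headers:
            if name in auth_response_headers:
                forwarded[name] = auth_response_headers[name]
        return "upstream", forwarded
    return "reject", status


def fake_auth_service(auth_request):
    """Hypothetical stand-in for the company's authentication service."""
    if auth_request.get("Authorization") == "Bearer valid-token":
        return 200, {"X-User-ID": "u-1001"}
    return 403, {}


allowed = forward_auth(
    {"Authorization": "Bearer valid-token", "Accept": "*/*"},
    fake_auth_service,
    request_headers=["Authorization"],
    upstream_headers=["X-User-ID"],
)
denied = forward_auth(
    {"Authorization": "Bearer expired"},
    fake_auth_service,
    request_headers=["Authorization"],
    upstream_headers=["X-User-ID"],
)
print(allowed)
print(denied)
```

Because the check is a plain HTTP call made by the gateway itself, no plugin-runner process or gRPC hop sits between the request and the authentication service.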

Application Monitoring

For application monitoring, we enabled the APISIX Prometheus plugin globally and tuned it for our own business, adding metrics such as real-time concurrency, QPS, real-time API success rate, and real-time bandwidth for more granular monitoring of APISIX.

Monitoring
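Metrics such as QPS and success rate are usually derived from the difference between two scrapes of a monotonically increasing counter, for example the per-status-code request counter that the Prometheus plugin exports as `apisix_http_status`. The sketch below shows that arithmetic on hypothetical sample values. Counting every non-5xx response as a success is an assumption of this example, not something the article states.

```python
def qps(prev_total, curr_total, interval_seconds):
    """Average requests per second between two counter samples."""
    return (curr_total - prev_total) / interval_seconds


def success_rate(prev_by_code, curr_by_code):
    """Share of non-5xx responses between two per-status-code samples."""
    delta = {
        code: curr_by_code[code] - prev_by_code.get(code, 0)
        for code in curr_by_code
    }
    total = sum(delta.values())
    if total == 0:
        return None  # no traffic in the window
    ok = sum(count for code, count in delta.items() if code < 500)
    return ok / total


# Hypothetical counter values scraped 10 seconds apart.
previous = {200: 1000, 404: 10, 502: 5}
current = {200: 1900, 404: 40, 502: 75}

window_qps = qps(sum(previous.values()), sum(current.values()), 10)
window_success = success_rate(previous, current)
print(window_qps, window_success)  # prints "100.0 0.93"
```

In a real deployment this calculation would normally live in a PromQL expression on the dashboard; the Python form only spells out what the expression computes.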

Summary

We are currently using Apache APISIX Ingress Controller as a traffic gateway for only some of our business lines, and we will bring other businesses online to contribute richer application scenarios to the community. If you are also comparing Ingress Controller solutions, we hope this article gives you some useful pointers. More and more users are running Apache APISIX Ingress in production, and if you are one of them, please share your use cases with the community.

Topics: Ingress Controller, Apache APISIX