Smooth Canary Release Using APISIX Ingress Controller with Flagger

January 19, 2023

Author: Hengliang Tan, Engineering Manager at XPENG

During project development, service updates are often a challenge. To provide the best user experience, we need to minimize the risk of service unavailability. This is how continuous delivery was born: it is now an accepted enterprise software practice and a natural evolution of well-established continuous integration principles. However, continuous deployment is still rare due to the complexity of management and the fear that deployment failures will affect system availability. Canary release is probably the most classic scenario in a continuous delivery system: with it, we can quickly discover unhealthy or problematic services and roll back to the previous version effortlessly.

Canary Release

Canary release is also known as grayscale release. Generally speaking, the new version of the application is deployed as a "canary" to test its performance, while the old version remains in place for normal operation. During the upgrade, some users are directed to the new version while the rest continue to use the old one. On the premise of ensuring the overall system's stability, this enables early detection of bugs and timely adjustment.

A canary release does not roll out the update all at once. Instead, it slowly guides a certain percentage of traffic to a small number of users. If no errors are detected, the new version is promoted to all users and the old version is phased out. This method reduces the risk of introducing new functionality into the production environment.

This article will introduce how to achieve smooth canary release through Apache APISIX Ingress and Flagger, improve release efficiency, and reduce release risks.

About Apache APISIX Ingress

Apache APISIX Ingress is a Kubernetes Ingress Controller implementation that uses Apache APISIX as the data plane proxy. It provides hundreds of features, such as load balancing, dynamic upstream, canary release, fine-grained routing, rate limiting, service degradation, circuit breaking, authentication, and observability. It has been adopted by companies and organizations at home and abroad, including Zoom, Tencent Cloud, Jiakaobaodian, Horizon Robotics, and the European Copernicus Reference System.

About Flagger

Flagger is a CNCF (Cloud Native Computing Foundation) project and part of the Flux family of GitOps tools. Recently, the CNCF also announced the official graduation of Flux, which is a good indicator of the success and promising future of cloud-native technology. As a progressive delivery tool, Flagger automates the release process for applications running on Kubernetes. It reduces the risk of introducing a new software version in production by gradually shifting traffic to the new version while measuring analytics metrics and running conformance tests.

Thanks to the continuous efforts of the Apache APISIX and Flux communities, Flagger recently released v1.27.0, which supports automated canary releases with Apache APISIX Ingress.

[Image: Flagger and Apache APISIX Ingress]

Let's experience this smooth canary release process together.

Environment

A Kubernetes cluster v1.19 or newer is required; you can create one locally with kind.
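For a quick local setup, a cluster can be created with kind; the cluster name and node image below are only illustrative examples:

kind create cluster --name canary-demo --image kindest/node:v1.25.3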

Install Components

Use Helm v3 to install Apache APISIX and the Apache APISIX Ingress Controller:

helm repo add apisix https://charts.apiseven.com
kubectl create ns apisix

helm upgrade -i apisix apisix/apisix --version=0.11.3 \
  --namespace apisix \
  --set apisix.podAnnotations."prometheus\.io/scrape"=true \
  --set apisix.podAnnotations."prometheus\.io/port"=9091 \
  --set apisix.podAnnotations."prometheus\.io/path"=/apisix/prometheus/metrics \
  --set pluginAttrs.prometheus.export_addr.ip=0.0.0.0 \
  --set pluginAttrs.prometheus.export_addr.port=9091 \
  --set pluginAttrs.prometheus.export_uri=/apisix/prometheus/metrics \
  --set pluginAttrs.prometheus.metric_prefix=apisix_ \
  --set ingress-controller.enabled=true \
  --set ingress-controller.config.apisix.serviceNamespace=apisix
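Before moving on, it is worth verifying that the APISIX and ingress controller pods came up successfully, for example:

kubectl -n apisix get pods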

Install the Flagger and Prometheus components in the apisix namespace:

helm repo add flagger https://flagger.app

helm upgrade -i flagger flagger/flagger \
  --namespace apisix \
  --set prometheus.install=true \
  --set meshProvider=apisix
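A quick sanity check that Flagger and its bundled Prometheus are ready (the deployment names below assume the chart defaults used above):

kubectl -n apisix rollout status deployment/flagger
kubectl -n apisix rollout status deployment/flagger-prometheus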

Note: if you need to customize Prometheus or use the Prometheus Operator, refer to related articles and adjust the configuration accordingly.

Application Initialization

Flagger can be applied to Kubernetes Deployments and other workloads, and can also be combined with HPA (Horizontal Pod Autoscaler). It will create a series of objects: Kubernetes Deployments, ClusterIP Services, and an ApisixRoute. These objects expose the application outside the cluster and are used for the analysis during the canary release process.

Create a new test namespace:

kubectl create ns test

Create a new deployment and HPA. Here we use the official sample from Flagger:

kubectl apply -k https://github.com/fluxcd/flagger//kustomize/podinfo?ref=main
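You can confirm that the workload and HPA were created before continuing, for example:

kubectl -n test rollout status deployment/podinfo
kubectl -n test get deploy,hpa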

Deploy Flagger's load-testing service to generate traffic for analysis during the canary release:

helm upgrade -i flagger-loadtester flagger/loadtester \
  --namespace=test

Create an ApisixRoute for Apache APISIX; Flagger will reference this resource and generate the canary-version ApisixRoute for Apache APISIX Ingress. (Replace app.example.com in the example below with your actual domain name.)

apiVersion: apisix.apache.org/v2
kind: ApisixRoute
metadata:
  name: podinfo
  namespace: test
spec:
  http:
    - backends:
        - serviceName: podinfo
          servicePort: 80
      match:
        hosts:
          - app.example.com
        methods:
          - GET
        paths:
          - /*
      name: method
      plugins:
        - name: prometheus
          enable: true
          config:
            disable: false
            prefer_name: true

Save it as podinfo-apisixroute.yaml and submit it to the cluster:

kubectl apply -f ./podinfo-apisixroute.yaml

Create a Flagger custom resource Canary. (Replace app.example.com in the example with your actual domain name)

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  provider: apisix
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # refer to the ApisixRoute created above
  routeRef:
    apiVersion: apisix.apache.org/v2
    kind: ApisixRoute
    name: podinfo
  progressDeadlineSeconds: 60
  service:
    port: 80
    targetPort: 9898
  analysis:
    interval: 10s
    # maximum number of failed checks before rollback
    threshold: 10
    # maximum percentage of traffic routed to the canary version (0-100)
    maxWeight: 50
    # the step size of the canary analysis (0-100)
    stepWeight: 10
    # use Prometheus to check the traffic information of APISIX
    metrics:
      - name: request-success-rate
        # the minimum success rate (non-5xx responses) (0-100)
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        # the maximum P99 request latency (ms)
        thresholdRange:
          max: 500
        interval: 30s
    webhooks:
      # generates traffic for the canary analysis; modify based on your actual scenario
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 5s
        type: rollout
        metadata:
          cmd: |-
            hey -z 1m -q 10 -c 2 -h2 -host app.example.com http://apisix-gateway.apisix/api/info

Save it as podinfo-canary.yaml and submit it to the cluster:

kubectl apply -f ./podinfo-canary.yaml

Flagger will automatically generate the related resources:

# Submitted
deployment.apps/podinfo
horizontalpodautoscaler.autoscaling/podinfo
apisixroute/podinfo
canary.flagger.app/podinfo

# Auto-generated
deployment.apps/podinfo-primary
horizontalpodautoscaler.autoscaling/podinfo-primary
service/podinfo
service/podinfo-canary
service/podinfo-primary
apisixroute/podinfo-podinfo-canary

[Image: application showing version 1]

At this point, you can access the application through the domain name app.example.com and see the current version of the application.
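If your domain does not resolve to the gateway yet, one way to test from inside the cluster is to run curl in the load tester pod and set the Host header manually; this sketch reuses the services installed above:

kubectl -n test exec -it deploy/flagger-loadtester -- \
  curl -sH "Host: app.example.com" http://apisix-gateway.apisix/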

Automation of Canary Release

Flagger implements a control loop that gradually shifts traffic to the canary while measuring key performance indicators such as HTTP request success rate, average request duration, and pod health. Based on the analysis of these metrics, the canary is either promoted or aborted, and the analysis results are published to platforms such as Slack, MS Teams, or Prometheus Alertmanager.

Flagger Control Loop
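As a sketch, Slack notifications can be enabled through Flagger's Helm chart values; the webhook URL below is a placeholder to replace with your own:

helm upgrade -i flagger flagger/flagger \
  --namespace apisix \
  --reuse-values \
  --set slack.url=https://hooks.slack.com/services/YOUR/HOOK/URL \
  --set slack.channel=general \
  --set slack.user=flagger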

Trigger a canary release by updating the container image version:

kubectl -n test set image deployment/podinfo \
  podinfod=stefanprodan/podinfo:6.0.1

Flagger detects the new version of the deployment and starts a canary analysis run:

kubectl -n test describe canary/podinfo

Status:
  Canary Weight:  0
  Conditions:
    Message:  Canary analysis completed successfully, promotion finished.
    Reason:   Succeeded
    Status:   True
    Type:     Promoted
  Failed Checks:  1
  Iterations:     0
  Phase:          Succeeded
Events:
  Type     Reason  Age                    From     Message
  ----     ------  ----                   ----     -------
  Warning  Synced  2m59s                  flagger  podinfo-primary.test not ready: waiting for rollout to finish: observed deployment generation less than desired generation
  Warning  Synced  2m50s                  flagger  podinfo-primary.test not ready: waiting for rollout to finish: 0 of 1 (readyThreshold 100%) updated replicas are available
  Normal   Synced  2m40s (x3 over 2m59s)  flagger  all the metrics providers are available!
  Normal   Synced  2m39s                  flagger  Initialization done! podinfo.test
  Normal   Synced  2m20s                  flagger  New revision detected! Scaling up podinfo.test
  Warning  Synced  2m (x2 over 2m10s)     flagger  canary deployment podinfo.test not ready: waiting for rollout to finish: 0 of 1 (readyThreshold 100%) updated replicas are available
  Normal   Synced  110s                   flagger  Starting canary analysis for podinfo.test
  Normal   Synced  109s                   flagger  Advance podinfo.test canary weight 10
  Warning  Synced  100s                   flagger  Halt advancement no values found for apisix metric request-success-rate probably podinfo.test is not receiving traffic: running query failed: no values found
  Normal   Synced  90s                    flagger  Advance podinfo.test canary weight 20
  Normal   Synced  80s                    flagger  Advance podinfo.test canary weight 30
  Normal   Synced  69s                    flagger  Advance podinfo.test canary weight 40
  Normal   Synced  59s                    flagger  Advance podinfo.test canary weight 50
  Warning  Synced  30s (x2 over 40s)      flagger  podinfo-primary.test not ready: waiting for rollout to finish: 1 old replicas are pending termination
  Normal   Synced  9s (x3 over 50s)       flagger  (combined from similar events): Promotion completed! Scaling down podinfo.test

During the canary release process, you will receive different responses when accessing the application through the domain name app.example.com.

[Image: application showing version 2]

By viewing the ApisixRoute resource podinfo-podinfo-canary that Flagger created automatically, you will find that the weights of the podinfo-primary and podinfo-canary services change as the release progresses.
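For example, the generated route can be inspected with the following command (the resource name matches the one Flagger generated above):

kubectl -n test get apisixroute podinfo-podinfo-canary -o yaml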

spec:
  http:
    - backends:
        - serviceName: podinfo-primary
          servicePort: 80
          # Auto-adjusted by Flagger
          weight: 80
        - serviceName: podinfo-canary
          servicePort: 80
          # Auto-adjusted by Flagger
          weight: 20

You will see the latest stable version when the final release is complete.

[Image: application showing version 3]

Note: Flagger will re-run the canary analysis if you change the deployment again during the canary release.

You can observe all canary releases with this command:

watch kubectl get canaries --all-namespaces

NAMESPACE   NAME        STATUS        WEIGHT   LASTTRANSITIONTIME
test        podinfo-2   Progressing   10       2022-11-23T05:00:54Z
test        podinfo     Succeeded     0        2022-11-23T06:00:54Z

Rollback

During the canary analysis, you can test how Flagger suspends the canary release and rolls back to the old version by generating HTTP 500 errors.

Trigger another canary release:

kubectl -n test set image deployment/podinfo \
  podinfod=stefanprodan/podinfo:6.0.2

Enter the load tester container:

kubectl -n test exec -it deploy/flagger-loadtester bash

Generate HTTP 500 errors:

hey -z 1m -c 5 -q 5 -host app.example.com http://apisix-gateway.apisix/status/500

Simulate server delay:

watch -n 1 curl -H "host: app.example.com" http://apisix-gateway.apisix/delay/1

When the number of detected failures reaches the canary analysis threshold, traffic is automatically routed back to the primary version, the canary is scaled down to zero, and the canary release is marked as failed.

kubectl -n apisix logs deploy/flagger -f | jq .msg

"New revision detected! Scaling up podinfo.test"
"canary deployment podinfo.test not ready: waiting for rollout to finish: 0 of 1 (readyThreshold 100%) updated replicas are available"
"Starting canary analysis for podinfo.test"
"Advance podinfo.test canary weight 10"
"Halt podinfo.test advancement success rate 0.00% < 99%"
"Halt podinfo.test advancement success rate 26.76% < 99%"
"Halt podinfo.test advancement success rate 34.19% < 99%"
"Halt podinfo.test advancement success rate 37.32% < 99%"
"Halt podinfo.test advancement success rate 39.04% < 99%"
"Halt podinfo.test advancement success rate 40.13% < 99%"
"Halt podinfo.test advancement success rate 48.28% < 99%"
"Halt podinfo.test advancement success rate 50.35% < 99%"
"Halt podinfo.test advancement success rate 56.92% < 99%"
"Halt podinfo.test advancement success rate 67.70% < 99%"
"Rolling back podinfo.test failed checks threshold reached 10"
"Canary failed! Scaling down podinfo.test"

Customize Metrics for Canary Analysis

Canary analysis can be extended with custom Prometheus metric queries tailored to your actual business scenario. Create a metric template and submit it to the cluster:

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found-percentage
  namespace: test
spec:
  provider:
    type: prometheus
    address: http://flagger-prometheus.apisix:9090
  query: |
    sum(
      rate(
        apisix_http_status{
          route=~"{{ namespace }}_{{ route }}-{{ target }}-canary_.+",
          code!~"4.."
        }[{{ interval }}]
      )
    )
    /
    sum(
      rate(
        apisix_http_status{
          route=~"{{ namespace }}_{{ route }}-{{ target }}-canary_.+"
        }[{{ interval }}]
      )
    ) * 100

Then modify the analysis section of the Canary resource to reference the metric template created above:

analysis:
  metrics:
    - name: "404s percentage"
      templateRef:
        name: not-found-percentage
      thresholdRange:
        max: 5
      interval: 1m

The configuration above validates the canary by checking whether the rate of HTTP 404 responses stays below 5% of total traffic. If the HTTP 404 rate reaches the 5% threshold, the canary rollout fails.
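To experiment with the PromQL query itself, you can port-forward Flagger's bundled Prometheus (the service name is assumed from the MetricTemplate address above) and run the query in its UI at http://localhost:9090:

kubectl -n apisix port-forward svc/flagger-prometheus 9090:9090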

Summary

The process above can be extended with more custom metric checks, webhooks, manual approvals, and Slack or MS Teams notifications.
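For instance, a manual approval gate can be added with Flagger's confirm-rollout webhook type; the sketch below follows Flagger's manual gating feature and uses the load tester's gate endpoints:

analysis:
  webhooks:
    # the analysis pauses here until the gate is opened
    - name: "ask for confirmation"
      type: confirm-rollout
      url: http://flagger-loadtester.test/gate/check

The gate starts closed; the analysis waits at this step until the gate is opened, for example via the load tester's /gate/open endpoint.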

A very smooth canary release is achieved through the integration of Apache APISIX and Flagger, improving release efficiency and reducing release risk. In the future, the two communities will cooperate more closely to deliver more release strategies, such as Blue/Green mirroring and A/B testing.
