Smooth Canary Release Using APISIX Ingress Controller with Flagger

Author: Hengliang Tan, Engineering Manager at XPENG

Service updates are a recurring challenge in software development. To provide the best user experience, we need to minimize the risk of service unavailability. Continuous delivery emerged in response: it is now an accepted enterprise software practice and a natural evolution of well-established continuous integration principles. Continuous deployment, however, is still rare because of its management complexity and the fear that a failed deployment will affect system availability. Canary release is probably the most classic scenario in a continuous delivery system. With it, we can quickly discover unhealthy or problematic services and roll back to the previous version with little effort.

Canary Release

Canary release is also known as grayscale release. Generally speaking, the new version of the application is released and deployed as a "canary" to test its behavior, while the old version keeps serving normal traffic at the same time. During the upgrade, some users are directed to the new version and the rest continue to use the old one. This enables early detection of bugs and timely adjustment while keeping the overall system stable.

A canary release does not roll out the update to everyone at once. It gradually shifts a certain percentage of traffic to the new version. If no errors are detected, the new version is promoted to all users and the old version is phased out. This reduces the risk of introducing new features into the production environment.

This article will introduce how to achieve smooth canary release through Apache APISIX Ingress and Flagger, improve release efficiency, and reduce release risks.

About Apache APISIX Ingress

Apache APISIX Ingress is a Kubernetes Ingress Controller implementation that uses Apache APISIX as the data plane proxy. It provides hundreds of functions, such as load balancing, dynamic upstream, canary release, fine-grained routing, rate limiting, service degradation, circuit breaking, authentication, and observability. It has been adopted by companies and organizations around the world, including Zoom, Tencent Cloud, Jiakaobaodian, Horizon Robotics, European Copernicus Reference System, etc.

About Flagger

Flagger is a CNCF (Cloud Native Computing Foundation) project and part of the Flux family of GitOps tools. Recently, the CNCF also announced the official graduation of Flux, which is a good indicator of the success and promising future of cloud-native technology. As a progressive delivery tool, Flagger automates the release process for applications running on Kubernetes. It reduces the risk of introducing a new software version in production by gradually shifting traffic to the new version while measuring analytics metrics and running conformance tests.

Thanks to the continuous efforts of the Apache APISIX and Flux communities, Flagger recently released v1.27.0, which supports automated canary releases with Apache APISIX Ingress.

[Image: Flagger and Apache APISIX Ingress]

Let's experience this smooth canary release process together.

Environment

You need a Kubernetes cluster running v1.19 or newer. You can create one locally with kind.

Install Components

Use Helm v3 to install Apache APISIX and the Apache APISIX Ingress Controller:

helm repo add apisix https://charts.apiseven.com
kubectl create ns apisix


helm upgrade -i apisix apisix/apisix --version=0.11.3 \
--namespace apisix \
--set apisix.podAnnotations."prometheus\.io/scrape"=true \
--set apisix.podAnnotations."prometheus\.io/port"=9091 \
--set apisix.podAnnotations."prometheus\.io/path"=/apisix/prometheus/metrics \
--set pluginAttrs.prometheus.export_addr.ip=0.0.0.0 \
--set pluginAttrs.prometheus.export_addr.port=9091 \
--set pluginAttrs.prometheus.export_uri=/apisix/prometheus/metrics \
--set pluginAttrs.prometheus.metric_prefix=apisix_ \
--set ingress-controller.enabled=true \
--set ingress-controller.config.apisix.serviceNamespace=apisix

Install the Flagger and Prometheus components in the apisix namespace.

helm repo add flagger https://flagger.app


helm upgrade -i flagger flagger/flagger \
--namespace apisix \
--set prometheus.install=true \
--set meshProvider=apisix

Note: if you need to customize Prometheus or use Prometheus Operator, refer to the relevant documentation for the required changes.

Application Initialization

Flagger can be applied to Kubernetes Deployments and other workloads, and can also be combined with HPA. It creates a series of objects: Kubernetes Deployments, ClusterIP Services, and an ApisixRoute. These objects expose the application outside the cluster and are used for analysis during the canary release.

Create a new test namespace:

kubectl create ns test

Create a new deployment and HPA. Here we use the official sample from Flagger:

kubectl apply -k https://github.com/fluxcd/flagger//kustomize/podinfo?ref=main

Deploy Flagger's load testing service to generate traffic during canary release for analysis.

helm upgrade -i flagger-loadtester flagger/loadtester \
--namespace=test

Create an ApisixRoute for the application. Flagger will reference this resource and generate the canary version of the ApisixRoute from it. (Replace app.example.com in the example below with your actual domain name)

apiVersion: apisix.apache.org/v2
kind: ApisixRoute
metadata:
  name: podinfo
  namespace: test
spec:
  http:
    - backends:
        - serviceName: podinfo
          servicePort: 80
      match:
        hosts:
          - app.example.com
        methods:
          - GET
        paths:
          - /*
      name: method
      plugins:
        - name: prometheus
          enable: true
          config:
            disable: false
            prefer_name: true

Save it as podinfo-apisixroute.yaml and submit it to the cluster:

kubectl apply -f ./podinfo-apisixroute.yaml

Create a Flagger custom resource Canary. (Replace app.example.com in the example with your actual domain name)

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  provider: apisix
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # reference to the ApisixRoute created above
  routeRef:
    apiVersion: apisix.apache.org/v2
    kind: ApisixRoute
    name: podinfo
  progressDeadlineSeconds: 60
  service:
    port: 80
    targetPort: 9898
  analysis:
    interval: 10s
    # number of failed checks before the canary is rolled back
    threshold: 10
    # maximum percentage of traffic to the canary version
    # (0-100)
    maxWeight: 50
    # the step size of the canary analysis
    # (0-100)
    stepWeight: 10
    # use Prometheus to check the traffic information of APISIX
    metrics:
      - name: request-success-rate
        # the minimum success rate (non-5xx responses)
        # (0-100)
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        # maximum P99 request duration (ms)
        thresholdRange:
          max: 500
        interval: 30s
    webhooks:
      # automated traffic for canary analysis; adjust it to your actual scenario
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 5s
        type: rollout
        metadata:
          cmd: |-
            hey -z 1m -q 10 -c 2 -h2 -host app.example.com http://apisix-gateway.apisix/api/info
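
The analysis settings above determine how the rollout unfolds over time. The following Python sketch is only an illustration, not Flagger's code. It assumes the timing rules given in the Flagger documentation (minimum promotion time is interval * maxWeight / stepWeight, and rollback time under constant failures is interval * threshold). The function names are hypothetical.

```python
def weight_schedule(step_weight, max_weight):
    """Canary weights that are stepped through when no check fails."""
    return list(range(step_weight, max_weight + 1, step_weight))


def min_promotion_seconds(interval_s, step_weight, max_weight):
    """Lower bound on analysis time: one interval per weight step."""
    return interval_s * (max_weight // step_weight)


def rollback_seconds(interval_s, threshold):
    """Time until rollback when every check fails."""
    return interval_s * threshold


# Values taken from the Canary resource above:
# interval: 10s, stepWeight: 10, maxWeight: 50, threshold: 10
print(weight_schedule(10, 50))            # [10, 20, 30, 40, 50]
print(min_promotion_seconds(10, 10, 50))  # 50
print(rollback_seconds(10, 10))           # 100
```

With these settings, a healthy canary needs at least about 50 seconds of analysis, and a consistently failing one is rolled back after about 100 seconds.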

Save it as podinfo-canary.yaml and submit it to the cluster:

kubectl apply -f ./podinfo-canary.yaml

Flagger will automatically generate related resources:

# Submitted
deployment.apps/podinfo
horizontalpodautoscaler.autoscaling/podinfo
apisixroute/podinfo
canary.flagger.app/podinfo

# Auto-generated
deployment.apps/podinfo-primary
horizontalpodautoscaler.autoscaling/podinfo-primary
service/podinfo
service/podinfo-canary
service/podinfo-primary
apisixroute/podinfo-podinfo-canary

[Image: version 1]

At this point, you can access the application through the domain name app.example.com (Replace app.example.com in the example with your actual domain name), and you will see the current version of the application.

Automation of Canary Release

Flagger implements a control loop that gradually shifts traffic to the canary while measuring key performance metrics such as HTTP request success rate, average request duration, and pod health. Based on the analysis of these metrics, Flagger promotes or aborts the canary deployment and publishes the analysis results to platforms such as Slack, MS Teams, or Prometheus Alertmanager.

Flagger Control Loop
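
To make the control loop more concrete, here is a minimal Python simulation. It is a simplified sketch under stated assumptions, not Flagger's implementation: each loop iteration is reduced to a single pass/fail result, the function name run_canary_analysis is hypothetical, and the default parameters mirror the stepWeight, maxWeight, and threshold values used in this article.

```python
def run_canary_analysis(check_results, step_weight=10, max_weight=50, threshold=10):
    """Simulate the weight-shifting loop for a sequence of check results.

    check_results: iterable of booleans, True when all metric checks passed
    in that interval. Returns (phase, failed_checks, weights_seen).
    """
    weight = step_weight        # analysis starts by sending one step to the canary
    failed = 0
    weights_seen = [weight]
    for ok in check_results:
        if not ok:
            failed += 1
            if failed >= threshold:
                return "Failed", failed, weights_seen  # roll back
            continue            # halt advancement and keep the current weight
        if weight >= max_weight:
            return "Succeeded", failed, weights_seen   # promote the canary
        weight = min(weight + step_weight, max_weight)
        weights_seen.append(weight)
    return "Progressing", failed, weights_seen


# One failed check, then steady advancement up to maxWeight.
print(run_canary_analysis([False, True, True, True, True, True]))
# ('Succeeded', 1, [10, 20, 30, 40, 50])

# Ten failed checks in a row reach the threshold and trigger a rollback.
print(run_canary_analysis([False] * 10))
# ('Failed', 10, [10])
```

The sketch leaves out webhooks, notifications, and the actual metric queries. It only shows how failed checks halt advancement and eventually trigger a rollback.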

Trigger a canary release by updating the container image version:

kubectl -n test set image deployment/podinfo \
podinfod=stefanprodan/podinfo:6.0.1

Flagger detects that the deployment has a new version and starts a canary analysis run.

kubectl -n test describe canary/podinfo

Status:
  Canary Weight:  0
  Conditions:
    Message:               Canary analysis completed successfully, promotion finished.
    Reason:                Succeeded
    Status:                True
    Type:                  Promoted
  Failed Checks:           1
  Iterations:              0
  Phase:                   Succeeded

Events:
  Type     Reason  Age                    From     Message
  ----     ------  ----                   ----     -------
  Warning  Synced  2m59s                  flagger  podinfo-primary.test not ready: waiting for rollout to finish: observed deployment generation less than desired generation
  Warning  Synced  2m50s                  flagger  podinfo-primary.test not ready: waiting for rollout to finish: 0 of 1 (readyThreshold 100%) updated replicas are available
  Normal   Synced  2m40s (x3 over 2m59s)  flagger  all the metrics providers are available!
  Normal   Synced  2m39s                  flagger  Initialization done! podinfo.test
  Normal   Synced  2m20s                  flagger  New revision detected! Scaling up podinfo.test
  Warning  Synced  2m (x2 over 2m10s)     flagger  canary deployment podinfo.test not ready: waiting for rollout to finish: 0 of 1 (readyThreshold 100%) updated replicas are available
  Normal   Synced  110s                   flagger  Starting canary analysis for podinfo.test
  Normal   Synced  109s                   flagger  Advance podinfo.test canary weight 10
  Warning  Synced  100s                   flagger  Halt advancement no values found for apisix metric request-success-rate probably podinfo.test is not receiving traffic: running query failed: no values found
  Normal   Synced  90s                    flagger  Advance podinfo.test canary weight 20
  Normal   Synced  80s                    flagger  Advance podinfo.test canary weight 30
  Normal   Synced  69s                    flagger  Advance podinfo.test canary weight 40
  Normal   Synced  59s                    flagger  Advance podinfo.test canary weight 50
  Warning  Synced  30s (x2 over 40s)      flagger  podinfo-primary.test not ready: waiting for rollout to finish: 1 old replicas are pending termination
  Normal   Synced  9s (x3 over 50s)       flagger  (combined from similar events): Promotion completed! Scaling down podinfo.test

During the canary release process, you will receive different responses when you access the application through the domain name app.example.com (Replace app.example.com with your actual domain name).

[Image: version 2]

If you inspect the ApisixRoute resource podinfo-podinfo-canary that Flagger created automatically, you will see the weights of the podinfo-primary and podinfo-canary services change as the release progresses.

spec:
  http:
    - backends:
        - serviceName: podinfo-primary
          servicePort: 80
          # Auto-adjusted by Flagger
          weight: 80
        - serviceName: podinfo-canary
          servicePort: 80
          # Auto-adjusted by Flagger
          weight: 20
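
The effect of these weights can be illustrated with a small Python simulation. This is a sketch that assumes independent random selection per request. APISIX's actual balancing algorithm may differ, and the function name simulate_split is hypothetical.

```python
import random
from collections import Counter


def simulate_split(weights, requests, seed=0):
    """Count how many of `requests` land on each backend for the given weights."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    names = list(weights)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=requests)
    return Counter(picks)


# Weights taken from the ApisixRoute snippet above.
counts = simulate_split({"podinfo-primary": 80, "podinfo-canary": 20}, 10_000)
canary_share = counts["podinfo-canary"] / 10_000
print(f"canary share: {canary_share:.1%}")  # close to 20%
```

Over many requests, roughly one in five reaches the canary at this stage, which is why responses from both versions show up during the release.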

You will see the latest stable version when the final release is complete.

[Image: version 3]

Note: Flagger will re-run the canary analysis if you change the deployment again during the canary release.

You can observe all canary releases with this command:

watch kubectl get canaries --all-namespaces

NAMESPACE   NAME        STATUS        WEIGHT   LASTTRANSITIONTIME
test        podinfo-2   Progressing   10       2022-11-23T05:00:54Z
test        podinfo     Succeeded     0        2022-11-23T06:00:54Z

Rollback

During canary analysis, you can test that Flagger suspends the canary release and rolls back to the old version by generating HTTP 500 errors.

Trigger another canary release:

kubectl -n test set image deployment/podinfo \
podinfod=stefanprodan/podinfo:6.0.2

Enter the load tester container:

kubectl -n test exec -it deploy/flagger-loadtester -- bash

Generate HTTP 500 error:

hey -z 1m -c 5 -q 5 -host app.example.com http://apisix-gateway.apisix/status/500

Simulate server delay:

watch -n 1 curl -H \"host: app.example.com\" http://apisix-gateway.apisix/delay/1

When the number of failed checks reaches the canary analysis threshold, traffic is automatically routed back to the primary, the canary is scaled down to zero, and the canary release is marked as failed.

kubectl -n apisix logs deploy/flagger -f | jq .msg

"New revision detected! Scaling up podinfo.test"
"canary deployment podinfo.test not ready: waiting for rollout to finish: 0 of 1 (readyThreshold 100%) updated replicas are available"
"Starting canary analysis for podinfo.test"
"Advance podinfo.test canary weight 10"
"Halt podinfo.test advancement success rate 0.00% < 99%"
"Halt podinfo.test advancement success rate 26.76% < 99%"
"Halt podinfo.test advancement success rate 34.19% < 99%"
"Halt podinfo.test advancement success rate 37.32% < 99%"
"Halt podinfo.test advancement success rate 39.04% < 99%"
"Halt podinfo.test advancement success rate 40.13% < 99%"
"Halt podinfo.test advancement success rate 48.28% < 99%"
"Halt podinfo.test advancement success rate 50.35% < 99%"
"Halt podinfo.test advancement success rate 56.92% < 99%"
"Halt podinfo.test advancement success rate 67.70% < 99%"
"Rolling back podinfo.test failed checks threshold reached 10"
"Canary failed! Scaling down podinfo.test"

Customize Metrics for Canary Analysis

Canary analysis can be extended by querying Prometheus metrics, so you can tailor the checks to your actual business scenarios. Create a metric template and submit it to the cluster.

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found-percentage
  namespace: test
spec:
  provider:
    type: prometheus
    address: http://flagger-prometheus.apisix:9090
  query: |
    sum(
      rate(
        apisix_http_status{
          route=~"{{ namespace }}_{{ route }}-{{ target }}-canary_.+",
          code="404"
        }[{{ interval }}]
      )
    )
    /
    sum(
      rate(
        apisix_http_status{
          route=~"{{ namespace }}_{{ route }}-{{ target }}-canary_.+"
        }[{{ interval }}]
      )
    ) * 100

Modify the analysis section of the Canary resource and add the metric template created above:

  analysis:
    metrics:
      - name: "404s percentage"
        templateRef:
          name: not-found-percentage
        thresholdRange:
          max: 5
        interval: 1m

The configuration above validates the canary by checking if the QPS (Queries per second) of HTTP 404 requests is higher than 5% of the total traffic. The canary rollout fails if the HTTP 404 requests exceed the 5% threshold.
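
The check described above can be sketched in Python as follows. This is an illustration only: the status counts are made-up sample data, the function names are hypothetical, and treating "no traffic" as a failed check is an assumption of this sketch.

```python
def not_found_percentage(status_counts):
    """Share of HTTP 404 responses, in percent, or None when there was no traffic."""
    total = sum(status_counts.values())
    if total == 0:
        return None
    return 100.0 * status_counts.get("404", 0) / total


def check_passes(value, max_allowed=5.0):
    """The check passes only when a value exists and does not exceed the maximum."""
    return value is not None and value <= max_allowed


healthy = {"200": 97, "404": 3}     # hypothetical per-interval counts
unhealthy = {"200": 90, "404": 10}

print(not_found_percentage(healthy), check_passes(not_found_percentage(healthy)))
# 3.0 True
print(not_found_percentage(unhealthy), check_passes(not_found_percentage(unhealthy)))
# 10.0 False
```

In the real setup the ratio comes from Prometheus rate() values rather than raw counts, but the threshold comparison is the same idea.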

Summary

The process above can be extended with more custom metric checks, webhooks, manual approvals, and Slack or MS Teams notifications.

A very smooth canary release is achieved through the integration of Apache APISIX and Flagger, which improves release efficiency and reduces release risks. In the future, the two communities will cooperate more closely to realize more publishing capabilities such as Blue/Green Mirroring and A/B Testing.