Keep Your APIs Healthy with APISIX and Prometheus

API health checks are part of a proactive approach to monitoring your APIs. They keep you informed about overall API health and help you identify problems at an early stage. In this article, we will explore how APISIX and Prometheus work together to collect and analyze health check metrics, making it easier to monitor, diagnose, and address API-related issues.

Why does this matter to businesses?

Establishing Service Level Indicators (SLIs) and Service Level Objectives (SLOs) has become a crucial component of site reliability engineering (SRE) best practices. They help a team set clear goals for how well a service (such as a website or an app) should work, whether it is an internal service (like an API used by the company's own apps) or a public product used by customers, and they give teams a quantifiable way to manage system performance. Common SLIs include error rate, latency, throughput, and availability. An SLO is a target set for an SLI, for example: "99.9% of API requests should complete in under 300ms."
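To make the latency SLO quoted above concrete, here is a minimal Python sketch that computes the SLI (the fraction of requests faster than a threshold) and compares it with the target. The function name, the `SloReport` type, and the sample latencies are hypothetical and chosen only for illustration; in practice the numbers would come from your metrics backend.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    total: int             # number of requests evaluated
    within_threshold: int  # requests faster than the threshold
    sli: float             # fraction of requests that met the threshold
    slo_met: bool          # True when the SLI reaches the target


def evaluate_latency_slo(latencies_ms, threshold_ms=300.0, target=0.999):
    """Compute a latency SLI and compare it with the SLO target."""
    total = len(latencies_ms)
    if total == 0:
        raise ValueError("need at least one request to evaluate an SLO")
    # "Under 300ms" is read as a strict comparison.
    within = sum(1 for latency in latencies_ms if latency < threshold_ms)
    sli = within / total
    return SloReport(total, within, sli, sli >= target)


# Hypothetical sample: 1,000 requests, 2 of them slower than 300 ms.
sample = [120.0] * 998 + [450.0, 900.0]
report = evaluate_latency_slo(sample)
print(f"SLI = {report.sli:.4f}, SLO met: {report.slo_met}")
# prints "SLI = 0.9980, SLO met: False"
```

With two slow requests out of 1,000, the SLI is 99.8%, which falls just short of the 99.9% objective.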

APISIX API Gateway sits at the front of your API infrastructure and can be instrumental in measuring SLIs and SLOs. Deciding what to measure and how to measure it can be difficult, especially in today's complex, distributed architectures. APISIX tracks key metrics such as latency, unsuccessful requests, and throughput for the upstream services behind your APIs. It can also perform health checks on those backend services to confirm they are available to process requests, so that responsible teams can be alerted to potential issues before they escalate, which minimizes downtime and improves system reliability.

How does an API gateway health check work?

Generally, activating health checks for APIs is a straightforward process. Each service only requires a designated health check API endpoint (for example, /health). From there, you inspect the most relevant metrics for that service, such as memory usage, database connectivity, and response duration. You can use observability platforms like Prometheus and Grafana to display the results and an alerting system to flag any issues immediately.

One of the benefits of APISIX is that it makes configuring observability tools easier across multiple services. APISIX periodically sends requests to the backend services it manages (also known as upstream nodes). If a node returns a healthy status (typically a 200 OK HTTP status code), it is considered healthy. The gateway might also evaluate the response time, treating a slow response as an indication of potential issues. If a node fails to respond within a specified timeframe, or returns an error status, it is marked as unhealthy. APISIX then stops routing traffic to that node to prevent application errors or slowdowns and routes the traffic to a healthy node instead. The APISIX health check documentation explains how to enable this.
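The behavior described above can be modeled in a few lines of Python. This is a simplified sketch and not APISIX's actual implementation: the `NodeHealth` class, the `pick_healthy` helper, and the threshold values (2 consecutive successes, 3 consecutive failures) are assumptions made for illustration.

```python
class NodeHealth:
    """Simplified model of how a gateway might track one upstream node."""

    def __init__(self, healthy_successes=2, unhealthy_failures=3,
                 healthy_statuses=(200,)):
        # Hypothetical thresholds: consecutive probe results needed to flip state.
        self.healthy_successes = healthy_successes
        self.unhealthy_failures = unhealthy_failures
        self.healthy_statuses = healthy_statuses
        self.healthy = True
        self._successes = 0
        self._failures = 0

    def record_probe(self, status_code):
        """Record one probe result; pass None to represent a timeout."""
        if status_code in self.healthy_statuses:
            self._successes += 1
            self._failures = 0
            if not self.healthy and self._successes >= self.healthy_successes:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self.healthy and self._failures >= self.unhealthy_failures:
                self.healthy = False
        return self.healthy


def pick_healthy(nodes):
    """Return the addresses that should still receive traffic."""
    return [addr for addr, state in nodes.items() if state.healthy]


nodes = {"172.27.0.5:80": NodeHealth(), "172.27.0.7:80": NodeHealth()}
for status in (500, 500, 500):  # three failed probes in a row
    nodes["172.27.0.5:80"].record_probe(status)
print(pick_healthy(nodes))  # prints ['172.27.0.7:80']
```

Requiring several consecutive results before changing state keeps a single slow or failed probe from removing a node from rotation.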

Collecting health check data with the APISIX Prometheus plugin

APISIX integrates with Prometheus through a plugin called prometheus, offering an efficient way to pull API metrics, including those related to the health status of upstream nodes (multiple instances of a backend API service). Here's how it works:

When the APISIX Prometheus plugin is activated (see the plugin documentation for how to enable it), it exposes a metrics URL, typically /apisix/prometheus/metrics. You can also customize the export URI, extra labels, scrape frequency, and other parameters in the conf/config.yaml file.

plugin_attr:
  prometheus:
    export_uri: /metrics

Prometheus scrapes this URL at specific intervals, collecting time-series data associated with various performance parameters like request count, request latency, upstream latency, and status codes.
With the Prometheus custom metrics functionality released in APISIX 3.3.0, you can now expose more granular metrics data for your APIs. The health check mechanism allows APISIX to periodically check whether upstream nodes are healthy and adjust routing accordingly, which helps prevent failures and improves the system's reliability, something critical for any API-based infrastructure. The results of these health checks are included in the metrics that the Prometheus plugin exposes, providing a comprehensive, real-time view of your APIs' performance. For example, if you send a simple request to the APISIX Gateway /metrics endpoint, you can observe the collected monitoring data and the health check status of upstream nodes.

curl http://127.0.0.1:9091/metrics

...
# HELP apisix_upstream_status Upstream status from health check
# TYPE apisix_upstream_status gauge
apisix_upstream_status{name="/apisix/upstreams/1",ip="172.27.0.5",port="443"} 0
apisix_upstream_status{name="/apisix/upstreams/1",ip="172.27.0.5",port="80"} 1
apisix_upstream_status{name="/apisix/upstreams/1",ip="172.27.0.7",port="443"} 0
apisix_upstream_status{name="/apisix/upstreams/1",ip="172.27.0.7",port="80"} 1

A value of 1 represents healthy and 0 means the upstream node is unhealthy.
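If you want to act on this output programmatically, the following Python sketch extracts the unhealthy nodes from the metrics text. It is an illustration under stated assumptions: the `unhealthy_nodes` function is hypothetical, the sample text is copied from the output above, and the regular expressions handle only simple label values (no escaped quotes or braces). Fetching the text over HTTP is left out.

```python
import re

# Sample copied from the /metrics output shown above.
SAMPLE = """\
# HELP apisix_upstream_status Upstream status from health check
# TYPE apisix_upstream_status gauge
apisix_upstream_status{name="/apisix/upstreams/1",ip="172.27.0.5",port="443"} 0
apisix_upstream_status{name="/apisix/upstreams/1",ip="172.27.0.5",port="80"} 1
apisix_upstream_status{name="/apisix/upstreams/1",ip="172.27.0.7",port="443"} 0
apisix_upstream_status{name="/apisix/upstreams/1",ip="172.27.0.7",port="80"} 1
"""

# Matches one sample line: metric name, {labels}, then the value.
LINE = re.compile(r'^apisix_upstream_status\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
# Matches a single key="value" label pair.
LABEL = re.compile(r'(\w+)="([^"]*)"')


def unhealthy_nodes(metrics_text):
    """Return (upstream, ip, port) for every node whose status gauge is 0."""
    result = []
    for line in metrics_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # comments, other metrics, blank lines
        labels = dict(LABEL.findall(match.group("labels")))
        if float(match.group("value")) == 0:
            result.append((labels.get("name"), labels.get("ip"), labels.get("port")))
    return result


for name, ip, port in unhealthy_nodes(SAMPLE):
    print(f"{name}: {ip}:{port} is unhealthy")
# prints:
# /apisix/upstreams/1: 172.27.0.5:443 is unhealthy
# /apisix/upstreams/1: 172.27.0.7:443 is unhealthy
```

For production use, a Prometheus query or alert rule on the same gauge is usually a better fit than parsing the text by hand.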

You can also see the output of the health check statuses of upstream nodes on the Prometheus dashboard:

[Figure: APISIX Prometheus plugin on dashboard]

The metrics exposed by the APISIX Prometheus plugin can also be visualized in Grafana.

Just as importantly, you can also enable the prometheus plugin to collect metrics for TCP/UDP traffic. Observability at the transport layer provides insight into how data is transmitted between services in your infrastructure and can be pivotal in diagnosing issues and optimizing performance.

Customizing the Prometheus plugin

In APISIX, the Prometheus plugin exposes several metrics by default. These metrics are configurable, and the plugin can be extended with additional metrics to meet specific requirements. The API7.ai team is always on hand to answer questions about API health checking and monitoring, and our engineers actively help new APISIX users onboard and adapt the default APISIX configuration to their needs.

Real-world use case: Fast-Food Giant Improves Server Health Monitoring with APISIX and Prometheus Integration

Consider a leading global fast-food chain with thousands of branches worldwide (henceforth referred to as "Company X") that was keen on achieving an active-active server configuration. Its goal was to ensure that all servers or data centers could share the workload in real time without causing service disruptions.

The company's technology team had automated the switching process between servers or data centers. However, there were occasions when business traffic varied between the active servers, and the load was unevenly distributed. Some servers were overloaded, and others received less traffic, leading to operational inefficiencies. During peak times, this led to server crashes and service disruptions, affecting the company's digital operations.

APISIX allowed the company to continually monitor the health of its upstream servers and data centers and automatically switch traffic based on server health status. If a server was considered unhealthy, the system could automatically switch to another healthy server to maintain uninterrupted service. In specific scenarios where the traffic was unusually small or too large for a server to handle, Prometheus's alerting mechanism triggered alarms. This integration enabled Company X's operations team to proactively monitor server health statuses, traffic loads, and other critical metrics.

Wrap up

To sum up, integrating APISIX and Prometheus to collect health check metrics can significantly improve your metrics ecosystem, giving you a deeper understanding of your APIs' health status. This can ultimately lead to better business outcomes, such as improved operational efficiency, higher customer satisfaction, and increased revenue. So, if you're looking to level up your metrics ecosystem, consider leveraging the strength of APISIX and Prometheus.