1s to 10ms: Reducing Prometheus Delay in API Gateway
ZhengSong Tu
January 31, 2023
Requests With Delays of 1 to 2 Seconds
In the APISIX community, users have reported a mysterious phenomenon: when traffic requests enter a normally deployed APISIX cluster, a small number of requests have an exceptionally long delay of 1 to 2 seconds. The user's QPS scale is about 10,000, and this kind of abnormal request is rare, appearing only 1 to 3 times every few minutes.
Users provided screenshots of high-latency requests in the issues they filed. The screenshots show that some requests indeed had high latency, even reaching the second level.
Another noticeable phenomenon accompanies this: the CPU usage of one of the worker processes reaches 100%.
The development team identified the conditions under which this phenomenon occurs:
- The Prometheus plugin is enabled, and a Prometheus Exporter accesses the APISIX endpoint /apisix/prometheus/metrics to collect metrics.
- The number of metrics the Prometheus plugin counts reaches a certain scale, usually tens of thousands.
This phenomenon is called "long tail requests" in the industry. It refers to the situation where, within a group of requests, a small percentage have an abnormally long response time while most respond normally. It could be due to a performance bottleneck in the backend system, insufficient resources, or other reasons. It's not a fatal bug, but it seriously affects the end-user experience.
Breaking It Down
APISIX's Prometheus plugin is built on the open-source Lua library nginx-lua-prometheus and provides the functionality of tracking and collecting metrics. When the Prometheus Exporter accesses the metrics endpoint exposed by APISIX, APISIX calls a function provided by nginx-lua-prometheus to compute and expose the metric results.
The development team narrowed the problem down to prometheus:metric_data(), the function in nginx-lua-prometheus used to expose metrics.
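For orientation, a minimal use of nginx-lua-prometheus looks roughly like the following. This is a simplified sketch rather than APISIX's actual plugin code; the metrics endpoint handler is where the expensive work happens:

-- init_worker_by_lua_block: create a registry backed by a shared dict
local prometheus = require("prometheus").init("prometheus_metrics")
local http_requests = prometheus:counter(
    "http_requests_total", "Number of HTTP requests", {"host", "status"})

-- log_by_lua_block: record metrics for each request
http_requests:inc(1, {ngx.var.host, ngx.var.status})

-- content_by_lua_block for the metrics endpoint:
-- collect() formats everything stored in shared memory (via metric_data())
-- and prints it in the Prometheus text format.
prometheus:collect()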
However, this was only a preliminary inference. Direct evidence was needed to prove that the long tail requests were related to it, and the following questions had to be clarified:
- What exactly does this function do?
- Why does this function cause long tail requests?
The development team constructed a local environment to reproduce the problem, which mainly simulates the following scenarios:
- A simulated client sends normal requests, which APISIX proxies to the upstream.
- A simulated Prometheus Exporter visits /apisix/prometheus/metrics every 5 seconds, triggering APISIX to call the prometheus:metric_data() function.
Reproduced Environment:
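As a rough sketch, the reproduction can be driven by a scrape loop plus a constant-rate load generator. The addresses, route, and rate below are assumptions, not the team's exact configuration; point the scrape loop at wherever your APISIX exposes /apisix/prometheus/metrics:

# Simulated Prometheus Exporter: scrape the metrics endpoint every 5 seconds
while true; do
  curl -s http://127.0.0.1:9091/apisix/prometheus/metrics > /dev/null
  sleep 5
done

# Simulated client: constant-rate load through the APISIX proxy
# (wrk2 builds a binary named wrk; -R fixes the request rate and
#  --latency prints the HdrHistogram latency distribution shown below)
wrk -t4 -c100 -d120s -R2000 --latency http://127.0.0.1:9080/hello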
While performing the test, we observed P100 and other percentiles in the wrk2 results to confirm whether the long tail request phenomenon occurred, and generated a flame graph of APISIX to observe CPU usage when it did.
Test results from wrk2:
Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
50.000% 1.13ms
75.000% 2.56ms
90.000% 4.82ms
99.000% 14.70ms
99.900% 27.95ms
99.990% 74.75ms
99.999% 102.78ms
100.000% 102.78ms
According to this result, 99% of the requests completed within 14.70 milliseconds during the test, while a small number of requests took more than 100 milliseconds. We then ran multiple tests using the number of metrics as a variable and found that P100 latency grows linearly with the number of metrics: when the metrics reach the 100,000 level, P100 reaches the second level.
The resulting flame graph:
As seen from the function stack in the flame graph, prometheus:metric_data() occupies the widest span on the horizontal axis, which indicates heavy CPU consumption and points directly to prometheus:metric_data() as the cause of the long tail requests.
Let's briefly analyze what the prometheus:metric_data() function does.
prometheus:metric_data() will fetch metrics from shared memory, categorize them, and process them into a Prometheus-compatible text format. In this process, all metrics will be sorted in lexicographical order, and the prefixes of metrics will be processed with regular expressions. As a rule of thumb, these are very expensive operations.
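For intuition, here is a heavily simplified sketch of that flow. It is not the actual nginx-lua-prometheus code, which also handles counters, gauges, and histograms separately, escapes labels, and emits proper HELP/TYPE metadata:

-- Illustrative only: roughly what prometheus:metric_data() has to do.
local function metric_data(dict)
  -- Fetch every metric key from the shared dict (tens of thousands at scale).
  local keys = dict:get_keys(0)
  -- Sort all keys lexicographically so output is grouped per metric family.
  table.sort(keys)

  local output = {}
  local seen = {}
  for _, key in ipairs(keys) do
    local value = dict:get(key)
    -- Derive the metric name from the key prefix with a pattern match
    -- and emit the family header once.
    local name = key:match("^([^{]+)")
    if name and not seen[name] then
      seen[name] = true
      output[#output + 1] = "# TYPE " .. name .. " counter\n"
    end
    output[#output + 1] = string.format("%s %s\n", key, value)
  end
  return output
end

Both the full sort and the per-key string processing scale with the number of metrics, which matches the linear growth of P100 latency observed above.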
An Imperfect Optimization Attempt
Once we had located the code causing the delay, the next step was to analyze it in detail alongside the flame graph and find room for optimization.
From the flame graph, we located the fix_histogram_bucket_labels function. Reviewing it, we found two sensitive calls: string:match and string:gsub. Neither of these functions can be JIT-compiled by LuaJIT; they can only be interpreted.
LuaJIT is a JIT compiler for the Lua programming language, which can compile Lua code into machine code and run it. This can provide higher performance than using an interpreter to run Lua code.
One of the advantages of using LuaJIT to run Lua code is that it can greatly increase the execution speed of the code. This allows APISIX to maintain low latency when processing a large number of requests, and to perform better in a high-concurrency environment.
Therefore, code that LuaJIT cannot compile will become a potential performance bottleneck.
We compiled the above information and submitted the issue "optimize the long tail request phenomenon" to nginx-lua-prometheus, where we connected with knyar, the project's author, and explored what could be optimized. knyar responded promptly, and together we clarified the places that could be improved. So a PR was submitted for the optimization: chore: use ngx.re.match instead of string match to improve performance.
In this PR, we mainly:
- Used ngx.re.match instead of string:match
- Replaced string:gsub with ngx.re.gsub (see the sketch below)
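To illustrate the spirit of the change (the variable names and patterns below are hypothetical, not the actual code in fix_histogram_bucket_labels): ngx.re.match and ngx.re.gsub go through OpenResty's PCRE binding, and the "jo" options enable PCRE JIT and regex caching, so the hot path no longer drops into the LuaJIT interpreter the way string.match and string.gsub do.

-- Hypothetical label string being normalized:
local labels = 'le="0.1",host="a"'
local new_le = "0.100"

-- Before: string.match / string.gsub are NYI in LuaJIT and always interpreted.
local bucket = labels:match('le="([^"]*)"')
local fixed  = labels:gsub('le="[^"]*"', 'le="' .. new_le .. '"')

-- After: ngx.re.* with the "jo" options (PCRE JIT + compiled-regex cache).
local m = ngx.re.match(labels, [[le="([^"]*)"]], "jo")
local bucket2 = m and m[1]
local fixed2 = ngx.re.gsub(labels, [[le="[^"]*"]], 'le="' .. new_le .. '"', "jo")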
After completing this optimization, we knew the performance improvement it brings is limited; it cannot fundamentally solve the problem. The root problem is:
NGINX is a multi-process single-threaded architecture. All worker processes will monitor TCP (Transmission Control Protocol) connections, but once a connection enters a worker process, it cannot be migrated to other worker processes for processing.
This means that if a worker process is very busy, other connections within that worker process may not be processed in time. Moreover, the single-threaded model within a process means that all CPU-intensive and IO-intensive tasks must be executed sequentially. If one task takes a long time to execute, other tasks may be starved, resulting in uneven processing times.
prometheus:metric_data() occupies a large amount of CPU time slices for calculation, squeezing CPU resources for processing normal requests. This is why you see a worker process at 100% CPU usage.
With this problem in mind, we continued the analysis after completing the above optimization and captured another flame graph:
In the flame graph above, builtin#100 denotes a LuaJIT/Lua built-in library function (such as string.find). The corresponding function name can be looked up with the ljff.lua script from openresty-devel-utils: https://github.com/openresty/openresty-devel-utils/blob/master/ljff.lua
How to use:
$ luajit ljff.lua 100
FastFunc table.sort
Since the calculation of metrics takes up an excessive amount of CPU, we considered yielding CPU time slices during the calculation.
For APISIX, processing normal requests has the highest priority, and CPU resources should be tilted toward it. Yielding inside prometheus:metric_data() only affects how quickly the Prometheus Exporter obtains its metrics.
In the OpenResty world, there is a little-known way to yield CPU time slices: ngx.sleep(0). We introduced this method into prometheus:metric_data(): after processing a certain number of metrics (for example, 200), it gives up its CPU time slice so that newly arrived requests have a chance to be processed.
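A minimal sketch of the idea; the batch size and names are illustrative, not the exact code from the PR:

local YIELD_EVERY = 200  -- illustrative batch size

local function format_metrics_with_yield(keys)
  local output = {}
  for i, key in ipairs(keys) do
    -- ... format the metric line for `key` ...
    output[#output + 1] = key .. "\n"

    -- Yield after every batch: ngx.sleep(0) re-queues the current light
    -- thread without actually sleeping, letting pending requests on this
    -- worker run before formatting continues.
    if i % YIELD_EVERY == 0 then
      ngx.sleep(0)
    end
  end
  return output
end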
We submitted a PR for this optimization: feat: performance optimization.
In our test scenario, with the total number of metrics at the 100,000 level, the wrk2 results before introducing this optimization were:
Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
50.000% 10.21ms
75.000% 12.03ms
90.000% 13.25ms
99.000% 92.80ms
99.900% 926.72ms
99.990% 932.86ms
99.999% 934.40ms
100.000% 934.91ms
Wrk2 results after the optimization:
Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
50.000% 4.34ms
75.000% 12.81ms
90.000% 16.12ms
99.000% 82.75ms
99.900% 246.91ms
99.990% 349.44ms
99.999% 390.40ms
100.000% 397.31ms
The P100 index is about 1/3 to 1/2 of that before optimization.
However, this does not solve the problem perfectly. Analyzing the flame graph after the optimization:
You can see that builtin#100 (table.sort) and builtin#92 (string.format) still occupy a considerable width on the horizontal axis, because:
- prometheus:metric_data() first calls table.sort to sort all the metrics. With metrics at the 100,000 level, this is equivalent to sorting 100,000 strings, and table.sort cannot be interrupted by ngx.sleep(0).
- The places where string.format and fix_histogram_bucket_labels are used cannot be optimized. After communicating with knyar, we learned that these steps are necessary to ensure that prometheus:metric_data() produces metrics in the correct format.
At this point, the code-level optimization methods had been exhausted, but unfortunately the problem was still not solved perfectly: there was still a noticeable lag in the P100 metric.
What to Do?
Let's go back to the core issue: prometheus:metric_data() occupies a large amount of CPU time slices for calculation, squeezing CPU resources for processing normal requests.
In Linux, is the unit to which the CPU allocates time slices a thread or a process? To be precise, it is a thread; the thread is the actual unit of scheduling. However, NGINX is a multi-process, single-threaded architecture, with only one thread in each process.
At this point we thought of an optimization direction: move prometheus:metric_data() to another thread (or, more precisely, another process). We investigated two options:
- Use ngx.run_worker_thread to run the prometheus:metric_data() calculation, which is equivalent to handing the CPU-intensive task over to a thread pool.
- Use a separate process to handle the prometheus:metric_data() calculation; this process does not handle normal requests.
After a PoC (proof of concept), we rejected option 1 and adopted option 2. Option 1 was rejected because ngx.run_worker_thread is only suitable for running request-independent computing tasks, while prometheus:metric_data() is request-related.
Implementation of option 2: let the privileged agent (privileged process) handle prometheus:metric_data(). However, the privileged process does not listen on any port or handle any requests by default, so we needed to modify the privileged process so that it can listen on a port.
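For background, this is roughly how the privileged agent pattern works in OpenResty; a general sketch only, not the actual APISIX patches (which additionally teach the privileged agent to listen on a port):

-- init_by_lua_block: enable the privileged agent process
local process = require "ngx.process"
local ok, err = process.enable_privileged_agent()
if not ok then
    ngx.log(ngx.ERR, "failed to enable privileged agent: ", err)
end

-- init_worker_by_lua_block: only the privileged agent runs the heavy job
local process = require "ngx.process"
if process.type() == "privileged agent" then
    local ok, err = ngx.timer.every(5, function()
        -- CPU-heavy work (e.g. assembling metric text) runs here,
        -- so regular worker processes keep serving client requests.
    end)
    if not ok then
        ngx.log(ngx.ERR, "failed to create timer: ", err)
    end
end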
Finally, we implemented option 2 with two PRs: feat: allow privileged agent to listen port and feat(prometheus): support collect metrics works in the privileged agent.
We tested APISIX with this optimization and found that the P100 latency dropped to a reasonable range and the long tail request phenomenon no longer occurred.
Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
50.000% 3.74ms
75.000% 4.66ms
90.000% 5.77ms
99.000% 9.99ms
99.900% 13.41ms
99.990% 16.77ms
99.999% 18.38ms
100.000% 18.40ms
This is an elegant solution that solves the core problem. We observed and verified it in the production environment: it eliminated the long tail request phenomenon and did not cause additional exceptions.
Summary
When we were fixing this problem, a new thought arose: is the open-source library nginx-lua-prometheus suitable for APISIX?
We solved the prometheus:metric_data() problem on the APISIX side. At the same time, we also found and fixed other problems in nginx-lua-prometheus, such as fix memory leak and LRU cache elimination.
nginx-lua-prometheus was originally designed for NGINX, not for OpenResty and OpenResty-based applications. However, no open-source project in the OpenResty ecosystem connects to the Prometheus ecosystem more maturely than nginx-lua-prometheus does. As a result, the open-source community keeps pushing nginx-lua-prometheus forward for use in the OpenResty ecosystem.
Perhaps we should explore more and find a way to connect to the Prometheus ecosystem without modifying the source code of APISIX. For example, design a dependency library that is more suitable for APISIX, or connect to mature projects in the Prometheus ecosystem in a certain way, and move the process of collecting and calculating metrics to those mature projects.