1s to 10ms: Reducing Prometheus Delay in API Gateway
ZhengSong Tu
January 31, 2023
Requests With Delays of 1 to 2 Seconds
In the APISIX community, users have reported a mysterious phenomenon: when traffic requests enter a normally deployed APISIX cluster, a small number of requests have an exceptionally long delay of 1 to 2 seconds. The user's QPS scale is about 10,000, and this kind of abnormal request is rare, appearing only 1 to 3 times every few minutes.
Users provided screenshots of high-latency requests in the issues they filed. The screenshots show that some requests indeed had high latency, even reaching the second level.
Another noticeable phenomenon accompanies this: the CPU usage of one of the worker processes reaches 100%.
The development team identified the conditions under which this phenomenon occurs:
- The Prometheus plugin is enabled, and a Prometheus Exporter accesses the APISIX endpoint /apisix/prometheus/metrics to collect metrics.
- The number of metrics the Prometheus plugin counts reaches a certain scale, usually tens of thousands.
This phenomenon is called "long tail requests" in the industry. It refers to the situation where, within a group of requests, a small percentage have an abnormally long response time while most respond normally. It could be due to a performance bottleneck in the backend system, insufficient resources, or other reasons. It's not a fatal bug, but it seriously affects the end-user experience.
Breaking It Down
APISIX's Prometheus plugin is built on the open-source Lua library nginx-lua-prometheus and provides the functionality of tracking and collecting metrics. When the Prometheus Exporter accesses the metrics endpoint exposed by APISIX, APISIX calls a function provided by nginx-lua-prometheus to compute and expose the metric results.
The development team narrowed the problem down to prometheus:metric_data(), the function in nginx-lua-prometheus used to expose metrics.
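For orientation, a minimal use of nginx-lua-prometheus looks roughly like the following. This is a simplified sketch rather than APISIX's actual plugin code; the metrics endpoint handler is where the expensive work happens:

-- init_worker_by_lua_block: create a registry backed by a shared dict
local prometheus = require("prometheus").init("prometheus_metrics")
local http_requests = prometheus:counter(
    "http_requests_total", "Number of HTTP requests", {"host", "status"})

-- log_by_lua_block: record metrics for each request
http_requests:inc(1, {ngx.var.host, ngx.var.status})

-- content_by_lua_block for the metrics endpoint:
-- collect() formats everything stored in shared memory (via metric_data())
-- and prints it in the Prometheus text format.
prometheus:collect()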
However, this was only a preliminary inference. Direct evidence was needed to prove that the long tail requests were related to it, and the following questions had to be clarified:
- What exactly does this function do?
- Why does this function cause long tail requests?
The development team constructed a local environment to reproduce the problem, which mainly simulates the following scenarios:
- A simulated client sends normal requests, which APISIX proxies to the upstream.
- A simulated Prometheus Exporter visits /apisix/prometheus/metrics every 5 seconds, triggering APISIX to call the prometheus:metric_data() function.
Reproduced Environment:
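As a rough sketch, the reproduction can be driven by a scrape loop plus a constant-rate load generator. The addresses, route, and rate below are assumptions, not the team's exact configuration; point the scrape loop at wherever your APISIX exposes /apisix/prometheus/metrics:

# Simulated Prometheus Exporter: scrape the metrics endpoint every 5 seconds
while true; do
  curl -s http://127.0.0.1:9091/apisix/prometheus/metrics > /dev/null
  sleep 5
done

# Simulated client: constant-rate load through the APISIX proxy
# (wrk2 builds a binary named wrk; -R fixes the request rate and
#  --latency prints the HdrHistogram latency distribution shown below)
wrk -t4 -c100 -d120s -R2000 --latency http://127.0.0.1:9080/hello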
While performing the test, we observed P100 and other percentiles in the wrk2 results to confirm whether the long tail request phenomenon occurred, and generated a flame graph of APISIX to observe CPU usage when it did.
Test results from wrk2:
Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
50.000% 1.13ms
75.000% 2.56ms
90.000% 4.82ms
99.000% 14.70ms
99.900% 27.95ms
99.990% 74.75ms
99.999% 102.78ms
100.000% 102.78ms
According to this result, 99% of the requests completed within 14.70 milliseconds during the test, while a small number of requests took more than 100 milliseconds. We then ran multiple tests using the number of metrics as a variable and found that P100 latency grows linearly with the number of metrics: when the metrics reach the 100,000 level, P100 reaches the second level.
The resulting flame graph:
As seen from the function stack in the flame graph, prometheus:metric_data() occupies the widest span on the horizontal axis, which indicates heavy CPU consumption and points directly to prometheus:metric_data() as the cause of the long tail requests.
Let's briefly analyze what the prometheus:metric_data() function does.
prometheus:metric_data() will fetch metrics from shared memory, categorize them, and process them into a Prometheus-compatible text format. In this process, all metrics will be sorted in lexicographical order, and the prefixes of metrics will be processed with regular expressions. As a rule of thumb, these are very expensive operations.
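For intuition, here is a heavily simplified sketch of that flow. It is not the actual nginx-lua-prometheus code, which also handles counters, gauges, and histograms separately, escapes labels, and emits proper HELP/TYPE metadata:

-- Illustrative only: roughly what prometheus:metric_data() has to do.
local function metric_data(dict)
  -- Fetch every metric key from the shared dict (tens of thousands at scale).
  local keys = dict:get_keys(0)
  -- Sort all keys lexicographically so output is grouped per metric family.
  table.sort(keys)

  local output = {}
  local seen = {}
  for _, key in ipairs(keys) do
    local value = dict:get(key)
    -- Derive the metric name from the key prefix with a pattern match
    -- and emit the family header once.
    local name = key:match("^([^{]+)")
    if name and not seen[name] then
      seen[name] = true
      output[#output + 1] = "# TYPE " .. name .. " counter\n"
    end
    output[#output + 1] = string.format("%s %s\n", key, value)
  end
  return output
end

Both the full sort and the per-key string processing scale with the number of metrics, which matches the linear growth of P100 latency observed above.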
An Imperfect Optimization Attempt
Once we had located the code causing the delay, the next step was to analyze it in detail alongside the flame graph and find room for optimization.
From the flame graph, we located the fix_histogram_bucket_labels function. Reviewing it, we found two sensitive calls: string:match and string:gsub. Neither of these functions can be JIT-compiled by LuaJIT; they can only be interpreted.
LuaJIT is a JIT compiler for the Lua programming language, which can compile Lua code into machine code and run it. This can provide higher performance than using an interpreter to run Lua code.
One of the advantages of using LuaJIT to run Lua code is that it can greatly increase the execution speed of the code. This allows APISIX to maintain low latency when processing a large number of requests, and to perform better in a high-concurrency environment.
Therefore, code that LuaJIT cannot compile will become a potential performance bottleneck.
We compiled the above information and submitted the issue "optimize the long tail request phenomenon" to nginx-lua-prometheus, where we connected with knyar, the project's author, and explored what could be optimized. knyar responded promptly, and together we clarified the places that could be improved. So a PR was submitted for the optimization: chore: use ngx.re.match instead of string match to improve performance.
In this PR, we mainly:
- Used ngx.re.match instead of string:match
- Replaced string:gsub with ngx.re.gsub (see the sketch below)
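To illustrate the spirit of the change (the variable names and patterns below are hypothetical, not the actual code in fix_histogram_bucket_labels): ngx.re.match and ngx.re.gsub go through OpenResty's PCRE binding, and the "jo" options enable PCRE JIT and regex caching, so the hot path no longer drops into the LuaJIT interpreter the way string.match and string.gsub do.

-- Hypothetical label string being normalized:
local labels = 'le="0.1",host="a"'
local new_le = "0.100"

-- Before: string.match / string.gsub are NYI in LuaJIT and always interpreted.
local bucket = labels:match('le="([^"]*)"')
local fixed  = labels:gsub('le="[^"]*"', 'le="' .. new_le .. '"')

-- After: ngx.re.* with the "jo" options (PCRE JIT + compiled-regex cache).
local m = ngx.re.match(labels, [[le="([^"]*)"]], "jo")
local bucket2 = m and m[1]
local fixed2 = ngx.re.gsub(labels, [[le="[^"]*"]], 'le="' .. new_le .. '"', "jo")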
After completing this optimization, we knew the performance improvement it brings is limited; it cannot fundamentally solve the problem. The root problem is:
NGINX is a multi-process single-threaded architecture. All worker processes will monitor TCP (Transmission Control Protocol) connections, but once a connection enters a worker process, it cannot be migrated to other worker processes for processing.
This means that if a worker process is very busy, other connections within that worker process may not be processed in time. Moreover, the single-threaded model within a process means that all CPU-intensive and IO-intensive tasks must be executed sequentially. If one task takes a long time to execute, other tasks may be starved, resulting in uneven processing times.
prometheus:metric_data() occupies a large amount of CPU time slices for calculation, squeezing CPU resources for processing normal requests. This is why you see a worker process at 100% CPU usage.
With this problem in mind, we continued the analysis after completing the above optimization and captured another flame graph:
In the flame graph above, builtin#100 denotes a LuaJIT/Lua built-in library function (such as string.find). The corresponding function name can be looked up with the ljff.lua script from openresty-devel-utils: https://github.com/openresty/openresty-devel-utils/blob/master/ljff.lua
How to use:
$ luajit ljff.lua 100
FastFunc table.sort
Since the calculation of metrics takes up an excessive amount of CPU, we considered yielding CPU time slices during the calculation.
For APISIX, processing normal requests has the highest priority, and CPU resources should be tilted toward it. Yielding inside prometheus:metric_data() only affects how quickly the Prometheus Exporter obtains its metrics.
In the OpenResty world, there is a little-known way to yield CPU time slices: ngx.sleep(0). We introduced this method into prometheus:metric_data(): after processing a certain number of metrics (for example, 200), it gives up its CPU time slice so that newly arrived requests have a chance to be processed.
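A minimal sketch of the idea; the batch size and names are illustrative, not the exact code from the PR:

local YIELD_EVERY = 200  -- illustrative batch size

local function format_metrics_with_yield(keys)
  local output = {}
  for i, key in ipairs(keys) do
    -- ... format the metric line for `key` ...
    output[#output + 1] = key .. "\n"

    -- Yield after every batch: ngx.sleep(0) re-queues the current light
    -- thread without actually sleeping, letting pending requests on this
    -- worker run before formatting continues.
    if i % YIELD_EVERY == 0 then
      ngx.sleep(0)
    end
  end
  return output
end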
We submitted a PR for this optimization: feat: performance optimization.
In our test scenario, with the total number of metrics at the 100,000 level, the wrk2 results before introducing this optimization were:
Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
50.000% 10.21ms
75.000% 12.03ms
90.000% 13.25ms
99.000% 92.80ms
99.900% 926.72ms
99.990% 932.86ms
99.999% 934.40ms
100.000% 934.91ms
Wrk2 results after the optimization:
Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
50.000% 4.34ms
75.000% 12.81ms
90.000% 16.12ms
99.000% 82.75ms
99.900% 246.91ms
99.990% 349.44ms
99.999% 390.40ms
100.000% 397.31ms
The P100 index is about 1/3 to 1/2 of that before optimization.
However, this does not solve the problem perfectly. Analyzing the flame graph after the optimization:
You can see that builtin#100 (table.sort) and builtin#92 (string.format) still occupy a considerable width on the horizontal axis, because:
- prometheus:metric_data() first calls table.sort to sort all the metrics. With metrics at the 100,000 level, this is equivalent to sorting 100,000 strings, and table.sort cannot be interrupted by ngx.sleep(0).
- The places where string.format and fix_histogram_bucket_labels are used cannot be optimized. After communicating with knyar, we learned that these steps are necessary to ensure that prometheus:metric_data() produces metrics in the correct format.
At this point, the code-level optimization methods had been exhausted, but unfortunately the problem was still not solved perfectly: there was still a noticeable lag in the P100 metric.
What to Do?
Let's go back to the core issue: prometheus:metric_data() occupies a large amount of CPU time slices for calculation, squeezing CPU resources for processing normal requests.
In Linux, is the unit to which the CPU allocates time slices a thread or a process? To be precise, it is a thread; the thread is the actual unit of scheduling. However, NGINX is a multi-process, single-threaded architecture, with only one thread in each process.
At this point we thought of an optimization direction: move prometheus:metric_data() to another thread (or, more precisely, another process). We investigated two options:
- Use ngx.run_worker_thread to run the prometheus:metric_data() calculation, which is equivalent to handing the CPU-intensive task over to a thread pool.
- Use a separate process to handle the prometheus:metric_data() calculation; this process does not handle normal requests.
After a PoC (proof of concept), we rejected option 1 and adopted option 2. Option 1 was rejected because ngx.run_worker_thread is only suitable for running request-independent computing tasks, while prometheus:metric_data() is request-related.
Implementation of option 2: let the privileged agent (privileged process) handle prometheus:metric_data(). However, the privileged process does not listen on any port or handle any requests by default, so we needed to modify the privileged process so that it can listen on a port.
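For background, this is roughly how the privileged agent pattern works in OpenResty; a general sketch only, not the actual APISIX patches (which additionally teach the privileged agent to listen on a port):

-- init_by_lua_block: enable the privileged agent process
local process = require "ngx.process"
local ok, err = process.enable_privileged_agent()
if not ok then
    ngx.log(ngx.ERR, "failed to enable privileged agent: ", err)
end

-- init_worker_by_lua_block: only the privileged agent runs the heavy job
local process = require "ngx.process"
if process.type() == "privileged agent" then
    local ok, err = ngx.timer.every(5, function()
        -- CPU-heavy work (e.g. assembling metric text) runs here,
        -- so regular worker processes keep serving client requests.
    end)
    if not ok then
        ngx.log(ngx.ERR, "failed to create timer: ", err)
    end
end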
Finally, we implemented option 2 with two PRs: feat: allow privileged agent to listen port and feat(prometheus): support collect metrics works in the privileged agent.
We tested APISIX with this optimization and found that the P100 latency dropped to a reasonable range and the long tail request phenomenon no longer occurred.
Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
50.000% 3.74ms
75.000% 4.66ms
90.000% 5.77ms
99.000% 9.99ms
99.900% 13.41ms
99.990% 16.77ms
99.999% 18.38ms
100.000% 18.40ms
This is an elegant solution that solves the core problem. We observed and verified it in the production environment: it eliminated the long tail request phenomenon and did not cause additional exceptions.
Summary
When we were fixing this problem, a new thought arose: is the open-source library nginx-lua-prometheus suitable for APISIX?
We solved the prometheus:metric_data() problem on the APISIX side. At the same time, we also found and fixed other problems in nginx-lua-prometheus, such as fix memory leak and LRU cache elimination.
nginx-lua-prometheus was originally designed for NGINX, not for OpenResty and OpenResty-based applications. However, no open-source project in the OpenResty ecosystem connects to the Prometheus ecosystem more maturely than nginx-lua-prometheus does. As a result, the open-source community keeps pushing nginx-lua-prometheus forward for use in the OpenResty ecosystem.
Perhaps we should explore more and find a way to connect to the Prometheus ecosystem without modifying the source code of APISIX. For example, design a dependency library that is more suitable for APISIX, or connect to mature projects in the Prometheus ecosystem in a certain way, and move the process of collecting and calculating metrics to those mature projects.