`systemtap-toolkit` and `stapxx`: How to Use Data to Solve Difficult Problems?


December 22, 2022

OpenResty (NGINX + Lua)

As introduced in the previous article, as server-side development engineers, we don't do in-depth learning of dynamic debugging toolsets, but mostly stay at the level of usage and write some simple staple scripts at most. The lower level, such as CPU cache, architecture, compiler, etc., is the domain of performance engineers.

There are two open-source projects in OpenResty: openresty-systemtap-toolkit and stapxx. They are systemtap-based wrapped toolsets for real-time analysis and diagnosis of NGINX and OpenResty. They can cover on-CPU, off-CPU, shared dictionary, garbage collection, request latency, memory pooling, connection pooling, file access, and other common features and debugging scenarios.

In this article, we will browse these tools and their corresponding usage, to help us quickly find tools to locate problems when we encounter troubleshooting issues with NGINX and OpenResty. In the world of OpenResty, learning to use these tools is a surefire way to get ahead, and a very effective way to communicate with other developers - after all, the data generated by the tools will be more accurate and detailed than we can describe in words.

However, it is essential to note that LuaJIT GC64 mode is enabled by default in OpenResty version 1.15.8, but openresty-systemtap-toolkit and stapxx do not follow the corresponding changes, which will cause the tools inside not to work correctly. So, we'd better use these tools in the old 1.13 version of OpenResty.

Most open-source project contributors are part-time and are not obligated to keep the tools working, which we need to be aware of when using open-source projects.

Example: shared dict

Let's start with a tool we are probably most familiar with and easiest to get started with, ngx-lua-shdict, as an example to kick off today's post.

ngx-lua-shdict is a tool that analyzes NGINX's shared dicts and tracks their operations. You can use the -f option to specify the dict and key to get the data in the shared dict. The --raw option allows you to export the raw value of the specified key.

The following is a command-line example of getting data from a shared dict.

# Assuming the NGINX Worker PID is 5050
$ ./ngx-lua-shdict -p 5050 -f --dict dogs --key Jim --luajit20
Tracing 5050 (/opt/nginx/sbin/nginx)...

value: true
expires: 1372719243270
flags: 0xa

Similarly, we can use the -w option to keep track of dict writes for a given key:

$./ngx-lua-shdict -p 5050 -w --key Jim --luajit20
Tracing 5050 (/opt/nginx/sbin/nginx)...

Hit Ctrl-C to end

set Jim exptime=4626322717216342016
replace Jim exptime=4626322717216342016

Let's see how this tool is implemented. ngx-lua-shdict is a Perl script, but the implementation has nothing to do with Perl, which is just used to generate the stap script and run it.

open my $in, "|stap $stap_args -x $pid -" or die "Cannot run stap: $!\n";

We can write it in Python, PHP, Go, or any language we like. The key point in the stap script is the following line of code:

probe process("$nginx_path").function("ngx_http_lua_shdict_set_helper")

This is the probe we mentioned before, probing the function ngx_http_lua_shdict_set_helper. The calls to this function are all in the lua-nginx-module module's lua-nginx-module/src/ngx_http_lua_shdict.c file.

static int
ngx_http_lua_shdict_add(lua_State *L)
return ngx_http_lua_shdict_set_helper(L, NGX_HTTP_LUA_SHDICT_ADD);

static int
ngx_http_lua_shdict_safe_add(lua_State *L)
return ngx_http_lua_shdict_set_helper(L, NGX_HTTP_LUA_SHDICT_ADD

static int
ngx_http_lua_shdict_replace(lua_State *L)
return ngx_http_lua_shdict_set_helper(L, NGX_HTTP_LUA_SHDICT_REPLACE);

In this way, we can trace all the operations of the shared dict by simply probing this function.

on-CPU, off-CPU

In using OpenResty, the most common problem we encounter should be performance. There are two main types of poor performance, i.e., low QPS: too high CPU usage and too low CPU usage. The former bottleneck may be due to not using the performance optimization methods we introduced earlier, while the latter may be due to blocking functions. The corresponding on-CPU and off-CPU flame graphs can help us identify the ultimate root cause.

To generate C-level on-CPU flame graphs, you need to use sample-bt in systemtap-toolkit; while Lua-level on-CPU flame graphs are generated by lj-lua-stacks in stapxx.

Let's take sample-bt as an example of how to use it. sample-bt is a script that samples the call stack of any user process we specify (not just NGINX and OpenResty processes).

For example, we can sample a running NGINX Worker process (with a PID of 8736) for 5 seconds with the following code:

$ ./sample-bt -p 8736 -t 5 -u > a.bt
WARNING: Tracing 8736 (/opt/nginx/sbin/nginx) in user-space only...
WARNING: Missing unwind data for module, rerun with 'stap -d stap_df60590ce8827444bfebaf5ea938b5a_11577'
WARNING: Time's up. Quitting now...(it may take a while)
WARNING: Number of errors: 0, skipped probes: 24

It outputs the result file a.bt, which can be used to generate a flame graph using the FlameGraph toolset:

stackcollapse-stap.pl a.bt > a.cbt
flamegraph.pl a.cbt > a.svg

The a.svg is the generated flame graph, which we can open in the browser to view. However, during the sampling period, we must maintain a certain number of requests. Otherwise, there will be no way to generate the flame graph if the sample count is 0.

Next, we'll look at how to sample off-CPU, using the script sample-bt-off-cpu in systemtap-toolkit, which is similar to sample-bt, as follows:

$ ./sample-bt-off-cpu -p 10901 -t 5 > a.bt
WARNING: Tracing 10901 (/opt/nginx/sbin/nginx)...
WARNING: _stp_read_address failed to access memory location
WARNING: Time's up. Quitting now...(it may take a while)
WARNING: Number of errors: 0, skipped probes: 23

In stapxx, the tool for analyzing latency is epoll-loop-blocking-distr, which samples the specified user process and outputs the latency distribution between successive epoll_wait system calls.

$ ./samples/epoll-loop-blocking-distr.sxx -x 19647 --arg time=60
Start tracing 19647...
Please wait for 60 seconds.
Distribution of epoll loop blocking latencies (in milliseconds)
max/avg/min: 1097/0/0
value |-------------------------------------------------- count
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  18471
    1 |@@@@@@@@                                            3273
    2 |@                                                    473
    4 |                                                     119
    8 |                                                      67
   16 |                                                      51
   32 |                                                      35
   64 |                                                      20
  128 |                                                      23
  256 |                                                       9
  512 |                                                       2
 1024 |                                                       2
 2048 |                                                       0
 4096 |                                                       0

As we can see, this output shows that the vast majority of delays are less than 1 ms, but there are a few that are above 200 ms, and these are the ones to watch.

Upstream and Phase Tracking

In addition to performance issues that may occur in OpenResty's code itself, when OpenResty communicates with upstream services through upstream modules like cosocket or proxy_pass, if the upstream service itself has a large latency, it can also have a significant impact on the overall performance.

At this time, we can use the tools ngx-lua-tcp-recv-time, ngx-lua-udp-recv-time and ngx-single-req-latency for analysis. Here is an example of ngx-single-req-latency.

This tool is not quite the same as most tools inside the toolset. Most other tools are based on a large number of samples and statistical analysis to draw a mathematical conclusion about the distribution. Instead, ngx-single-req-latency analyzes individual requests and tracks the time spent on individual requests at various phases in OpenResty, such as the rewrite, access, and content phases, as well as upstream.

We can look at a specific example code:

# making the ./stap++ tool visible in PATH:
$ export PATH=$PWD:$PATH

# assuming an nginx worker process's pid is 27327
$ ./samples/ngx-single-req-latency.sxx -x 27327
Start tracing process 27327 (/opt/nginx/sbin/nginx)...

POST /api_json
    total: 143596us, accept() ~ header-read: 43048us, rewrite: 8us, pre-access: 7us, access: 6us, content: 100507us
    upstream: connect=29us, time-to-first-byte=99157us, read=103us

$ ./samples/ngx-single-req-latency.sxx -x 27327
Start tracing process 27327 (/opt/nginx/sbin/nginx)...

GET /robots.txt
    total: 61198us, accept() ~ header-read: 33410us, rewrite: 7us, pre-access: 7us, access: 5us, content: 27750us
    upstream: connect=30us, time-to-first-byte=18955us, read=96us

This tool will track the first request it encounters after it starts. The output is very similar to opentracing, and we can even think of systemtap-toolkit and stapxx, as non-intrusive versions of APM (Application Performance Management) in OpenResty.


In addition to the common tools I've talked about today, OpenResty naturally provides many more tools that you can explore and learn on your own.

There is one more significant difference between OpenResty and other development languages and platforms in terms of tracing technology, which we can slowly appreciate.

Keep the code clean and stable, don't add probes to it, but sample it through external dynamic tracking techniques.

What tools have you used to track and analyze issues while using OpenResty? You are welcome to share this article, and we will share and progress together.