Various Debugging Methods in OpenResty

API7.ai

December 16, 2022

OpenResty (NGINX + Lua)

In OpenResty's communication groups, developers often ask the same question: how do you debug in OpenResty? As far as I know, there are some tools in OpenResty that support breakpoint debugging, including a plugin for VS Code, but they are not widely used so far. Even the author, agentzh, and the few contributors I know all use the simplest ngx.log and ngx.say to debug.
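For example, here is a minimal sketch of that style of debugging; the location name, variables, and values are made up purely for illustration:

location = /debug-demo {
    content_by_lua_block {
        local price    = 100
        local discount = 0.8

        -- write the intermediate value to error.log at the ERR level,
        -- so it is easy to find with grep
        ngx.log(ngx.ERR, "discounted price: ", price * discount)

        -- or print it straight into the response body
        ngx.say("discounted price: ", price * discount)
    }
}

Requesting /debug-demo then shows the value both in the response body and in error.log, which is often all you need to confirm or rule out a hypothesis.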

This is not friendly to most newcomers. Does it mean that the core maintainers of OpenResty have nothing better than the primitive method of printing logs when they run into a difficult problem?

Of course not. In the OpenResty world, SystemTap and flame graphs are the standard tools for tackling tough problems and performance issues. If you ask a question about one of these on the mailing list or in an issue, the project maintainers will ask you to upload a flame graph, preferring a graphical description over a textual one.

In the next two articles, I'll talk with you about debugging and the toolset OpenResty has created specifically for it. Today, we'll start by looking at what's generally available for debugging programs.

Breakpoints and Printing Logs

For a long time in my work, I relied on the advanced debugging features of an IDE (integrated development environment) to trace programs, which felt only natural. For issues that can be reproduced in a test environment, no matter how complex, I am confident that I can get to the root of the problem: because the bug can be reproduced repeatedly, its cause can always be found by setting breakpoints and printing logs. All you need is patience.

From this point of view, solving bugs that reproduce consistently in a test environment is mostly manual labor, and most of the bugs I solve in my work fall into this category.

However, note that there are two prerequisites: a test environment and stable reproduction. Reality is rarely that ideal. If a bug can only be reproduced in the production environment, is there a way to debug it?

Here I recommend a tool: Mozilla RR. You can think of it as a recorder: it records the program's behavior and then replays it repeatedly. Frankly, whether in production or in a test environment, as long as you can record the "evidence" of the bug, you can analyze it slowly afterward, like evidence in court.

Binary Search and Commenting Out Code

However, in some large projects the bug may come from one of many services, or from a SQL statement querying the database. In such cases, even if the bug can be reproduced reliably, you still can't be sure which part of the system it occurs in, so recording tools like Mozilla RR no longer help.

At this point, you may recall the classic binary search: comment out half of the logic in the code, and if the problem persists, the bug is in the code that is still active, so comment out half of what remains and repeat. Within a few rounds, the problem is narrowed down to a completely manageable size.
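Here is a contrived Lua sketch of the idea; the handler and its four step functions are hypothetical stubs, only there to make the bisection concrete:

-- Four hypothetical steps of a request handler; each stub just logs a
-- marker so the example stays self-contained.
local function step(name)
    return function()
        ngx.log(ngx.ERR, "step finished: ", name)
    end
end

local check_auth      = step("check_auth")
local rewrite_uri     = step("rewrite_uri")
local query_backend   = step("query_backend")
local render_response = step("render_response")

local function handle_request()
    check_auth()
    rewrite_uri()

    -- Round one of the bisection: comment out the two calls below.
    -- If the bug persists, it is in the two calls above;
    -- if it disappears, it is in the two calls below.
    query_backend()
    render_response()
end

return handle_request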

This approach may sound a bit dumb, but it is efficient in many scenarios. Of course, as technology advances and system complexity increases, we recommend using a standard like OpenTracing for distributed tracing.

With OpenTracing, instrumentation can be embedded in various parts of the system, and the call chain and events, composed of multiple spans tied together by a trace ID, are reported to a collector for analysis and graphical presentation. This can help developers find many hidden problems, and since the historical data is saved, we can compare and review it at any time.
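To make the idea concrete, here is a rough Lua sketch; the tracer object is hypothetical and stands in for whatever OpenTracing-compatible client your system uses, so treat the method names as illustrative rather than a particular library's API:

-- Wrap one unit of work (here, a database query) in its own span.
-- tracer, parent_span, and run_query are supplied by the caller and are
-- placeholders for this illustration.
local function traced_query(tracer, parent_span, sql, run_query)
    -- the child span shares the parent's trace ID, so the collector can
    -- stitch the whole call chain back together later
    local span = tracer:start_span("mysql:query", { child_of = parent_span })
    span:set_tag("db.statement", sql)

    local ok, err = run_query(sql)
    if not ok then
        span:set_tag("error", true)
    end

    span:finish()
    return ok, err
end

return traced_query

Every service along the request path creates spans in the same way, and the collector assembles them into a single trace for display.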

Also, if your system is more complex, for example in a microservices environment, then Zipkin and Apache SkyWalking are both good choices.

Dynamic Debugging

The debugging methods described above are enough to solve most problems. However, if you hit a fault that only occurs occasionally in production, tracking it down by adding logs and event tracking will take quite a lot of time.

Years ago, I was responsible for a system that exhausted its database resources at about 1:00 a.m. every day and brought the whole system down in an avalanche. During the day we checked the scheduled tasks in the code, and at night the team waited at the office for the bug to reappear, then checked the running state of each submodule when it did. We didn't find the cause until the third night.

My experience is similar to that of the Solaris system engineers who created DTrace. At the time, they also spent days and nights troubleshooting a weird production problem, only to find that a configuration item had been written incorrectly. But unlike me, they decided to avoid this kind of problem altogether and invented DTrace specifically for dynamic debugging.

Unlike static debugging tools such as GDB, dynamic debugging can debug services online. The whole process is imperceptible and non-intrusive to the program being debugged: there is no need to modify the code, let alone restart the service. To use an analogy, dynamic debugging is like an X-ray: it can examine the patient without drawing blood or performing a gastroscopy.

DTrace was one of the earliest dynamic tracing frameworks, and its influence led to the emergence of similar dynamic debugging tools on other systems. For example, engineers at Red Hat created SystemTap on Linux, which is what I'm going to talk about next.

SystemTap

SystemTap has its own DSL (domain-specific language), which is used to define probe points. Before we go into more detail, let's install SystemTap so we're not just talking in the abstract. On Debian or Ubuntu, you can simply use the system package manager:

sudo apt install systemtap

Let's look at what a hello world program written in SystemTap looks like:

# cat hello-world.stp
probe begin
{
  print("hello world!")
  exit()
}

Doesn't it look easy? Note that you need sudo privileges to run it:

sudo stap hello-world.stp

It will print out hello world!. In most scenarios, we don't need to write our own stap scripts to do the analysis, because OpenResty already provides many ready-made stap scripts for routine analysis, and I'll introduce them to you in the next article. So for today, a brief understanding of stap scripts is enough.

Now that we've had a little practice, let's get back to the concepts. SystemTap works by translating the stap script above into C and invoking the system C compiler to build a kernel module. When the module is loaded, it activates all the probe events by hooking into the kernel.

For example, the begin probe runs when the SystemTap session starts, and there is a corresponding end probe that runs when it exits, so the hello world program above can also be written in the following way:

probe begin
{
  print("hello ")
  exit()
}

probe end
{
  print("world!")
}

Here, I have given only a very cursory introduction to SystemTap. Frank Ch. Eigler, the author of SystemTap, wrote an e-book, Systemtap tutorial, which covers SystemTap in detail. If you want to go further and understand SystemTap in depth, that book is the best place to start.

Other Dynamic Tracing Frameworks

Still, SystemTap is not enough for kernel and performance-analysis engineers, for two reasons:

  1. SystemTap is not part of the Linux kernel by default; it has to compile and load its own kernel module.
  2. The way it works means that it is slow to start, and it may affect the normal operation of the system.

eBPF (extended BPF) is a feature added to the Linux kernel in recent years. Compared to SystemTap, eBPF is supported directly by the kernel, won't crash the system, and starts quickly. It also uses C syntax instead of a DSL, which makes it much easier to get started with.

In addition to open-source solutions, Intel's VTune is also one of the best tools in this space. Its intuitive interface and data presentation let you analyze performance bottlenecks without writing any code.

Flame Graph

Finally, let's recall the flame graph mentioned in the previous article. As we mentioned earlier, the data generated by tools such as perf and SystemTap can be displayed much more intuitively as a flame graph. The following diagram is an example.

(figure: an example flame graph)

In a flame graph, the colors and shades of the blocks carry no meaning; they only distinguish neighboring blocks from one another. A flame graph is a superposition of all the sampled stacks, so the useful data are the width of the blocks and the height of the stacks.

For an on-CPU flame graph, the width of a block is the percentage of CPU time taken by that function: the wider the block, the greater the performance cost. A flat-topped plateau is where the performance bottleneck lies. The height of the stack, on the other hand, represents the depth of the function calls: the top box is the function currently running, and everything below it is a caller. So each block is the caller of the block above it, and the higher the peak, the deeper the call stack.

Summary

It is essential to understand that even a non-intrusive technique like dynamic tracing is not perfect. It can only observe a single process at a time, and in general we only turn it on briefly and use the data sampled during that window. So if you need to trace across multiple services or over long periods, you still need a distributed tracing solution such as OpenTracing.

What debugging tools and techniques do you use in your daily work? Feel free to leave a comment and discuss with me, and to share this article with your friends so that we can learn and make progress together.