A Coredump Story About NGINX ctx and error_page

Wei Liu

September 16, 2022

Technology

Coredumps in a production environment are terrifying for engineers, especially when the crashes happen in the middle of the night. I experienced this myself more than a decade ago. To catch a “ghost” bug that only appeared at night, the whole team went home to sleep in the afternoon and came back to the office at midnight to wait for the crash and check each service one by one. Imagine a dark night, an office building with only a few workstations lit, and everyone watching the monitors for data and logs: an intense and exciting experience. It took us three nights of this to locate the bug and fix it, and it is an experience I have never forgotten.

Little did we know that we would go through this again. This time, though, the coredump was not in our own service but in a client’s online service, which made things far more challenging than debugging in an environment we controlled:

  • The error only occurred in the user’s production environment; the user could not reproduce it even in their own pre-production and testing environments.
  • The user ran custom code on top of the open-source Apache APISIX, which we could not see until a separate NDA was signed.
  • We could not add debug logs.
  • The error only occurred at around 3 am, so handling it immediately required additional resources on the client’s side.

In short, we had a problem we could not reproduce, no complete source code, and no environment to test in, yet we still had to locate the error, find a way to reproduce it, and eventually deliver a fix. The process brought plenty of challenges, and we learned a lot from them. This post shares some of the interesting points we ran into while debugging; I hope they provide some guidance for anyone debugging NGINX and APISIX.

Problem Description

After the user upgraded APISIX from version 2.4.1 to version 2.13.1, coredumps began occurring periodically, with the following error message:

Error Info

From the coredump, we could see that the segmentation fault occurred in lua-var-nginx-module. Here is part of the corresponding request data in memory:

#0  0x00000000005714d4 in ngx_http_lua_var_ffi_scheme (r=0x570d700, scheme=0x7fd2c241a760)
    at /tmp/vua-openresty/openresty-1.19.3.1-build/openresty-1.19.3.1/../lua-var-nginx-module/src/ngx_http_lua_var_module.c:152
152        /tmp/vua-openresty/openresty-1.19.3.1-build/openresty-1.19.3.1/../lua-var-nginx-module/src/ngx_http_lua_var_module.c: No such file or directory.
(gdb) print *r
$1 = {signature = 0, connection = 0x0, ctx = 0x15, main_conf = 0x5adf690, srv_conf = 0x0, loc_conf = 0x0, read_event_handler = 0xe, write_event_handler = 0x5adf697,
  cache = 0x2, upstream = 0x589c15, upstream_states = 0x0, pool = 0x0, header_in = 0xffffffffffffffff}

We can see that the request structure in memory is corrupted: fields such as connection, pool, and ctx hold values that make no sense for a live request.

Some Thoughts Before Analysis

Segmentation fault usually occurs in two scenarios:

  1. Reading or writing an invalid memory address, for example through an out-of-bounds array index; the segmentation fault occurs immediately.
  2. Performing an invalid write that happens to land on a valid address, so no segmentation fault is generated right away. Later, while the program keeps running, it accesses the data at the corrupted location and only then crashes. For example, when a pointer’s value is mistakenly overwritten, the segmentation fault is triggered only when that pointer is later dereferenced.

For the first scenario, the problem is easy to locate by examining the call stack.
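
To make the two scenarios concrete, here is a deliberately crashing LuaJIT FFI snippet for the first one (purely illustrative, unrelated to the APISIX code): the faulty access and the crash are the same line, so the backtrace points straight at the bug.

local ffi = require("ffi")

-- dereference an obviously invalid address: the segmentation fault happens
-- immediately, on this very line, so a gdb backtrace lands right on the bug
local p = ffi.cast("const int *", 0)
print(p[0])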

For the second scenario, since it is not clear where the corruption happened in the first place, the code that introduced the error and the code that triggered the segmentation fault may sit in different places, or even be completely unrelated, which makes the bug hard to locate (a minimal sketch of this delayed-crash pattern follows at the end of this section). All we could do was collect more context around the crash, such as:

  • Current APISIX’s configuration details.
  • The processing stage of the current request.
  • The details of the current request.
  • The number of concurrent connections.
  • The error log.

With this information, we would try to find the bug by reviewing the code and then look for a way to reproduce the problem.
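
And here is a matching LuaJIT FFI sketch of the second scenario (again purely illustrative): the bad write succeeds silently, and the process only crashes much later, at an innocent-looking read.

local ffi = require("ffi")

ffi.cdef[[
typedef struct {
    const char *name;   /* stand-in for a pointer field such as r->ctx */
} item_t;
]]

local hello = ffi.new("char[6]", "hello")    -- a valid C string buffer
local item  = ffi.new("item_t")
item.name   = hello
print(ffi.string(item.name))                 -- prints "hello": everything is fine

-- the "bug": some unrelated code overwrites the field with a garbage value;
-- this write itself does not crash, because the struct memory is valid
item.name = ffi.cast("const char *", 0x15)

-- ... much later, an innocent-looking read dereferences the bad pointer,
-- and only here does the segmentation fault occur
print(ffi.string(item.name))

The parallel with the gdb output above, where ctx holds 0x15, is intentional: the value that finally crashes the process says very little about which code wrote it.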

Analysis

1. Context Check of the Errors

After careful analysis, we found that the error usually occurred between 3 am and 4 am, with the following coredump error message:

Error Log

Eventually, we realized that the error log was tied to the user’s own operations: for business reasons, all of the upstream information was cleared at that time of night. So we suspected that the problem was associated with the upstream clear/reset operation, and that the coredump might be caused by entering an exception branch after the resulting error was returned.

2. Get Lua’s Call Stack

Since GDB cannot trace Lua’s call stack on its own, we could not tell which Lua call produced the error. So the first thing we needed was Lua’s call stack, and we had two ways to get it:

  1. Add code to the lua-var-nginx-module library to print out the call stack (for example with debug.traceback()), as sketched below. The downside is that it generates a lot of error logs, which has a negative impact on the production environment.
  2. Use openresty-gdb-utils, maintained by API7.ai. This toolkit uses GDB’s scripting facilities and Python extensions to give GDB the ability to analyze Lua’s call stack. Note that debug symbols must be enabled at compile time; they are included by default in APISIX builds.
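
For the first method, rather than editing the library source in place, the instrumentation can be sketched as a wrapper around its lookup function. This is a minimal sketch assuming the resty.ngxvar module and its documented fetch(name, request) entry point; it would have to run early enough (for example in init_by_lua) that no other module has already captured a local reference to fetch:

-- diagnostic wrapper: log the Lua call stack for every variable lookup so the
-- crashing access can be traced back (this is what floods the error log)
local ngxvar     = require("resty.ngxvar")
local orig_fetch = ngxvar.fetch

ngxvar.fetch = function(name, request)
    ngx.log(ngx.ERR, "ngxvar.fetch('", name, "') called from: ",
            debug.traceback())
    return orig_fetch(name, request)
end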

With these tools, we obtained the following call stacks.

C's Call Stack

Lua's Call Stack

Combining the Lua and C call stacks, we could see that the coredump happened because the Prometheus plugin accessed ngx.ctx.api_ctx.var.scheme. However, the root cause of the crash was still unclear, so we needed to analyze it further.

3. The Cache Caused the Problem

Accessing a variable through ngx.ctx.api_ctx.var ends up calling into the lua-var-nginx-module described above. For efficiency, this layer caches the current NGINX request object. Since the coredump showed a corrupted request structure, the validity of this cached request was also questionable. To verify the idea, we stopped the code from using the cached request and re-fetched the request on every access:

else
    --val = get_var(key, t._request)
    val = get_var(key, nil)              <============ t._request changed to nil
end

The user ran this modified version for an entire night and the error did not occur again, which confirmed that the cached request object was the problem.
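
For context, the caching layer involved here sits in APISIX’s ctx handling and looks roughly like the following. This is a simplified sketch, not the exact apisix/core/ctx.lua code; the resty.ngxvar calls correspond to that library’s fetch and request helpers, while set_vars_meta and the metatable plumbing are approximations:

-- simplified sketch of the caching pattern (approximate, not the exact
-- apisix/core/ctx.lua implementation)
local get_var     = require("resty.ngxvar").fetch    -- FFI-backed variable lookup
local get_request = require("resty.ngxvar").request  -- returns the raw C request handle

local mt = {
    __index = function(t, key)
        -- t._request was fetched once when ctx.var was created and is then
        -- reused for every ngx.ctx.api_ctx.var.<key> lookup
        return get_var(key, t._request)
    end
}

local function set_vars_meta(ctx)
    -- the cached handle lives exactly as long as ngx.ctx does, which is why a
    -- stale ngx.ctx ends up handing a dangling request pointer to the C side
    ctx.var = setmetatable({ _request = get_request() }, mt)
end

The t._request-to-nil change in the test build simply stops trusting this cached handle and, as described above, resolves the request afresh on every access.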

4. The Error in ngx.ctx

The cached request object lives inside ngx.ctx, and the only place that reassigns ngx.ctx is in apisix/init.lua:

function _M.http_header_filter_phase()
    if ngx_var.ctx_ref ~= '' then
        -- prevent for the table leak
        local stash_ctx = fetch_ctx()

        -- internal redirect, so we should apply the ctx
        if ngx_var.from_error_page == "true" then
            ngx.ctx = stash_ctx          <================= HERE
        end
    end

    core.response.set_header("Server", ver_header)

    local up_status = get_var("upstream_status")
    if up_status then
        set_resp_upstream_status(up_status)
    end

Why does ngx.ctx need to be recovered here? ngx.ctx is stored in the Lua registry, and the corresponding table is found through a registry index. That index is kept in the ctx member of the NGINX request structure. Every time an internal redirect happens, NGINX clears ctx, so the index is lost and the original ngx.ctx can no longer be found.

To solve this, APISIX keeps a Lua table of its own: before redirecting, it saves the current ngx.ctx in this table and records its index in an NGINX variable (ngx.var.ctx_ref). Whenever ngx.ctx needs to be restored, it can be fetched from this table directly.
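
In simplified form, the stash/restore mechanism looks like the sketch below. The names ctx_ref and fetch_ctx mirror the snippet above, but the body is illustrative; the real implementation in apisix/init.lua differs in its details:

-- minimal sketch of the stash/restore idea; ctx_ref must be declared as an
-- nginx variable (e.g. `set $ctx_ref '';`) so that its value survives the
-- internal redirect triggered by error_page
local ctxs     = {}
local next_ref = 0

local function stash_ctx()
    next_ref = next_ref + 1
    ctxs[next_ref] = ngx.ctx      -- keep the ctx table alive across the redirect
    ngx.var.ctx_ref = next_ref    -- remember where we put it
end

local function fetch_ctx()
    local ref = tonumber(ngx.var.ctx_ref)
    local ctx = ctxs[ref]
    ctxs[ref] = nil               -- release the slot to avoid leaking tables
    ngx.var.ctx_ref = ""
    return ctx
end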

The restore branch in http_header_filter_phase above is only triggered when the user has configured the error_page directive and an error has caused an internal redirect.

We knew that the user had enabled the error_page directive after the version upgrade, and we had the upstream-clearing errors described above. Everything seemed to be connected. Was it really ngx.ctx that went wrong during the error handling process?

5. Why Does ngx.ctx Have a Bug?

With this question in mind, we examined the code responsible for backing up ngx.ctx and found something even stranger: when set_upstream fails, the request exits early, so ngx.ctx is never backed up and the recovery step is never reached either.

Related Code

This was obviously a bug, since ngx.ctx is supposed to be recovered after the redirect. But then how was ngx.ctx being recovered in this flow at all, and where did it go wrong in the first place? There were too many uncertainties, so we needed to collect more data.

After some discussion, we added a second round of logging in production and discovered that when the error occurred after an internal redirect, ngx.ctx did not go through the recovery process at all; the coredump happened directly!

This was perplexing: if ngx.ctx is not recovered after the redirect, it should simply be empty, and there is code that checks for exactly this empty value, so the crashing plugin code should never have been reached:

local function common_phase(phase_name)
    local api_ctx = ngx.ctx.api_ctx
    if not api_ctx then              <============ HERE
        return
    end

    plugin.run_global_rules(api_ctx, api_ctx.global_rules, phase_name)

So was it really the error_page internal redirect that caused ngx.ctx to go wrong?

6. Initial Conclusion

After further analysis and checking, we had a basic picture of the faulty process.

set_upstream failed and the request was redirected to error_page’s error-handling stage, where ngx.ctx, which should have been empty, unexpectedly held a value. Because of the APISIX bug described above, ngx.ctx was not backed up before the redirect and therefore could not be recovered afterwards, so the ngx.ctx being accessed contained dirty data, which caused the coredump. To solve this, we only needed to make sure ngx.ctx is restored properly after the redirect; the commit with the detailed implementation is referenced at the end of the article.
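
A minimal sketch of that idea, reusing the ctx_ref and from_error_page variables from the http_header_filter_phase snippet shown earlier (the actual commit wires this in differently; fetch_ctx is passed as a parameter here only to keep the sketch self-contained):

-- sketch: run this at the start of every phase that can execute after the
-- error_page internal redirect, so plugins never see a stale or missing ctx
local function restore_ctx_if_redirected(fetch_ctx)
    if ngx.var.ctx_ref ~= "" and ngx.var.from_error_page == "true" then
        ngx.ctx = fetch_ctx()     -- fetch_ctx: the stash helper sketched earlier
    end
end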

Although we had delivered a fix for the bug, we still did not understand its complete logic. Since the user had not modified any ngx.ctx-related logic, the open-source version should have been able to reproduce the problem. To avoid further impact on the user’s production environment and release process, we decided to keep investigating the root cause.

7. Successful Reproduction

With what we knew so far, we were still unable to reproduce the problem. After some analysis, we decided to examine the following two areas:

  • Does internal redirection have any special processing branches?
  • Does ngx.ctx have any special processing branches?

For the first point, each NGINX filter module may contain exception branches that touch the ctx member of the request structure and therefore affect ngx.ctx. We reviewed the modules compiled into the user’s NGINX for such branches but found nothing of the kind.

For the second point, we found after investigation that ngx.ctx can be reused from the SSL handshake: if the HTTP stage cannot obtain its own ngx.ctx, it tries to obtain the one from the SSL stage. The SSL stage has its own ngx.ctx, and reusing it in the HTTP stage can cause problems.

Based on the above discoveries, we designed the following conditions and were able to reproduce the problem successfully:

  1. The same APISIX version as the user’s
  2. SSL enabled, with a certificate configured
  3. An internal error that triggers error_page (such as sending a request when there is no upstream available)
  4. Long (keep-alive) connections
  5. The Prometheus plugin enabled (the user’s coredump occurred in this plugin)

We had finally made bricks without straw: through conjecture and verification, we worked out the complete course of this bug. Once a problem can be reproduced reliably, solving it is no longer difficult, so the final steps are not described in detail here; I believe everyone has plenty of experience with problems that can be reproduced accurately.

Summary

The root cause of the problem: because ngx.ctx is not restored after entering error_page, requests on the same long connection end up reusing the ngx.ctx from the SSL phase. Eventually ngx.ctx carries dirty data, and a coredump is generated.

As a follow-up, we also proposed a corresponding fix for this problem: ensure that ngx.ctx can be restored normally after the internal redirect. Please refer to the PR for details.

If your APISIX deployment uses features that involve internal redirects, such as error_page, please upgrade to the latest version as soon as possible.

Writing programs and debugging them are rigorous, scientific tasks with no room for ambiguity. To solve a problem, we need to collect information, keep track of every potential clue, analyze the call stacks, and then work out the context needed to reproduce the error. Only then is the problem truly solved.
