How to Avoid Cache Stampede?

API7.ai

December 29, 2022

OpenResty (NGINX + Lua)

In the previous article, we learned some high-performance optimization techniques using shared dict and LRU cache. However, we left behind a significant issue that deserves an article of its own today: the "Cache Stampede".

What is a Cache Stampede?

Let us imagine a scenario.

The data source is a MySQL database, the cached data lives in a shared dict, and the cache timeout is 60 seconds. During those 60 seconds, all requests fetch data from the cache rather than from MySQL. But once the 60 seconds pass, the cached data expires. If a large number of concurrent requests arrive at that moment, none of them finds data in the cache, so they all trigger the data source's query function and go straight to the MySQL database, which can cause the database server to block or even go down.

This phenomenon is called a "Cache Stampede", and it is sometimes referred to as "Dog-Piling". None of the cache-related code shown in the previous sections handles it. The following pseudo-code has the potential to trigger a cache stampede:

local value = get_from_cache(key)
if not value then
    value = query_db(sql)
    set_to_cache(key, value, 60)  -- cache for 60 seconds
end
return value

The logic of this pseudo-code looks fine, and you won't trigger a cache stampede with unit tests or end-to-end tests. Only a long stress test will reveal the problem: every 60 seconds, the database sees a regular spike in queries. And if you set a longer cache expiration time, the chance of the cache stampede being detected drops even further.

How to avoid it?

Let's divide the discussion into several different cases.

1. Proactively update the cache

In the above pseudo-code, the cache is updated passively: we only go to the database for new data when a request hits a cache miss. Therefore, changing how the cache is updated from passive to active can bypass the cache stampede problem.

In OpenResty, we can implement it like this.

First, we use ngx.timer.every to create a timer task that runs every minute to fetch the latest data from the MySQL database and put it in the shared dict:

-- note: a timer callback always receives `premature` as its first argument
local function update_cache(premature, sql)
    if premature then
        return
    end
    local value = query_db(sql)      -- fetch the latest data from MySQL
    set_to_cache(key, value, 60)     -- refresh the cache, 60-second timeout
end

local ok, err = ngx.timer.every(60, update_cache, sql)

Then, in the request-handling code, we remove the part that queries MySQL and keep only the part that reads from the shared dict cache.

local value = get_from_cache(key)
return value

The above two pseudo-code snippets help us bypass the cache stampede problem. But this approach is not perfect: each cache needs its own periodic task (and there is an upper limit on the number of timers in OpenResty), and the cache expiration time and the timer period have to match up well. If they don't, requests may keep getting empty data.
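To make this more concrete, here is a minimal sketch that fills in the pseudo helpers with real shared dict calls. It assumes a shared dict declared as lua_shared_dict cache_dict 10m; in nginx.conf, reuses the hypothetical query_db, key, and sql from above, and keeps the cached entry slightly longer than the timer period so requests do not hit the empty-data window just mentioned:

local cache = ngx.shared.cache_dict

local function update_cache(premature, key, sql)
    if premature then
        return
    end
    local value = query_db(sql)          -- hypothetical MySQL query helper
    if value then
        -- keep the entry longer than the 60-second timer period, so readers
        -- never see an expired key between two refresh runs
        cache:set(key, value, 90)
    end
end

-- in init_worker_by_lua*: refresh the cache every 60 seconds
local ok, err = ngx.timer.every(60, update_cache, key, sql)

-- in the request handler: read only from the shared dict, never from MySQL
local value = cache:get(key)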

So, in real projects, we usually use locking to solve the cache stampede problem. Here are a few different locking methods; you can choose one according to your needs.

2. lua-resty-lock

When it comes to adding locks, you may hesitate, thinking that locking is a heavy operation, that a deadlock might occur, and that you will have to handle quite a few exceptions.

We can alleviate this concern by using the lua-resty-lock library in OpenResty to add locks. lua-resty-lock is a resty library that ships with OpenResty; it is based on a shared dict and provides a non-blocking lock API. Let's look at a simple example.

resty --shdict='locks 1m' -e 'local resty_lock = require "resty.lock"
                            local lock, err = resty_lock:new("locks")
                            local elapsed, err = lock:lock("my_key")
                            -- query db and update cache
                            local ok, err = lock:unlock()
                            ngx.say("unlock: ", ok)'

Since lua-resty-lock is implemented with a shared dict, we need to declare the name and size of the shdict first, and then use the new method to create a lock object. In the code snippet above, we only pass the first parameter, the name of the shdict. The new method also takes a second parameter that can specify the lock's expiration time, the timeout for acquiring it, and several other options; here we keep the default values. These parameters are what help you avoid deadlocks and other exceptions.
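For example, here is a minimal sketch of passing that second parameter explicitly; the option names below (exptime and timeout) follow the lua-resty-lock documentation:

local resty_lock = require "resty.lock"

local lock, err = resty_lock:new("locks", {
    exptime = 10,   -- the lock is released automatically after at most 10 seconds,
                    -- so a crashed request cannot cause a permanent deadlock
    timeout = 5,    -- lock() waits at most 5 seconds before giving up
})
if not lock then
    ngx.log(ngx.ERR, "failed to create lock: ", err)
end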

Then we can call the lock method to try to get a lock. If we succeed in acquiring the lock, we can ensure that only one request goes to the data source to update the data at the same moment. But if the locking fails due to preemption, timeout, etc., then the data is fetched from the stale cache and returned to the requestor. This brings us to the get_stale API introduced in the previous lesson.

local elapsed, err = lock:lock("my_key")
-- elapsed being nil means that locking failed; err is one of "timeout", "locked", etc.
if not elapsed and err then
    -- fall back to the stale value instead of hitting the database
    local value = dict:get_stale("my_key")
end

If locking succeeds, it is safe to query the database and update the result into the cache; finally, we call the unlock interface to release the lock.
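Putting these pieces together, a minimal sketch of the whole flow could look like the following. It assumes two shared dicts declared in nginx.conf, cache_dict for the data and locks for lua-resty-lock, plus the hypothetical query_db helper used earlier:

local resty_lock = require "resty.lock"
local cache = ngx.shared.cache_dict

local function get_data(key, sql)
    -- 1. try the cache first
    local value = cache:get(key)
    if value then
        return value
    end

    -- 2. cache miss: make sure only one request goes to the database
    local lock, err = resty_lock:new("locks")
    if not lock then
        return nil, "failed to create lock: " .. err
    end

    local elapsed, err = lock:lock(key)
    if not elapsed then
        -- locking failed (timeout, etc.): serve stale data instead of hitting MySQL
        local stale = cache:get_stale(key)
        return stale
    end

    -- 3. someone may have refreshed the cache while we were waiting for the lock
    value = cache:get(key)
    if not value then
        value = query_db(sql)          -- hypothetical MySQL query helper
        cache:set(key, value, 60)      -- cache for 60 seconds
    end

    local ok, err = lock:unlock()
    if not ok then
        ngx.log(ngx.ERR, "failed to unlock: ", err)
    end

    return value
end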

Combining lua-resty-lock and get_stale gives us a solid solution to the cache stampede problem. The lua-resty-lock documentation provides a complete code example for handling it; if you are interested, you can check it out here.

Let's go deeper and see how the lock method implements locking. When we come across an interesting feature, we naturally want to see how it is implemented in the source code, which is one of the benefits of open source.

local ok, err = dict:add(key, true, exptime)
if ok then
    cdata.key_id = ref_obj(key)
    self.key = key
    return 0
end

As mentioned in the shared dict article, all shared dict APIs are atomic operations, so there is no need to worry about contention. This makes a shared dict a good place to keep the state of locks.

The lock implementation above uses dict:add to try to set the key: if the key does not exist in shared memory, add returns success, which means the lock was acquired; other concurrent requests that reach the same dict:add line get a failure, and the code can then choose, based on the returned err, whether to return directly or to retry several times.
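To see this idea in isolation, here is a minimal sketch that uses add directly on a shared dict as a lock primitive; the dict and key names are just placeholders:

-- assumes `lua_shared_dict locks 1m;` in nginx.conf
local locks = ngx.shared.locks

-- add() only succeeds when the key does not exist yet, and it is atomic,
-- so at most one request can acquire the "lock"
local ok, err = locks:add("my_lock_key", true, 30)   -- auto-expires after 30s
if ok then
    -- we hold the lock: query the database and refresh the cache here
    locks:delete("my_lock_key")                      -- release the lock
else
    -- err is "exists" when another request already holds the lock:
    -- return stale data, retry, or give up
end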

3. lua-resty-shcache

In the above lua-resty-lock based approach, you still need to handle locking, unlocking, fetching stale data, retrying, exception handling, and other issues yourself, which is quite tedious.

Here is a simple wrapper for that: lua-resty-shcache, a lua-resty library from Cloudflare. It adds a layer of encapsulation on top of shared dictionaries and external storage, and provides additional functions for serialization and deserialization, so you don't have to care about the details above:

local shcache = require("shcache")
local cmsgpack = require("cmsgpack")

-- `lookup` is the callback that queries the external storage on a cache miss
local my_cache_table = shcache:new(
        ngx.shared.cache_dict,
        { external_lookup = lookup,
          encode = cmsgpack.pack,
          decode = cmsgpack.unpack,
        },
        { positive_ttl = 10,      -- cache good data for 10s
          negative_ttl = 3,       -- cache failed lookups for 3s
          name = 'my_cache',      -- "named" cache, useful for debug / report
        }
    )

local my_table, from_cache = my_cache_table:load(key)

This sample code is extracted from the official example and hides all of the details, such as the lookup callback. This cache encapsulation library is not the best choice, but it is good learning material for beginners. The next article will introduce a few better and more commonly used encapsulations.
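As a rough illustration of one of those hidden details, the external_lookup callback passed to shcache:new above could look something like the sketch below. The exact callback contract (its arguments and return values) is defined by lua-resty-shcache, so treat this signature as an assumption and check its documentation; query_mysql is a hypothetical helper:

-- defined before the shcache:new() call above
-- assumption: the callback takes no arguments and returns the value on success
local function lookup()
    local row, err = query_mysql()   -- hypothetical function that queries MySQL
    if not row then
        return nil, err              -- a failed lookup is cached for negative_ttl seconds
    end
    return row                       -- good data is cached for positive_ttl seconds
end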

4. NGINX directives

If you are not using OpenResty's lua-resty libraries, you can also rely on NGINX configuration directives for locking and serving stale data: proxy_cache_lock and proxy_cache_use_stale. However, we do not recommend the NGINX directives here, as they are not flexible enough and their performance is not as good as Lua code.
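For reference, a minimal sketch of these two directives in a plain NGINX proxy-cache setup might look like this; the cache path, zone name, and upstream address are placeholders:

proxy_cache_path /tmp/nginx_cache keys_zone=my_cache:10m;

server {
    listen 8080;

    location / {
        proxy_cache my_cache;
        proxy_cache_lock on;                            # only one request fills a missing cache entry
        proxy_cache_use_stale updating error timeout;   # serve stale content in the meantime
        proxy_pass http://127.0.0.1:9080;
    }
}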

Summary

Cache stampedes, like the race conditions we have mentioned repeatedly before, are hard to detect through code review and testing. The best way to solve them is to improve your coding practices or use an encapsulation library.

One last question: How do you handle cache stampede and such in the languages and platforms you are familiar with? Is there a better way than OpenResty? Feel free to share with me in the comments.