Top Tips: Identifying Unique Concepts and Pitfalls in Lua

API7.ai

October 12, 2022

OpenResty (NGINX + Lua)

In the previous article, we learned about the table-related library functions in LuaJIT. In addition to these common functions, today I'll introduce you to some unique or less common Lua concepts and common Lua pitfalls in OpenResty.

Weak Table

First, there is the weak table, a unique concept in Lua that is related to garbage collection. Like other high-level languages, Lua is automatically garbage collected, so you don't have to care about the implementation or trigger GC explicitly. The garbage collector automatically reclaims the space that is no longer referenced.

But collecting only unreferenced objects isn't always enough, and sometimes we need a more flexible mechanism. For example, if we insert a Lua object Foo (table or function) into table tb, this creates a reference to that object Foo. Even if there is no other reference to Foo, the reference to it in tb will always exist, so there is no way for the GC to reclaim the memory occupied by Foo. At this point, we have only two options.

  • One is to release Foo manually.
  • The second is to make it resident in memory.

For example, the following code.

$ resty -e 'local tb = {}
tb[1] = {"red"}
tb[2] = function() print("func") end
print(#tb) -- 2

collectgarbage()
print(#tb) -- 2

table.remove(tb, 1)
print(#tb) -- 1
'

However, I think you don't want to keep memory occupied by objects you don't use, especially since LuaJIT has a 2 GB memory limit. Getting the timing of a manual release right is not easy, and it adds complexity to your code.

Then it's time for the weak table to come into play. Look at its name, weak table. First, it is a table, and then all the elements in this table are weak references. The concept is always abstract, so let's start by looking at a slightly modified piece of code.

$ resty -e 'local tb = {}
tb[1] = {"red"}
tb[2] = function() print("func") end
setmetatable(tb, {__mode = "v"})
print(#tb)  -- 2

collectgarbage()
print(#tb) -- 0
'

As you can see, objects that are not being used are freed. The most important of these is the following line of code.

setmetatable(tb, {__mode = "v"})

Is it déjà vu? Isn't that a metatable operation? Yes, a table is a weak table when its metatable has a __mode field.

  • If the value of __mode is k, the table's key is a weak reference.
  • If the value of __mode is v, then the table's value is a weak reference.
  • Of course, you can also set it to kv, indicating that both the keys and values of this table are weak references.

In any of these three kinds of weak table, the entire key-value entry is removed once its weakly referenced key or value is reclaimed.
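A common use of a table with weak values is a cache that must not keep its entries alive by itself. The following is only a minimal sketch: the helper get_conf and the key "upstream" are hypothetical names invented for illustration, and collectgarbage() is called only to make the effect visible.

```lua
-- Cache whose values are weak references: an entry disappears once
-- nothing else references the cached object.
local cache = setmetatable({}, { __mode = "v" })

-- Hypothetical helper: return the cached object for `name`, building it if needed.
local function get_conf(name)
    local conf = cache[name]
    if conf == nil then
        conf = { name = name }   -- stand-in for an expensive-to-build object
        cache[name] = conf
    end
    return conf
end

local conf = get_conf("upstream")
print(conf == get_conf("upstream")) -- true: `conf` keeps the entry alive

conf = nil                          -- drop the only strong reference
collectgarbage()                    -- forced here for demonstration only
print(cache["upstream"])            -- nil: the entry has been removed
```

The key here is a string, so only the value side needs to be weak.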

In the code example above, __mode is v and tb is an array whose values are a table and a function object. Neither is referenced anywhere else, so both can be reclaimed automatically. However, if you change __mode to k, nothing is freed, because the keys are the integers 1 and 2 rather than collectable objects, and the values are still strongly referenced. Look at the following code.

$ resty -e 'local tb = {}
tb[1] = {"red"}
tb[2] = function() print("func") end
setmetatable(tb, {__mode = "k"})
print(#tb)  -- 2

collectgarbage()
print(#tb) -- 2
'

So far, we have only demonstrated weak tables whose values are weak references, i.e., weak tables of the array type. Naturally, you can also build a weak table of the hash table type by using an object as the key, as follows.

$ resty -e 'local tb = {}
tb[{color = "red"}] = "red"
local fc = function() print("func") end
tb[fc] = "func"
fc = nil

setmetatable(tb, {__mode = "k"})
for k,v in pairs(tb) do
     print(v)
end

collectgarbage()
print("----------")
for k,v in pairs(tb) do
     print(v)
end
'

After collectgarbage() is called manually to force a GC, all the elements in tb are freed. Of course, in real code we don't need to call collectgarbage() manually; it runs automatically in the background, and we don't need to worry about it.

However, since we mentioned the collectgarbage() function, I'll say a few more words about it. This function accepts several different options and defaults to collect, which is a full GC. Another useful one is count, which returns the amount of memory occupied by Lua, in kilobytes. This statistic helps you see whether there is a memory leak and reminds you not to approach the 2 GB upper limit.
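As a rough illustration of the count option, the sketch below samples memory usage before allocating, after allocating, and after releasing a large table. The variable names and the number of elements are arbitrary choices, and the exact figures will differ between runs and Lua builds.

```lua
collectgarbage()                          -- start from a freshly collected heap
local before = collectgarbage("count")    -- memory in use by Lua, in KB

local big = {}
for i = 1, 100000 do
    big[i] = { i }                        -- allocate many small tables
end
local during = collectgarbage("count")

big = nil                                 -- drop the only reference
collectgarbage()                          -- full GC ("collect" is the default option)
local after = collectgarbage("count")

print(string.format("before: %.0f KB, during: %.0f KB, after: %.0f KB",
                    before, during, after))
```

If the last number kept growing across repeated runs of the same workload, that would be a hint of a leak worth investigating.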

Code that uses weak tables is more complicated to write in practice, harder to understand, and correspondingly more prone to hidden bugs. No need to rush. Later, I will introduce a memory leak in an open source project that was caused by the use of weak tables.

Closure and upvalue

Turning to closures and upvalues: as I emphasized earlier, all values are first-class citizens in Lua, functions included. This means that functions can be stored in variables, passed as arguments, and returned as values of another function. For example, this sample code appears in the weak table section above.

tb[2] = function() print("func") end

It is an anonymous function that is stored as the value of a table.

In Lua, the two function definitions in the following code are equivalent for a non-recursive function. Note, however, that the latter explicitly assigns an anonymous function to a variable, a form we often use.

local function foo() print("foo") end
local foo = function() print("foo") end

In addition, Lua supports writing a function inside another function, i.e., nested functions, such as the following example code.

$ resty -e '
local function foo()
     local i = 1
     local function bar()
         i = i + 1
         print(i)
     end
     return bar
end

local fn = foo()
fn() -- 2
'

You can see that the bar function can read the local variable i inside the foo function and modify its value, even if the variable is not defined inside bar. This feature is called lexical scoping.

These features of Lua are the basis for closures. A closure is simply a function that accesses a variable in the lexical scope of another function.
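To make this concrete, here is a small sketch of a counter factory. The name new_counter is made up for illustration. Each call creates a fresh local variable count, and the returned function keeps access to it after new_counter has returned.

```lua
-- Hypothetical factory: every call returns a closure with its own `count`.
local function new_counter()
    local count = 0
    return function()
        count = count + 1
        return count
    end
end

local c1 = new_counter()
local c2 = new_counter()

print(c1()) -- 1
print(c1()) -- 2
print(c2()) -- 1: c2 captured a different `count` than c1
```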

By definition, all functions in Lua are actually closures, even if you don't nest them. This is because the Lua compiler wraps the whole Lua script in another layer, the main function. For example, take the following simple lines of code.

local foo, bar
local function fn()
     foo = 1
     bar = 2
end

Conceptually, after compilation it looks like this.

function main(...)
     local foo, bar
     local function fn()
         foo = 1
         bar = 2
     end
end

And the function fn captures two local variables of the main function, so it is also a closure.

Of course, we know that the concept of closures exists in many languages, and it is not unique to Lua, so you can compare and contrast to get a better understanding. Only when you understand closures can you understand what we're going to say about upvalue.

upvalue is a concept unique to Lua: it is a variable, defined outside a function's own lexical scope, that the closure captures. Let's continue with the code above.

local foo, bar
local function fn()
     foo = 1
     bar = 2
end

You can see that the function fn captures two local variables, foo and bar, that are not in its own lexical scope. These two variables are, in fact, the upvalues of the function fn.
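If you want to see the upvalues of a function for yourself, the standard debug.getupvalue function can list them. The helper upvalue_names below is a hypothetical name used only in this sketch, and it relies on debug information being present, which is the case when running a script normally.

```lua
local foo, bar
local function fn()
    foo = 1
    bar = 2
end

-- Hypothetical helper: collect the names of all upvalues of function `f`.
local function upvalue_names(f)
    local names = {}
    local i = 1
    while true do
        local name = debug.getupvalue(f, i)  -- returns nil past the last upvalue
        if not name then
            break
        end
        names[#names + 1] = name
        i = i + 1
    end
    return names
end

-- Lists the two captured variables, foo and bar.
print(table.concat(upvalue_names(fn), ", "))
```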

Common Pitfalls

After introducing a few concepts in Lua, I'll talk about the Lua-related pitfalls that I encountered in OpenResty development.

In the previous section, we mentioned some of the differences between Lua and other development languages, such as indexes starting at 1 and variables being global by default. In OpenResty's actual code development, we will encounter more Lua and LuaJIT-related problems, and I will talk about some of the more common ones below.

Here's a reminder that even if you know all the pitfalls, you'll inevitably have to fall into them yourself before they really sink in. The difference, of course, is that you'll be able to climb out of the hole and find the crux of the problem much more quickly.

Does the index start at 0 or 1?

The first pitfall is that Lua's indexing starts at 1, as we've mentioned repeatedly before.

But I have to say that this is not the whole truth, because in LuaJIT, arrays created with ffi.new are indexed from 0:

local ffi = require "ffi"
local buf = ffi.new("char[?]", 128)

So, if you want to access the buf cdata in the above code, please remember that the index starts from 0, not 1. Be sure to pay special attention to this place when you use FFI to interact with C.

Regular Pattern Match

The second pitfall is regular expression matching. There are two sets of string matching methods in parallel in OpenResty: Lua's string library and OpenResty's ngx.re.* API.

Lua's pattern matching uses its own unique format, which is written differently from PCRE. Here is a simple example.

$ resty -e 'print(string.match("foo 123 bar", "%d%d%d"))'
123

This code extracts the numeric part from the string, and you'll notice it's completely different from our familiar regular expressions. Lua's pattern matching library is expensive to maintain and performs poorly: the JIT can't optimize it, and patterns that have been compiled once aren't cached.

So, when you use Lua's built-in string library for find, match, etc., don't hesitate to switch to OpenResty's ngx.re if you need something like a regular expression. When looking for a fixed string, consider calling the string library in plain mode.
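Plain mode is enabled by passing true as the fourth argument of string.find, which turns off pattern matching and searches for the literal substring. The sketch below shows the difference. The helper name contains is hypothetical.

```lua
-- Hypothetical helper: literal substring test using plain mode.
local function contains(s, needle)
    -- 3rd argument: start index; 4th argument `true`: no pattern matching
    return string.find(s, needle, 1, true) ~= nil
end

-- In pattern mode, "." matches any character, so this is a hit.
print(string.find("3x50", "3.50"))      -- 1  4

-- In plain mode, the dot is literal, so there is no match.
print(contains("3x50", "3.50"))         -- false
print(contains("price: 3.50", "3.50"))  -- true
```

Plain mode also avoids surprises when the search string comes from user input and happens to contain characters such as %, ( or [.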

Here's a suggestion: In OpenResty, we always prioritize OpenResty's API, then LuaJIT's API, and use Lua libraries with caution.

The JSON encoding does not distinguish between array and dict

The third pitfall is that the JSON encoding does not distinguish between array and dict. Since Lua has only one data structure, the table, there is no way to determine whether an empty table is an array or a dictionary when it is encoded as JSON.

$ resty -e 'local cjson = require "cjson"
local t = {}
print(cjson.encode(t))
'

For example, the above code outputs {}, which shows that OpenResty's cjson library encodes an empty table as a dictionary by default. Of course, we can change this global default by using the encode_empty_table_as_object function.

$ resty -e 'local cjson = require "cjson"
cjson.encode_empty_table_as_object(false)
local t = {}
print(cjson.encode(t))
'

This time, the empty table is encoded as an array [].

However, this global setting has a significant impact, so can we specify the encoding rules for a particular table? The answer is naturally yes, and there are two ways to do it.

The first way is to assign the userdata cjson.empty_array to the specified table so that it will be treated as an empty array when encoded in JSON.

$ resty -e 'local cjson = require "cjson"
local t = cjson.empty_array
print(cjson.encode(t))
'

However, sometimes we are unsure whether the specified table is always empty, and we want to encode it as an array only when it is empty. In that case we use cjson.empty_array_mt, which is our second method.

It will mark the specified table and encode it as an array when the table is empty. As you can see from the name cjson.empty_array_mt, it is set using a metatable, as in the following code operation.

$ resty -e 'local cjson = require "cjson"
local t = {}
setmetatable(t, cjson.empty_array_mt)
print(cjson.encode(t))
t = {123}
print(cjson.encode(t))
'

Limitation on the number of variables

Let's look at the fourth pitfall, the limit on the number of variables. Lua has an upper limit on the number of local variables and the number of upvalues in a function, as you can see from the Lua source code.

/*
@@ LUAI_MAXVARS is the maximum number of local variables per function
@* (must be smaller than 250).
*/
#define LUAI_MAXVARS            200


/*
@@ LUAI_MAXUPVALUES is the maximum number of upvalues per function
@* (must be smaller than 250).
*/
#define LUAI_MAXUPVALUES        60

These two thresholds are hardcoded to 200 and 60, respectively. Although you can manually modify the source code to adjust these two values, both must stay below 250.
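One way to observe the local variable limit without reading the source is to generate a chunk with many local declarations and try to compile it. This is only a sketch: compiles_with_locals is a hypothetical helper, and the numbers 200 and 201 assume the stock limit of 200 shown above. On such a build, the first call should report success and the second should report a compile error.

```lua
-- loadstring exists in Lua 5.1 and LuaJIT; later Lua versions use load instead.
local load_chunk = loadstring or load

-- Hypothetical helper: build a chunk that declares `n` local variables
-- and report whether it compiles (plus the error message, if any).
local function compiles_with_locals(n)
    local lines = {}
    for i = 1, n do
        lines[i] = "local v" .. i .. " = " .. i
    end
    local chunk, err = load_chunk(table.concat(lines, "\n"))
    return chunk ~= nil, err
end

print(compiles_with_locals(200))
print(compiles_with_locals(201))
```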

Generally, we don't exceed these thresholds. Still, when writing OpenResty code, you should be careful not to overuse local variables and upvalues, and should use do ... end where it helps to limit their scope and number.

For example, let's look at the following pseudo-code.

local re_find = ngx.re.find
function foo() ... end
function bar() ... end
function fn() ... end

If only the function foo uses re_find, then we can modify it as follows:

do
    local re_find = ngx.re.find
    function foo() ... end
end
function bar() ... end
function fn() ... end
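Here is a runnable variant of the same idea, as a sketch only. It makes two assumptions that differ from the pseudo-code: string.find stands in for ngx.re.find so that it runs outside OpenResty, and foo is forward-declared as a local instead of being a global function.

```lua
local foo                              -- forward declaration, visible after the block
do
    local str_find = string.find       -- visible only inside this do ... end block
    function foo(s)
        -- str_find is an upvalue of foo only; nothing else can capture it
        return str_find(s, "lua", 1, true) ~= nil
    end
end

print(foo("openresty + lua"))  -- true
print(str_find)                -- nil: the local is out of scope here
```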

Summary

In the spirit of "asking more questions": where does the threshold of 250 in Lua come from? This is our thinking question for today. You are welcome to leave your comments and share this article with your colleagues and friends. We will communicate and improve together.