January 29, 2015 - Tagged as: en, rpython, lua.
I found RPython very interesting for several reasons which I may be talking about later, and I need to use it for a project, so I started running some tutorials. However, I had some concerns about the idea(I still have, and I’ll defer the discussion to some other post for now), and I wanted to experiment with different implementations of same interpreter and compare results.
What I wanted to see is, given a very good and mature JIT compiler(LuaJIT in this case), how hard would it be to have similar optimizations without annotating code manually to give hints to the JIT compiler.
So I implemented a simple BF interpreter in Lua, and started experimenting with different optimizations. As for benchmarking, I used bench
program from this RPython tutorial repository.
Before diving into Lua implementation, here results of running them with RPython compiled interpreter, Python and PyPy:
./example5-rpython bench.b 0.94s user 0.00s system 99% cpu 0.947 total
pypy example5.py bench.b 15.57s user 0.01s system 99% cpu 15.597 total
python example5.py bench.b 597.34s user 0.04s system 99% cpu 9:57.87 total
The Lua implementation started with this:
function mainloop(program, bracket_map, dispatch_fn)
local pc = 1
local tape = {0}
local tape_pos = 1
local code
while pc <= #program do
= program[pc]
code if code == ">" then
= tape_pos + 1
tape_pos if #tape < tape_pos then
table.insert(tape, 0)
end
elseif code == "<" then
= tape_pos - 1
tape_pos elseif code == "+" then
[tape_pos] = tape[tape_pos] + 1
tapeelseif code == "-" then
[tape_pos] = tape[tape_pos] - 1
tapeelseif code == "." then
io.write(string.char(tape[tape_pos]))
elseif code == "[" and tape[tape_pos] == 0 then
= bracket_map[pc]
pc elseif code == "]" and tape[tape_pos] ~= 0 then
= bracket_map[pc]
pc end
= pc + 1
pc end
end
I’m not sure how many reasonable different implementations one can come up with, given that the language is this small. Still, there are some optimizations that we can do and I’ve tried some of them. Here are some things I tried:
I tried replacing one character strings with ASCII code equivalents. Since Lua doesn’t have character constants, I thought this may give us a few instructions per branch. But results were just the same.
I tried replacing table getters and setters with rawget
and rawset
s. Nothing changed. Apparently it’s not worth the effort unless you have a metatable for your table.
I tried generating a huge “if-then-else” statement for bracket_map
, and used it as a jump table kind of thing. Here’s the code:
function gen_dispatch_fn(bracket_map, fun_name)
local first = true
local acc = {}
table.insert(acc, "function " .. fun_name .. "(arg)\n")
for k,v in pairs(bracket_map) do
if first then
table.insert(acc, " if arg == " .. k .. " then\n")
= false
first else
table.insert(acc, " elseif arg == " .. k .. " then\n")
end
table.insert(acc, " return " .. v .. "\n")
end
table.insert(acc, " else\n")
table.insert(acc, " error(\"invalid arg: \" .. arg)\n")
table.insert(acc, " end\n")
table.insert(acc, "end\n")
return table.concat(acc)
end
I loaded this code using standard load()
function. This also didn’t work. The reason is that, even if this is faster(which is probably not always the case), profiling showed that interpreter spents only 4% of the time for bracket_map
lookups. So if this implementation only slightly faster, it just can’t make a big difference.
Profiling output revealed that, 85% of the time spent on fetching the next instruction:
@@ 69 @@
|
| local code
| while pc <= #program do
85% | code = program[pc]
| if code == 62 then
| tape_pos = tape_pos + 1
| if #tape < tape_pos then
@@ 89 @@
| pc = bracket_map[pc]
| elseif code == 93 and tape[tape_pos] ~= 0 then
4% | pc = bracket_map[pc]
| end
| pc = pc + 1
Which really means that you can’t optimize anything, because there’s nothing optimizable in code = program[pc]
, since this is one of the most primitive operations that you can do in this language. (note that we don’t have metamethod assigned to this table, so rawget
is not an optimization)
At this point the Lua results were like this:
luajit example_lua.lua bench.b 34.41s user 0.00s system 99% cpu 34.442 total
The fact that PyPy did better job than LuaJIT here is surprising and impressive. It seems like RPython and PyPy is doing a very good job here.
Since I already started gradually compiling things, I thought why not go further and compile everything. Here’s a simple BF to Lua compiler:
function compile(str)
local pgm = {}
table.insert(pgm, [[
function pgm()
local tape = {0}
local tape_pos = 1
]])
local adv = [[
tape_pos = tape_pos + 1
if #tape < tape_pos then
table.insert(tape, 0)
end
]]
local function dev(i) return " tape_pos = tape_pos - " .. i .. "\n" end
local function inc(i) return " tape[tape_pos] = tape[tape_pos] + " .. i .. "\n" end
local function dec(i) return " tape[tape_pos] = tape[tape_pos] - " .. i .. "\n" end
local out = " io.write(string.char(tape[tape_pos]))\n"
local inp = "" -- no need for this
local jmpF = " while tape[tape_pos] ~= 0 do\n"
local jmpB = " if tape[tape_pos] == 0 then break end end\n"
-- these are used to combine consecutive same instructions
local devs = 0
local incs = 0
local decs = 0
local indent = 0;
for i=1, #str do
local char = string.char(string.byte(str, i))
if devs ~= 0 and char ~= "<" then
table.insert(pgm, indent_lines(indent, dev(devs)))
= 0
devs elseif incs ~= 0 and char ~= "+" then
table.insert(pgm, indent_lines(indent, inc(incs)))
= 0
incs elseif decs ~= 0 and char ~= "-" then
table.insert(pgm, indent_lines(indent, dec(decs)))
= 0
decs end
if char == ">" then -- 62
table.insert(pgm, indent_lines(indent, adv))
elseif char == "<" then -- 60
= devs + 1
devs elseif char == "+" then -- 43
= incs + 1
incs elseif char == "-" then -- 45
= decs + 1
decs elseif char == "." then -- 46
table.insert(pgm, indent_lines(indent, out))
elseif char == "," then -- 44
table.insert(pgm, indent_lines(indent, inp))
elseif char == "[" then -- 91
= indent + 4
indent table.insert(pgm, indent_lines(indent, jmpF))
elseif char == "]" then -- 93
= indent - 4
indent table.insert(pgm, indent_lines(indent, jmpB))
end
end
table.insert(pgm, "end")
return table.concat(pgm)
end
One thing to note here is that loops in BF programs correspond to loops in generated Lua. There’s another way to implement this compiler and it might turn out to be more efficient, but I didn’t try it. (see BF-to-C compiler below) Also, I’m merging some instructions together. This has significant performance impact, but it’s also necessary because if the generated code is too big, both PUC-Lua and LuaJIT is rejecting to load it. (this is documented, but the limit is not specified)
Results:
luajit example_lua.lua bench.b 0.53s user 0.00s system 99% cpu 0.532 total
Note that runtime code generation and loading is NOT included in this number, but code generation takes less than 0.01s, so I might just include that.
Just for completeness, I also tried a C interpreter, and BF-to-C compiler:
./c-int bench.b 2.44s user 0.00s system 99% cpu 2.443 total
./c-compiled 0.00s user 0.00s system 82% cpu 0.004 total
A fun thing about C compiler is that compiling generated C programs takes long time:
gcc -O3 awk_output.c 14.07s user 0.14s system 99% cpu 14.219 total
RPython once again does an impressive job here, because it’s even faster than C interpreter. I didn’t bother profiling C code and optimizing it, because it looks like a reasonable implementation: A simple “fetch instruction and run it in a case statement” loop.
So I think the conclusion is that RPython and PyPy are doing really good job.