osa1

Some benchmarks for meta-tracing BF JIT and traditional BF implementations

January 29, 2015 - Tagged as: en, rpython, lua.

I found RPython very interesting for several reasons which I may be talking about later, and I need to use it for a project, so I started running some tutorials. However, I had some concerns about the idea(I still have, and I’ll defer the discussion to some other post for now), and I wanted to experiment with different implementations of same interpreter and compare results.

What I wanted to see is, given a very good and mature JIT compiler(LuaJIT in this case), how hard would it be to have similar optimizations without annotating code manually to give hints to the JIT compiler.

So I implemented a simple BF interpreter in Lua, and started experimenting with different optimizations. As for benchmarking, I used bench program from this RPython tutorial repository.

Before diving into Lua implementation, here results of running them with RPython compiled interpreter, Python and PyPy:

./example5-rpython bench.b    0.94s user 0.00s system 99% cpu 0.947 total
pypy example5.py   bench.b   15.57s user 0.01s system 99% cpu 15.597 total
python example5.py bench.b  597.34s user 0.04s system 99% cpu 9:57.87 total

The Lua implementation started with this:

function mainloop(program, bracket_map, dispatch_fn)
    local pc = 1
    local tape = {0}
    local tape_pos = 1

    local code
    while pc <= #program do
        code = program[pc]
        if code == ">" then
            tape_pos = tape_pos + 1
            if #tape < tape_pos then
                table.insert(tape, 0)
            end
        elseif code == "<" then
            tape_pos = tape_pos - 1
        elseif code == "+" then
            tape[tape_pos] = tape[tape_pos] + 1
        elseif code == "-" then
            tape[tape_pos] = tape[tape_pos] - 1
        elseif code == "." then
            io.write(string.char(tape[tape_pos]))
        elseif code == "[" and tape[tape_pos] == 0 then
            pc = bracket_map[pc]
        elseif code == "]" and tape[tape_pos] ~= 0 then
            pc = bracket_map[pc]
        end
        pc = pc + 1
    end
end

I’m not sure how many reasonable different implementations one can come up with, given that the language is this small. Still, there are some optimizations that we can do and I’ve tried some of them. Here are some things I tried:

I tried replacing one character strings with ASCII code equivalents. Since Lua doesn’t have character constants, I thought this may give us a few instructions per branch. But results were just the same.
I tried replacing table getters and setters with rawget and rawsets. Nothing changed. Apparently it’s not worth the effort unless you have a metatable for your table.

I tried generating a huge “if-then-else” statement for bracket_map, and used it as a jump table kind of thing. Here’s the code:

function gen_dispatch_fn(bracket_map, fun_name)
    local first = true
    local acc = {}
    table.insert(acc, "function " .. fun_name .. "(arg)\n")
    for k,v in pairs(bracket_map) do
        if first then
            table.insert(acc, "    if arg == " .. k .. " then\n")
            first = false
        else
            table.insert(acc, "    elseif arg == " .. k .. " then\n")
        end
        table.insert(acc, "        return " .. v .. "\n")
    end
    table.insert(acc, "    else\n")
    table.insert(acc, "        error(\"invalid arg: \" .. arg)\n")
    table.insert(acc, "    end\n")
    table.insert(acc, "end\n")
    return table.concat(acc)
end

I loaded this code using standard load() function. This also didn’t work. The reason is that, even if this is faster(which is probably not always the case), profiling showed that interpreter spents only 4% of the time for bracket_map lookups. So if this implementation only slightly faster, it just can’t make a big difference.

Profiling output revealed that, 85% of the time spent on fetching the next instruction:

@@ 69 @@
      |
      |     local code
      |     while pc <= #program do
  85% |         code = program[pc]
      |         if code == 62 then
      |             tape_pos = tape_pos + 1
      |             if #tape < tape_pos then
@@ 89 @@
      |             pc = bracket_map[pc]
      |         elseif code == 93 and tape[tape_pos] ~= 0 then
   4% |             pc = bracket_map[pc]
      |         end
      |         pc = pc + 1

Which really means that you can’t optimize anything, because there’s nothing optimizable in code = program[pc], since this is one of the most primitive operations that you can do in this language. (note that we don’t have metamethod assigned to this table, so rawget is not an optimization)

At this point the Lua results were like this:

luajit example_lua.lua bench.b  34.41s user 0.00s system 99% cpu 34.442 total

The fact that PyPy did better job than LuaJIT here is surprising and impressive. It seems like RPython and PyPy is doing a very good job here.

Since I already started gradually compiling things, I thought why not go further and compile everything. Here’s a simple BF to Lua compiler:

function compile(str)
    local pgm = {}
    table.insert(pgm, [[
function pgm()
    local tape = {0}
    local tape_pos = 1
]])

    local adv = [[
    tape_pos = tape_pos + 1
    if #tape < tape_pos then
        table.insert(tape, 0)
    end
]]
    local function dev(i) return "    tape_pos = tape_pos - " .. i .. "\n" end
    local function inc(i) return "    tape[tape_pos] = tape[tape_pos] + " .. i .. "\n" end
    local function dec(i) return "    tape[tape_pos] = tape[tape_pos] - " .. i .. "\n" end
    local out  = "    io.write(string.char(tape[tape_pos]))\n"
    local inp  = "" -- no need for this
    local jmpF = "    while tape[tape_pos] ~= 0 do\n"
    local jmpB = "    if tape[tape_pos] == 0 then break end end\n"

    -- these are used to combine consecutive same instructions
    local devs = 0
    local incs = 0
    local decs = 0

    local indent = 0;

    for i=1, #str do
        local char = string.char(string.byte(str, i))

        if devs ~= 0 and char ~= "<" then
            table.insert(pgm, indent_lines(indent, dev(devs)))
            devs = 0
        elseif incs ~= 0 and char ~= "+" then
            table.insert(pgm, indent_lines(indent, inc(incs)))
            incs = 0
        elseif decs ~= 0 and char ~= "-" then
            table.insert(pgm, indent_lines(indent, dec(decs)))
            decs = 0
        end

        if char == ">" then -- 62
            table.insert(pgm, indent_lines(indent, adv))
        elseif char == "<" then -- 60
            devs = devs + 1
        elseif char == "+" then -- 43
            incs = incs + 1
        elseif char == "-" then -- 45
            decs = decs + 1
        elseif char == "." then -- 46
            table.insert(pgm, indent_lines(indent, out))
        elseif char == "," then -- 44
            table.insert(pgm, indent_lines(indent, inp))
        elseif char == "[" then -- 91
            indent = indent + 4
            table.insert(pgm, indent_lines(indent, jmpF))
        elseif char == "]" then -- 93
            indent = indent - 4
            table.insert(pgm, indent_lines(indent, jmpB))
        end
    end

    table.insert(pgm, "end")
    return table.concat(pgm)
end

One thing to note here is that loops in BF programs correspond to loops in generated Lua. There’s another way to implement this compiler and it might turn out to be more efficient, but I didn’t try it. (see BF-to-C compiler below) Also, I’m merging some instructions together. This has significant performance impact, but it’s also necessary because if the generated code is too big, both PUC-Lua and LuaJIT is rejecting to load it. (this is documented, but the limit is not specified)

Results:

luajit example_lua.lua bench.b  0.53s user 0.00s system 99% cpu 0.532 total

Note that runtime code generation and loading is NOT included in this number, but code generation takes less than 0.01s, so I might just include that.

Just for completeness, I also tried a C interpreter, and BF-to-C compiler:

./c-int bench.b  2.44s user 0.00s system 99% cpu 2.443 total
./c-compiled     0.00s user 0.00s system 82% cpu 0.004 total

A fun thing about C compiler is that compiling generated C programs takes long time:

gcc -O3 awk_output.c  14.07s user 0.14s system 99% cpu 14.219 total

RPython once again does an impressive job here, because it’s even faster than C interpreter. I didn’t bother profiling C code and optimizing it, because it looks like a reasonable implementation: A simple “fetch instruction and run it in a case statement” loop.

So I think the conclusion is that RPython and PyPy are doing really good job.

(Show comments)