
Three runtime optimizations done by GHC's GC

March 16, 2018 - Tagged as: en, haskell, ghc.

While working on GHC’s GC code I realized that it performs some optimizations at runtime. One of those I already knew from another language, but the other two were quite interesting to me because they’re related to laziness. I wouldn’t have thought the consequences of laziness reached this far into the runtime system, but it turns out they do: disabling these optimizations makes programs run significantly slower.

Because I read almost all of the GC code line by line, I believe this list is exhaustive. The code below is taken from the source but significantly simplified.

If you’re not familiar with GHC’s heap object layout, info tables, etc., I suggest reading the wiki page before moving on to the rest of the post.

1. Replacing small Int and Char closures with statically initialized, shared closures

Int and Char closures have one non-pointer field for the actual integer and character values, as can be seen in GHCi:

λ> :info Int
data Int = GHC.Types.I# GHC.Prim.Int#     -- Defined in ‘GHC.Types’
λ> :info Char
data Char = GHC.Types.C# GHC.Prim.Char#   -- Defined in ‘GHC.Types’

The corresponding closure type for constructors with one non-pointer field and no pointer fields is CONSTR_0_1. The garbage collector needs to check the closure type before copying an object, to decide how many bytes to copy (and also to decide which pointers to follow and which pointed-to objects to copy, but that happens in a later stage). When it finds a CONSTR_0_1 it checks whether it’s actually an Int or Char closure, and if so, whether the payload (the actual Int or Char value) is within a certain range. If it is, then there is a statically allocated Int or Char closure identical to the one we’re copying, so we return the address of the statically allocated closure rather than copying the object and returning the new address. This way we avoid having multiple closures for 1 :: Int, for example. The code (simplified, some comments added by me):

case CONSTR_0_1:
{
    // Constructor with one non-pointer field. Read the field.
    StgWord w = (StgWord)q->payload[0];

    if (// is it a Char?
        info == Czh_con_info &&
        // is the value in range?
        (StgChar)w <= MAX_CHARLIKE)
    {
        // return address to statically allocated Char closure
        *p =  TAG_CLOSURE(tag, (StgClosure *)CHARLIKE_CLOSURE((StgChar)w));
    }
    else if (// is it an Int?
             info == Izh_con_info &&
             // is the value in range?
             (StgInt)w >= MIN_INTLIKE && (StgInt)w <= MAX_INTLIKE)
    {
        // return address to statically allocated Int closure
        *p = TAG_CLOSURE(tag, (StgClosure *)INTLIKE_CLOSURE((StgInt)w));
    }
    // otherwise copy the object
    else
    {
        copy_tag_nolock(p,info,q,sizeofW(StgHeader)+1,gen_no,tag);
    }
    return;
}

What are the ranges here? Looking at the definitions, integers in the range [-16, 16] and the whole ASCII character set are covered: the INTLIKE_CLOSURE and CHARLIKE_CLOSURE macros simply index into statically allocated arrays of these closures.

Here’s a small program that shows the effect of this optimization:

module Main where

import GHC.Stats
import System.Mem (performMajorGC)

seqIntList :: [Int] -> a -> a
seqIntList []       a = a
seqIntList (i : is) a = i `seq` is `seqIntList` a

main :: IO ()
main = do
    let lst = [ 0 .. 15 ]
    -- let lst = [ 17 .. 32 ] -- enable this on the second run

    -- evaluate the list
    lst `seqIntList` return ()

    -- collect any thunks, do the optimization if possible, update stats
    performMajorGC

    rts_stats <- getRTSStats
    putStrLn ("Live data: " ++ show (gcdetails_live_bytes (gc rts_stats)) ++ " bytes")

    -- to make sure our list won't be collected
    lst `seqIntList` return ()

Run it with:

ghc eq.hs -rtsopts -O0 && ./eq +RTS -T

On the second run, disable the first list and enable the second one. You’ll see this output:

$ ghc eq.hs -rtsopts -O0 && ./eq +RTS -T
[1 of 1] Compiling Main             ( eq.hs, eq.o )
Linking eq ...
Live data: 2224 bytes

$ ghc eq.hs -rtsopts -O0 && ./eq +RTS -T
[1 of 1] Compiling Main             ( eq.hs, eq.o )
Linking eq ...
Live data: 2480 bytes

So the second program has 256 bytes more live data. Let’s check that this makes sense. The first program doesn’t have any heap-allocated Int closures, because all of the Ints in the program are within the range of the statically allocated Int closures. The second one has 16 heap-allocated Int closures. An Int closure is two words: a pointer to the I# info table and the actual integer value in the payload, so on a 64-bit machine that’s 16 bytes. 16 (number of Int closures) * 16 (Int closure size in bytes) = 256.

At least one other language, Python, does this as well:

Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> x = 1
>>> y = 1
>>> x is y
True
>>> x = 100000000000
>>> y = 100000000000
>>> x is y
False

Although I’m not sure if it does this during garbage collection. (CPython actually interns small integers, those in the range [-5, 256], at allocation time rather than during GC.)
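We can observe this sharing from Haskell too, using GHC’s pointer equality primitive. The program below is only a sketch of mine, not something from the GC code: the same helper exists just for this demonstration, and the exact behaviour can vary with GHC version and optimization level (compile with -O0).

{-# LANGUAGE MagicHash #-}

module Main where

import GHC.Exts (isTrue#, reallyUnsafePtrEquality#)
import System.Mem (performMajorGC)

-- Are two values represented by the same heap object?
-- (Unsafe; for demonstration only.)
same :: a -> a -> Bool
same x y = isTrue# (reallyUnsafePtrEquality# x y)

main :: IO ()
main = do
    -- Read the number at runtime so that the sums below are really
    -- allocated on the heap instead of being shared at compile time.
    -- Enter a small number, e.g. 5.
    n <- readLn :: IO Int
    let a = n + 1
        b = n + 1
    a `seq` b `seq` return ()  -- evaluate both thunks
    print (same a b)           -- expect False: two distinct closures
    performMajorGC             -- GC may replace both with the static closure
    print (same a b)           -- expect True, if n + 1 is in [-16, 16]

Before the major GC the two sums live in two distinct heap closures; after it, both references should point at the same statically allocated Int closure.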

2. Shorting out indirections

This is related to how lazy evaluation is implemented, so we’ll first take a look at the generated code for a simple thunk update. We compile the following program (with optimizations disabled, to keep things simple):

fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)

main :: IO ()
main = do
    i <- readLn
    print (fib i)

at the STG level we get this function:

sat_s31Q =
    \r [i_s31O]
        let { sat_s31P = \u [] fib_rqh i_s31O;
        } in  print $fShowInt sat_s31P;

Here fib_rqh is the fib function, and sat_s31P is the thunk for fib i. First let’s take a look at how this thunk is evaluated at the use site (Cmm syntax):

I64[Sp - 16] = stg_upd_frame_info;
P64[Sp - 8] = _s31P::P64;
_s31O::P64 = P64[_s31P::P64 + 16];
R2 = _s31O::P64;
Sp = Sp - 16;
call fib_rqh_info(R2) args: 24, res: 0, upd: 24;

So we push an update frame, consisting of the return address stg_upd_frame_info and the thunk to update (_s31P), onto the stack, and jump to the code for the fib function, passing the argument in R2.

I won’t show the code for fib (it’s large and complex), but it puts the return value in R1, pops the stack frame, and jumps to the code for the popped frame, which is stg_upd_frame_info.

At this point we have the return value of fib in R1, and the thunk to update in the update frame on top of the stack.

The code for stg_upd_frame_info is as follows: (simplified, see the original version here)

INFO_TABLE_RET ( stg_upd_frame, // label
                 UPDATE_FRAME,  // frame type
                 w_ info_ptr,   // info ptr
                 p_ updatee )   // thunk to update, stored in the update frame
    return (P_ ret) // in R1 we expect the value to update the thunk with
{
    StgInd_indirectee(updatee) = ret;       // (1)
    SET_INFO(updatee, stg_BLACKHOLE_info);  // (2)
    ...
    return (ret);
}

This replaces the thunk’s (_s31P) info table pointer with stg_BLACKHOLE_info in line (2), effectively turning the thunk into an indirection, and writes a pointer to the evaluated object into the thunk’s payload in line (1).

Now any code that uses this value needs to follow the pointer written in line (1) to what was originally a thunk. This is done by the entry code of stg_BLACKHOLE_info.
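To make this concrete, here’s a toy model in Haskell. This is only a sketch of mine, nothing like the RTS’s actual representation: a closure is a mutable cell that starts out as a thunk and is overwritten with an indirection when first entered.

import Data.IORef

-- A closure is either an unevaluated computation or, after the
-- update, an indirection to its value.
data Contents a = Thunk (IO a) | Ind a

newtype Closure a = Closure (IORef (Contents a))

mkThunk :: IO a -> IO (Closure a)
mkThunk m = Closure <$> newIORef (Thunk m)

-- Entering a closure: evaluate a thunk and overwrite it with an
-- indirection (lines (1) and (2) above), or follow an existing
-- indirection (what the entry code of stg_BLACKHOLE_info does).
enter :: Closure a -> IO a
enter (Closure ref) = do
    contents <- readIORef ref
    case contents of
        Ind v   -> return v        -- follow the indirection
        Thunk m -> do
            v <- m                 -- evaluate the thunk
            writeIORef ref (Ind v) -- update it in place
            return v

After the update, every later enter of the same closure goes through the Ind case: one extra pointer to follow on every use. That extra hop is what the GC removes.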

Now, because the GC already copies objects from one heap to another and updates any references to the moved objects in thread stacks (and in other roots), it can follow indirections while copying blackhole objects, and replace references to the blackhole with references to the object the blackhole points to. The code:

case BLACKHOLE:
{
    StgClosure *r;
    const StgInfoTable *i;
    r = ((StgInd*)q)->indirectee;
    if (GET_CLOSURE_TAG(r) == 0) {
        i = r->header.info;
        if (IS_FORWARDING_PTR(i)) {
            r = (StgClosure *)UN_FORWARDING_PTR(i);
            i = r->header.info;
        }
        if (i == &stg_TSO_info
            || i == &stg_WHITEHOLE_info
            || i == &stg_BLOCKING_QUEUE_CLEAN_info
            || i == &stg_BLOCKING_QUEUE_DIRTY_info) {
            copy(p,info,q,sizeofW(StgInd),gen_no);
            return;
        }
        ASSERT(i != &stg_IND_info);
    }
    q = r;
    *p = r;
    goto loop;
}

I don’t understand all the details in this code, but I think the important bits are the q->indirectee line, which follows the pointer written in line (1) above, and the goto loop, which makes the garbage collector copy and return the object pointed to by the blackhole.

After this we no longer have to follow a pointer to reach the evaluated thunk: references to the thunk become references to the evaluated object.
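In the same toy spirit as before, here’s a sketch (mine, heavily simplified) of what evacuation does with indirections: instead of copying an indirection, it returns the object the indirection points to, so every reference updated during the GC skips the indirection entirely.

-- A toy heap object, just enough to model indirections.
data Obj = Value Int  -- an evaluated object
         | Ind Obj    -- a blackhole/indirection left by a thunk update

-- Evacuating a reference: follow chains of indirections (the
-- "q = r; goto loop" part above) and return the final object.
-- The actual copying into to-space is elided.
evacuate :: Obj -> Obj
evacuate (Ind o) = evacuate o
evacuate o       = o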

3. Selector thunk evaluation

A selector thunk is a thunk of this form: (code)

case x of
  C x1 ... xn -> xm

where 1 <= m <= n, and x is a variable. The problem with such a thunk is that it keeps all of the fields of x live until the selector thunk is evaluated, even when x itself has already been evaluated by some other code. As an example of where this happens, suppose we have this record:

data R = R { _i1 :: Int, _i2 :: Int, ... other fields ... }

then a function that takes an R as a parameter and uses two of its fields:

f :: R -> Int
f r = _i1 r + _i2 r

Here _i1 r and _i2 r are selector thunks. Now suppose that the argument to this function was already evaluated before the function is called. In this case the thunk that holds this function application keeps all of r live, even though only _i1 and _i2 are needed.

It turns out this problem has been known since around 1981. To my knowledge, Wadler was the first to suggest solving these kinds of “leaks” in the garbage collector [1] (see [2] for a correction). The idea is that while the GC copies one of these thunks it checks whether the “selectee” is already evaluated. If it is, the GC evaluates the selector thunk during copying, and copies the evaluated form instead. Because selector thunks are so simple (their exact shape is well specified, and they can’t do anything other than access a field), evaluating one is just a matter of indexing into the selectee’s payload. The function that does this is here. The whole story is complicated by concurrency concerns (e.g. another GC thread may evaluate the same thunk at the same time), but the actual optimization starts around line 1104, by looking at the selectee’s info table. If it’s a constructor, we select the field and return it; otherwise it’s a thunk, and we copy it as usual.
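Extending the earlier toy model (again just a sketch of mine, with the concurrency story and the actual copying left out): when the GC evacuates a selector thunk whose selectee is already a constructor, it selects the field on the spot instead of copying the thunk.

-- Toy heap objects: constructors, selector thunks, and indirections.
data Obj = Con [Obj]    -- an evaluated constructor and its fields
         | Sel Int Obj  -- a selector thunk: 0-based field index, selectee
         | Ind Obj      -- an indirection, as before

evacuate :: Obj -> Obj
evacuate (Ind o)   = evacuate o
evacuate (Sel i o) =
    case evacuate o of
        Con fields -> evacuate (fields !! i) -- selectee is evaluated:
                                             -- select the field during GC
        o'         -> Sel i o'               -- otherwise keep the thunk
evacuate o = o  -- constructors are copied as usual (copying elided)

This is also why the optimization is safe to do inside the GC: a selector thunk’s shape guarantees that evaluating it can’t allocate or throw; it’s essentially a single field access.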

Conclusion

In each cycle a copying garbage collector copies the live data in one heap to another heap and abandons the old one. It turns out this kind of garbage collection is really convenient for implementing the optimizations described above: the code that traverses all live data, copies it, and updates the roots is already there, and doing updates on objects while copying is just a matter of adding a few more lines to the copying functions.

In a non-copying collector this is much trickier, because the collector doesn’t normally update the roots or the data it traverses. For example, to implement optimization (2) in a mark-sweep collector we would have to somehow keep track of the location where we found the pointer to the object we’re currently marking. Then, if the object turned out to be an indirection, we would update that source location, and we should not mark the indirection object itself, because some other object may still have a reference to it, and that reference has to be updated too. In short, it’s certainly possible, but much trickier: the mark phase gets more complicated.
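To illustrate, here’s a sketch of mine of what that might look like (deliberately simplified: checking mark bits is elided, so this version doesn’t terminate on cyclic heaps). The marking function receives the slot that holds the reference, not just the reference, so it can rewrite the slot when it hits an indirection.

import Data.IORef

type Ptr  = IORef Obj  -- a heap address
type Slot = IORef Ptr  -- a location that stores a heap address

data Obj = Value Int [Slot]  -- an object with some fields
         | Ind Ptr           -- an indirection

-- Mark through a slot. setMarked stands in for whatever mark-bit
-- machinery a real collector has.
mark :: (Ptr -> IO ()) -> Slot -> IO ()
mark setMarked slot = do
    ptr <- readIORef slot
    obj <- readIORef ptr
    case obj of
        Ind target -> do
            writeIORef slot target -- short out: update the source slot,
                                   -- leave the Ind unmarked so that
                                   -- sweep reclaims it
            mark setMarked slot
        Value _ fields -> do
            setMarked ptr          -- mark the object itself
            mapM_ (mark setMarked) fields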

In summary: GHC’s copying GC (1) replaces small Int and Char closures with shared statically allocated ones, (2) shorts out the indirections left behind by thunk updates, and (3) evaluates selector thunks whose selectees are already evaluated, all as part of the copying it already has to do.

This post was submitted to /r/haskell.


  1. “Fixing some space leaks with a garbage collector”, Wadler, 1987.

  2. Correction: Wadler was the first to write about this, in 1987, but Lennart Augustsson came up with the solution around 1981, and according to him David Turner came up with it even before that.