April 28, 2018 - Tagged as: en, haskell, ghc.
I recently spent some time debugging GHC bug #15038 and to help with context switching between debugging and other tasks I took notes during debugging. After successfully debugging it and reading my notes from the beginning I thought it may be an interesting read, so I’m publishing it. It may not make much sense to anyone other than me, and it’s very lightly edited, so keep your expectations low :)
Crashes at this point:
>>> bt
#11 0x00000000008d6650 in
barf (s=0x94fe90 "evacuate(static): strange closure type %d") at rts/RtsMessages.c:47
#12 0x00000000008e7e83 in
evacuate (p=0x981640) at rts/sm/Evac.c:574
#13 0x0000000000911069 in
scavenge_static () at rts/sm/Scav.c:1831
#14 0x0000000000911638 in
scavenge_loop () at rts/sm/Scav.c:2185
#15 0x00000000008e38fb in
scavenge_until_all_done () at rts/sm/GC.c:1092
#16 0x00000000008e2558 in
GarbageCollect (collect_gen=1, do_heap_census=false, gc_type=0,
cap=0xad40c0 <MainCapability>, idle_cap=0x0) at rts/sm/GC.c:418
#17 0x00000000008d523b in
scheduleDoGC (pcap=0x7ffeced3a4a0, task=0x2620cc0, force_major=false) at rts/Schedule.c:1799
#18 0x00000000008d4787 in
schedule (initialCapability=0xad40c0 <MainCapability>, task=0x2620cc0) at rts/Schedule.c:545
#19 0x00000000008d5be1 in
scheduleWaitThread (tso=0x4200105388, ret=0x0, pcap=0x7ffeced3a590) at rts/Schedule.c:2533
#20 0x00000000008d7dd7 in
rts_evalLazyIO (cap=0x7ffeced3a590, p=0x981730, ret=0x0) at rts/RtsAPI.c:530
#21 0x00000000008d84ca in
hs_main (argc=1, argv=0x7ffeced3a788, main_closure=0x981730, rts_config=...) at rts/RtsMain.c:72
#22 0x000000000040d7da in
main ()
While evacuating object at location
p = (StgClosure **) 0x981640
Info table of the object:
>>> print *get_itbl(q)
$5 = {
layout = {
payload = {
ptrs = 0,
nptrs = 134217728
},
bitmap = 576460752303423488,
large_bitmap_offset = 0,
__pad_large_bitmap_offset = 0,
selector_offset = 576460752303423488
},
type = 16585,
srt_bitmap = 419430400,
code = 0x981619 "\026\230"
}
Untagged object:
q = (StgClosure *) 0x981618
Tagged:
*p = (StgClosure *) 0x981619
Where does this object appear? Tagged:
>>> find 0x4200000000, 0x14200000000, (void*)0x981619
(nothing)
Untagged:
>>> find 0x4200000000, 0x14200000000, (void*)0x981618
(nothing)
Not in the heap!
>>> info symbol 0x981618
rctv_closure in section .data of /home/omer/.local/share/rr/Main-31/mmap_hardlink_3_Main
According to STG output references should be
lvl4_rctv referenced by
lvl5_rctw referenced by
lvl6_rctx referenced by
$wbyteParserBadOnce
Looking at each closure:
rctv :: ([Char], Packed.Bytes.Parser.Parser Word)
>>> print (StgClosure)rctv_closure (0x981618)
{
header = {
info = 0x8c2960 <ghczmprim_GHCziTuple_Z2T_con_info>
},
payload = [0x0, 0x0] -- not evaluated yet?
}
>>> print *get_itbl(&rctv_closure)
{
layout = {
payload = {
ptrs = 2,
nptrs = 0
},
bitmap = 2,
large_bitmap_offset = 2,
__pad_large_bitmap_offset = 2,
selector_offset = 2
},
type = 4, -- CONSTR_2_0
srt_bitmap = 0,
code = 0x8c2960 <ghczmprim_GHCziTuple_Z2T_con_info> "H\377\303\377e"
}
rctw :: [([Char], Packed.Bytes.Parser.Parser Word)]
>>> print (StgClosure)rctw_closure (0x981638)
{
header = {
info = 0x8c3b80 <ghczmprim_GHCziTypes_ZC_con_info>
},
payload = [0x0, 0x0] -- not evaluated yet?
}
>>> print *get_itbl(&rctw_closure)
{
layout = {
payload = {
ptrs = 2,
nptrs = 0
},
bitmap = 2,
large_bitmap_offset = 2,
__pad_large_bitmap_offset = 2,
selector_offset = 2
},
type = 4, -- CONSTR_2_0
srt_bitmap = 1,
code = 0x8c3b80 <ghczmprim_GHCziTypes_ZC_con_info> "H\203\303\002\377e"
}
rctx :: Packed.Bytes.Parser.Parser Word
>>> print (StgClosure)rctx_closure (0x981658)
{
header = {
info = 0x40ca10 <rctx_info>
-- Packed.Bytes.Parser.Parser GHC.Types.Word
},
payload = []
}
>>> print *get_itbl(&rctx_closure)
{
layout = {
payload = {
ptrs = 0,
nptrs = 0
},
bitmap = 0,
large_bitmap_offset = 0,
__pad_large_bitmap_offset = 0,
selector_offset = 0
},
type = 21, -- THUNK_STATIC
srt_bitmap = 3,
code = 0x40ca10 <rctx_info> "H\215E\350L9\370r@H\203\354\bL\211\350H\211\336H\211\307\061\300\350S\314M"
}
$wbyteParserBadOnce :: Int -> Int#
>>> print (StgClosure)Parser_zdwbyteParserBadOnce_closure (0x981680)
{
header = {
info = 0x40d0b0 <Parser_zdwbyteParserBadOnce_info>
},
payload = 0x8
}
>>> print *get_itbl(&Parser_zdwbyteParserBadOnce_closure)
{
layout = {
payload = {
ptrs = 0,
nptrs = 0
},
bitmap = 0,
large_bitmap_offset = 0,
__pad_large_bitmap_offset = 0,
selector_offset = 0
},
type = 14, -- FUN_STATIC
srt_bitmap = 113,
code = 0x40d0b0 <Parser_zdwbyteParserBadOnce_info> "H\215E\320L9\370\017\202\342\001"
}
Back to the backtrace. Adding watchpoints makes it run too slow, so first figure out when this happens:
>>> break GarbageCollect
Breakpoint 2 at 0x8e2144: file rts/sm/GC.c, line 226.
>>> ignore 2 10000000
Will ignore next 10000000 crossings of breakpoint 2.
>>> c
>>> info breakpoints
Num Type Disp Enb Address What
2 breakpoint keep y 0x00000000008e2144 in GarbageCollect at rts/sm/GC.c:226
breakpoint already hit 6 times
ignore next 9999994 hits
So 6th time we run GC we see this error. Let’s see if 0x981640 is a valid closure by the beginning of 6th GC:
>>> call LOOKS_LIKE_CLOSURE_PTR(0x981640)
$5 = true
However at this point it already has wrong type:
>>> print *get_itbl((StgClosure*)0x981640)
$4 = {
layout = {
payload = {
ptrs = 0,
nptrs = 134217728
},
bitmap = 576460752303423488,
large_bitmap_offset = 0,
__pad_large_bitmap_offset = 0,
selector_offset = 576460752303423488
},
type = 16585,
srt_bitmap = 419430400,
code = 0x981619 "\026\230"
}
In fact, this object has this type since the beginning.
>>> info symbol 0x981640
rctw_closure + 8 in section .data of /home/omer/.local/share/rr/Main-31/mmap_hardlink_3_Main
This object is in payload[0]
of rctw
?
lvl5_rctw
:: [(GHC.Base.String, Packed.Bytes.Parser.Parser GHC.Types.Word)]
[GblId, Str=m2, Unf=OtherCon []] =
CCS_DONT_CARE :! [lvl4_rctv GHC.Types.[]];
So it should be rctv
. Something is wrong with first payload of rctw
. We should focus on evacuation/scavenging of rctw
.
>>> break evacuate_static_object if q == 0x981638
>>> c
** hits breakpoint **
>>> call printClosure(q)
ghc-prim:GHC.Types.:(0x981619, 0xace111)
>>> info symbol 0x981619
rctv_closure + 1 in section .data of /home/omer/.local/share/rr/Main-31/mmap_hardlink_3_Main
So indeed we have rctv_closure
in first payload (but tagged).
>>> call printClosure(0x981619)
ghc-prim:GHC.Tuple.(,)(0x9815f0, 0x981616)
>>> call printClosure(0xace111)
ghc-prim:GHC.Types.[](0x8c3450#)
It looks OK. Let’s see if it breaks after this GC.
>>> call printClosure(0x981638)
ghc-prim:GHC.Types.:(0x981619, 0xace111)
Nope. Continue until first payload of 0x981638 (rctw
) changes.
It turns out the payload doesn’t change. So rctv
itself changes!
On 4th GarbageCollect()
this fails:
>>> call printClosure(0x981619)
*** printClosure: unknown type 16585 ****
Main: internal error: printClosure 16585
(GHC version 8.5.20180425 for x86_64_unknown_linux)
Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug
So break at 3rd:
>>> break GarbageCollect
Breakpoint 1 at 0x8e2144: file rts/sm/GC.c, line 226.
>>> ignore 1 2
Will ignore next 2 crossings of breakpoint 1.
>>> c
>>> call printClosure(0x981619)
ghc-prim:GHC.Tuple.(,)(0x9815f0, 0x981616)
(Remember that 0x981619 is tagged 0x981618 which is rctv
)
Now watch the closure:
>>> watch (StgClosure)rctv_closure
Watchpoint 2: (StgClosure)rctv_closure
>>> disable breakpoint 1
>>> c
This takes forever, gdb starts leaking memory, fills 16G and swaps.
Let’s take a look at static objects again. rctv
’s info table pointer somehow gets overwritten. What could cause this? Perhaps an object before rctv
has an overlapping field.
>>> info symbol 0x981617
rctu_closure + 7 in section .data of /home/omer/.local/share/rr/Main-31/mmap_hardlink_3_Main
It seems like we have rctu
before rctv
.
lvl3_rctu
:: forall s.
Packed.Bytes.Parser.Maybe# (Packed.Bytes.Parser.Leftovers# s)
-> GHC.Prim.State# s
-> (# GHC.Prim.State# s,
Packed.Bytes.Parser.Result# s GHC.Types.Word #)
>>> print (StgClosure)rctu_closure
{
header = {
info = 0x40c908 <rctu_info>
},
payload = 0x8
}
>>> print *get_itbl(&rctu_closure)
{
layout = {
payload = {
ptrs = 0,
nptrs = 0
},
bitmap = 0,
large_bitmap_offset = 0,
__pad_large_bitmap_offset = 0,
selector_offset = 0
},
type = 14, -- FUN_STATIC
srt_bitmap = 1,
code = 0x40c908 <rctu_info> "H\215E\370L9\370\017\202\277"
}
So locations of rctu
and rctv
rctu: 0x981610
rctv: 0x981618
There’s only 8 bytes in between. This doesn’t look right. Because rctu
is a static object with no payload, so it needs at least two words: info table pointer and static link field. Static link field is missing in the layout, so info table pointer of rctv
is used as static link.
At this point I remember different types of objects have link fields in different places so let’s check:
#define FUN_STATIC_LINK(p) (&(p)->payload[0])
Because I can’t add watchpoint to this location so I do this:
>>> break evacuate_static_object if q == 0x981610
Just to count which GC we’re at
>>> break GarbageCollect
>>> ignore 2 10000
First hit
>>> print (StgClosure)rctv_closure
$1 = {
header = {
info = 0x8c2960 <ghczmprim_GHCziTuple_Z2T_con_info>
},
payload = 0x8
}
>>> info breakpoints
Num Type Disp Enb Address What
1 breakpoint keep y 0x00000000008e7a73 in evacuate_static_object at rts/sm/Evac.c:340
stop only if q == 0x981610 (target evals)
breakpoint already hit 1 time
2 breakpoint keep y 0x00000000008e2144 in GarbageCollect at rts/sm/GC.c:226
breakpoint already hit 3 times
ignore next 9997 hits
Let’s see if this breaks it:
>>> fin
>>> print (StgClosure)rctv_closure
$2 = {
header = {
info = 0x981561
},
payload = 0x8
}
Yep! So this function call breaks rctv:
>>> bt
#0 evacuate_static_object (link_field=0x981618, q=0x981610 (rctu)) at rts/sm/Evac.c:340
#1 0x00000000008e7e20 in
evacuate (p=0x981628) at rts/sm/Evac.c:546
#2 0x0000000000911069 in
scavenge_static () at rts/sm/Scav.c:1831
#3 0x0000000000911638 in
scavenge_loop () at rts/sm/Scav.c:2185
#4 0x00000000008e38fb in
scavenge_until_all_done () at rts/sm/GC.c:1092
#5 0x00000000008e2558 in
GarbageCollect (collect_gen=1, do_heap_census=false, gc_type=0,
cap=0xad40c0 <MainCapability>, idle_cap=0x0) at rts/sm/GC.c:418
#6 0x00000000008d523b in
scheduleDoGC (pcap=0x7ffeced3a4a0, task=0x2620cc0, force_major=false) at rts/Schedule.c:1799
#7 0x00000000008d4787 in
schedule (initialCapability=0xad40c0 <MainCapability>, task=0x2620cc0) at rts/Schedule.c:545
#8 0x00000000008d5be1 in
scheduleWaitThread (tso=0x4200105388, ret=0x0, pcap=0x7ffeced3a590) at rts/Schedule.c:2533
#9 0x00000000008d7dd7 in
rts_evalLazyIO (cap=0x7ffeced3a590, p=0x981730, ret=0x0) at rts/RtsAPI.c:530
#10 0x00000000008d84ca in
hs_main (argc=1, argv=0x7ffeced3a788, main_closure=0x981730, rts_config=...) at rts/RtsMain.c:72
#11 0x000000000040d7da in
main ()
I’m convinced that layout of rctu
is wrong. I’ll add some debug prints to GHC to see why the layout is generated this way, but before that I’m saving all the binaries and dump files of this session.
The bug should be in cgTopRhsClosure
, the Cmm code generator for top-level closures.
Good thing GHC is deterministic enough these days so I get the same symbols when I recompile (later I realized that top-level symbols are deterministic enough but local ids usually change).
gen_code
closure_label: lvl3_rctu_closure
descr: <Parser.lvl3_rctu>
info_tbl: label: lvl3_rctu_info
rep:HeapRep static {
Fun {arity: 6 fun_type: ArgGen [True, False, False, True, True]} }
fv_details: []
This looks fine. Let’s check mkStaticClosureFields
:
mkStaticClosureFields
caf_refs: NoCafRefs
info_lbl: lvl3_rctu_info
is_caf: False
padding: []
static_link_field: []
saved_info_field: []
static_link_value: 3
So this object doesn’t get a static link field. This is because the condition
staticClosureNeedsLink (mayHaveCafRefs caf_refs) info_tbl
doesn’t hold.
staticClosureNeedsLink :: Bool -> CmmInfoTable -> Bool
-- A static closure needs a link field to aid the GC when traversing
-- the static closure graph. But it only needs such a field if either
-- a) it has an SRT
-- b) it's a constructor with one or more pointer fields
-- In case (b), the constructor's fields themselves play the role
-- of the SRT.
CmmInfoTable{ cit_rep = smrep }
staticClosureNeedsLink has_srt | isConRep smrep = not (isStaticNoCafCon smrep)
| otherwise = has_srt -- needsSRT (cit_srt info_tbl)
So has_srt
is False
for this object and idCafInfo
of this closure says it’s not caffy.
Just to make sure this is the problem, I modify the code to treat all top-level closures as caffy. Indeed that fixes the bug.
CAFness of bindings are decided in hasCafRefs
in TidyPgm
. Two conditions for CAFness are
First one obviously doesn’t hold becasue rctu
is a function. I don’t know why the second condition doesn’t hold yet.
TidyPgm
operates on Core so let’s look at Core of rctu
:
-- RHS size: {terms: 80, types: 1,207, coercions: 0, joins: 0/0}
lvl3_rctu
:: forall s.
Packed.Bytes.Parser.Maybe# (Packed.Bytes.Parser.Leftovers# s)
-> GHC.Prim.State# s
-> (# GHC.Prim.State# s, Packed.Bytes.Parser.Result# s Word #)
[GblId, Arity=2, Caf=NoCafRefs, Str=<S,1*U><S,U>, Unf=OtherCon []]
lvl3_rctu
= \ (@ s_a6Ca)
(leftovers0_a691
:: Packed.Bytes.Parser.Maybe#
(Packed.Bytes.Parser.Leftovers# s_a6Ca))
(s0_a692 :: GHC.Prim.State# s_a6Ca) ->
case leftovers0_a691 of {
(#_|#) ds_d6Km ->
(# s0_a692,
(# GHC.Prim.(#_|#)
@ ('GHC.Types.TupleRep '[])
@ ('GHC.Types.TupleRep
'['GHC.Types.TupleRep
'['GHC.Types.UnliftedRep, 'GHC.Types.IntRep, 'GHC.Types.IntRep],
'GHC.Types.LiftedRep])
@ (# #)
@ (Packed.Bytes.Parser.Leftovers# s_a6Ca)
GHC.Prim.(##),
GHC.Prim.(#_|#)
@ ('GHC.Types.TupleRep '[])
@ 'GHC.Types.LiftedRep
@ (# #)
@ Word
GHC.Prim.(##) #) #);
(#|_#) ds_d6Kn ->
case ds_d6Kn of { (# bytes0_scqD, stream0_scqE #) ->
case bytes0_scqD of { (# arr0_scqH, off0_scqI, len0_scqJ #) ->
case GHC.Prim.># len0_scqJ 0# of {
__DEFAULT ->
case Packed.Bytes.Parser.nextNonEmpty @ s_a6Ca stream0_scqE s0_a692
of
{ (# ipv_s6NZ, ipv1_s6O0 #) ->
case ipv1_s6O0 of {
(#_|#) ds3_d6JW ->
(# ipv_s6NZ,
(# GHC.Prim.(#_|#)
@ ('GHC.Types.TupleRep '[])
@ ('GHC.Types.TupleRep
'['GHC.Types.TupleRep
'['GHC.Types.UnliftedRep, 'GHC.Types.IntRep, 'GHC.Types.IntRep],
'GHC.Types.LiftedRep])
@ (# #)
@ (Packed.Bytes.Parser.Leftovers# s_a6Ca)
GHC.Prim.(##),
GHC.Prim.(#_|#)
@ ('GHC.Types.TupleRep '[])
@ 'GHC.Types.LiftedRep
@ (# #)
@ Word
GHC.Prim.(##) #) #);
(#|_#) ds3_d6JX ->
case ds3_d6JX of { (# bytes1_scqM, stream1_scqN #) ->
case bytes1_scqM of { (# arr1_scqQ, off1_scqR, ds4_scqS #) ->
(# ipv_s6NZ,
(# GHC.Prim.(#|_#)
@ ('GHC.Types.TupleRep '[])
@ ('GHC.Types.TupleRep
'['GHC.Types.TupleRep
'['GHC.Types.UnliftedRep, 'GHC.Types.IntRep, 'GHC.Types.IntRep],
'GHC.Types.LiftedRep])
@ (# #)
@ (Packed.Bytes.Parser.Leftovers# s_a6Ca)
(# (# arr1_scqQ, GHC.Prim.+# off1_scqR 1#,
GHC.Prim.-# ds4_scqS 1# #),
stream1_scqN #),
GHC.Prim.(#|_#)
@ ('GHC.Types.TupleRep '[])
@ 'GHC.Types.LiftedRep
@ (# #)
@ Word
a1_rcbR #) #)
}
}
}
};
1# ->
(# s0_a692,
(# GHC.Prim.(#|_#)
@ ('GHC.Types.TupleRep '[])
@ ('GHC.Types.TupleRep
'['GHC.Types.TupleRep
'['GHC.Types.UnliftedRep, 'GHC.Types.IntRep, 'GHC.Types.IntRep],
'GHC.Types.LiftedRep])
@ (# #)
@ (Packed.Bytes.Parser.Leftovers# s_a6Ca)
(# (# arr0_scqH, GHC.Prim.+# off0_scqI 1#,
GHC.Prim.-# len0_scqJ 1# #),
stream0_scqE #),
GHC.Prim.(#|_#)
@ ('GHC.Types.TupleRep '[])
@ 'GHC.Types.LiftedRep
@ (# #)
@ Word
a1_rcbR #) #)
}
}
}
}
This is huge and I’m not going to check this by hand. However at this point I realize that the condition for evacuating a FUN_STATIC
is
case FUN_STATIC:
if (info->srt_bitmap != 0) {
evacuate_static_object(FUN_STATIC_LINK((StgClosure *)q), q);
}
return;
So this is not marked as CAF, but its SRT bitmap is not empty. Perhaps the bug is in the SRT bitmap.
I also try this in gdb:
>>> set get_itbl(&rctu_closure)->srt_bitmap = 0
This also fixes it. Now let’s figure out why SRT bitmap of rctu
(which is supposed to be a non-CAF) is 1.
I look again at the definition of rctu
and it seems like a1_rcbR
is a free variable.
a1_rcbR :: GHC.Types.Word
[GblId, Caf=NoCafRefs, Str=m, Unf=OtherCon []] =
CCS_DONT_CARE GHC.Types.W#! [5##];
However because this is not a CAF it shouldn’t make rctu
CAF.
I stare at the code aimlessly for a while until I realize something interesting. This is the STG for rctu
after unarisation:
lvl3_rctu
:: forall s.
Packed.Bytes.Parser.Maybe# (Packed.Bytes.Parser.Leftovers# s)
-> GHC.Prim.State# s
-> (# GHC.Prim.State# s,
Packed.Bytes.Parser.Result# s GHC.Types.Word #)
[GblId,
Arity=2,
Caf=NoCafRefs,
Str=<S,1*U><S,U>,
Unf=OtherCon []] =
[] \r [us_gcwc us_gcwd us_gcwe us_gcwf us_gcwg void_0E]
case us_gcwc of tag_gcwh {
__DEFAULT ->
(#,,,,,,#) [1#
Control.Exception.Base.absentError
Control.Exception.Base.absentError
0##
0##
1#
Control.Exception.Base.absentError];
2# ->
case ># [us_gcwg 0#] of {
__DEFAULT ->
case Packed.Bytes.Parser.nextNonEmpty us_gcwe GHC.Prim.void# of {
(#,,,,#) us_gcwi us_gcwj us_gcwk us_gcwl us_gcwm ->
case us_gcwi of tag_gcwn {
__DEFAULT ->
(#,,,,,,#) [1#
Control.Exception.Base.absentError
Control.Exception.Base.absentError
0##
0##
1#
Control.Exception.Base.absentError];
2# ->
case -# [us_gcwm 1#] of sat_scuk {
__DEFAULT ->
case +# [us_gcwl 1#] of sat_scuj {
__DEFAULT ->
(#,,,,,,#) [2#
us_gcwj
us_gcwk
sat_scuj
sat_scuk
2#
a1_rcbR];
};
};
};
};
1# ->
case -# [us_gcwg 1#] of sat_scur {
__DEFAULT ->
case +# [us_gcwf 1#] of sat_scuq {
__DEFAULT ->
(#,,,,,,#) [2# us_gcwd us_gcwe sat_scuq sat_scur 2# a1_rcbR];
};
};
};
};
This has two free variables: a1_rcbR
and Control.Exception.Base.absentError
. a1_rcbR
is clearly not a CAF, but maybe absentError
is? Because absentError
references are generated in unarise, which happens much later than TidyPgm
(right before generating Cmm), TidyPgm
can’t take those into account.
Id
for absentError
is generated using vanillaIdInfo
which sets the CAF-ness field as MayHaveCafRefs
. So clearly absentError
is a CAFFY.
So in summary, we introduce references to CAFs in UnariseStg
which happens later than TidyPgm
. The code generator is somehow handles this correctly by generating a SRT for this function and setting the info table bitmap correctly. However object layout generator (in cgTopRhsClosure
) only uses the information generated in TidyPgm
, so we get this mismatch of SRT information of this function.
Some ways to fix this:
absentError
TidyPgm
absentError
as non-CAF (I don’t know if it has to be a CAF, I think error ids are conservatively marked as CAFs)The whole thing took 4:47 hours.