osa1 feed

The GHC bug that ruined my day

May 27, 2014 - Tagged as: haskell, ghc.

I just came across a very serious GHC bug that prevents me from making progress on my GHCJS project. It looks like it’s very hard to come up with a minimal program that reproduces the bug. Here’s a very long way to reproduce it:

[  5 of 202] Compiling GHC.Unicode[boot] ( GHC/Unicode.hs-boot, dist/build/GHC/Unicode.js_p_o-boot )
Detaching after fork from child process 3382.
[  6 of 202] Compiling GHC.IO[boot]     ( GHC/IO.hs-boot, dist/build/GHC/IO.js_p_o-boot )
Detaching after fork from child process 3383.
[  7 of 202] Compiling GHC.Exception[boot] ( GHC/Exception.lhs-boot, dist/build/GHC/Exception.js_p_o-boot )
Detaching after fork from child process 3384.
[ 51 of 202] Compiling GHC.Fingerprint[boot] ( GHC/Fingerprint.hs-boot, dist/build/GHC/Fingerprint.js_p_o-boot )
Detaching after fork from child process 3385.
[ 55 of 202] Compiling GHC.IO.Exception[boot] ( GHC/IO/Exception.hs-boot, dist/build/GHC/IO/Exception.js_p_o-boot )
Detaching after fork from child process 3386.
[ 75 of 202] Compiling Foreign.C.Types  ( Foreign/C/Types.hs, dist/build/Foreign/C/Types.js_p_o )

Program received signal SIGSEGV, Segmentation fault.
0x000000000425d5c4 in LOOKS_LIKE_CLOSURE_PTR (p=0x0) at includes/rts/storage/ClosureMacros.h:258
258     includes/rts/storage/ClosureMacros.h: No such file or directory.
(gdb) bt
#0  0x000000000425d5c4 in LOOKS_LIKE_CLOSURE_PTR (p=0x0) at includes/rts/storage/ClosureMacros.h:258
#1  0x000000000425f776 in scavenge_mutable_list1 (bd=0x7fffe5c02a00, gen=0x4d1fd48) at rts/sm/Scav.c:1400
#2  0x000000000425fa13 in scavenge_capability_mut_Lists1 (cap=0x4cfe5c0 <MainCapability>) at rts/sm/Scav.c:1493
#3  0x0000000004256b66 in GarbageCollect (collect_gen=0, do_heap_census=rtsFalse, gc_type=2,
    cap=0x4cfe5c0 <MainCapability>) at rts/sm/GC.c:342
#4  0x00000000042454a3 in scheduleDoGC (pcap=0x7fffffffc198, task=0x4d32b60, force_major=rtsFalse)
    at rts/Schedule.c:1650
#5  0x0000000004243de4 in schedule (initialCapability=0x4cfe5c0 <MainCapability>, task=0x4d32b60)
    at rts/Schedule.c:553
#6  0x0000000004246436 in scheduleWaitThread (tso=0x7ffff6708d60, ret=0x0, pcap=0x7fffffffc2c0) at rts/Schedule.c:2346
#7  0x000000000423e9b4 in rts_evalLazyIO (cap=0x7fffffffc2c0, p=0x477f850, ret=0x0) at rts/RtsAPI.c:500
#8  0x0000000004241666 in real_main () at rts/RtsMain.c:63
#9  0x0000000004241759 in hs_main (argc=237, argv=0x7fffffffc448, main_closure=0x477f850, rts_config=...)
    at rts/RtsMain.c:114
#10 0x0000000000408ea7 in main ()

Thanks to Luite Stegeman for helping me with debugging. We could reproduce this error on 64bit Linux and 64bit OS X. I don’t know how to track down this bug but it just made making progress in my project impossible.


EDIT: I just found a workaround: run GHCJS with +RTS -G1. I was reading code of functions in the backtrace and I realized some RTS parameters are used. By playing with them changing randomly, I found this. Some details about this parameter is explained in GHC Trac GC section(see “generational gc” parts). Unfortunately with this RTS parameter GHCJS runs at least 100x slower(at least in debug build) so even though I can boot GHCJS with my changes it’s not very usable since I sometimes boot it 10 times a day in this stage of the development.


EDIT 2: My workaround did not work great – now compilation is failing in some other stage:

Failed to install base-4.7.0.0
Last 10 lines of the build log ( /home/omer/.cabal/logs/base-4.7.0.0.log ):
[  4 of 202] Compiling Unsafe.Coerce    ( Unsafe/Coerce.hs, dist/build/Unsafe/Coerce.js_p_o )
[  5 of 202] Compiling GHC.Unicode[boot] ( GHC/Unicode.hs-boot, dist/build/GHC/Unicode.js_p_o-boot )
[  6 of 202] Compiling GHC.IO[boot]     ( GHC/IO.hs-boot, dist/build/GHC/IO.js_p_o-boot )
[  7 of 202] Compiling GHC.Exception[boot] ( GHC/Exception.lhs-boot, dist/build/GHC/Exception.js_p_o-boot )
[  8 of 202] Compiling GHC.Err          ( GHC/Err.lhs, dist/build/GHC/Err.js_p_o )
[  9 of 202] Compiling GHC.Base         ( GHC/Base.lhs, dist/build/GHC/Base.js_p_o )
[ 10 of 202] Compiling Data.Maybe       ( Data/Maybe.hs, dist/build/Data/Maybe.js_p_o )
[ 11 of 202] Compiling GHC.List         ( GHC/List.lhs, dist/build/GHC/List.js_p_o )
[ 12 of 202] Compiling GHC.Num          ( GHC/Num.lhs, dist/build/GHC/Num.js_p_o )
[ 13 of 202] Compiling GHC.Show         ( GHC/Show.lhs, dist/build/GHC/Show.js_p_o )
cabal: Error: some packages failed to install:
base-4.7.0.0 failed during the building phase. The exception was:
ExitFailure (-11)

However, when I tried to run the command that led to this segfault in gdb to see if it gives a different backtrace, it worked fine and no segfaults happened.


EDIT 3: There has been some changes in GHCJS codegen and I wanted to try to reproduce this bug using that new codegen, hoping that it may be a workaround for the segfault(I rebased my patches for new version). It’s still happening, but now in different file:

[112 of 202] Compiling System.Posix.Types ( System/Posix/Types.hs, dist/build/System/Posix/Types.js_p_o )

Program received signal SIGSEGV, Segmentation fault.
0x0000000004254434 in LOOKS_LIKE_CLOSURE_PTR (p=0x0) at includes/rts/storage/ClosureMacros.h:258
258     includes/rts/storage/ClosureMacros.h: No such file or directory.
(gdb) bt
#0  0x0000000004254434 in LOOKS_LIKE_CLOSURE_PTR (p=0x0) at includes/rts/storage/ClosureMacros.h:258
#1  0x00000000042565e6 in scavenge_mutable_list1 (bd=0x7fffe5800c40, gen=0x4d17d48) at rts/sm/Scav.c:1400
#2  0x0000000004256883 in scavenge_capability_mut_Lists1 (cap=0x4cf6340 <MainCapability>) at rts/sm/Scav.c:1493
#3  0x000000000424d9d6 in GarbageCollect (collect_gen=0, do_heap_census=rtsFalse, gc_type=2, 
    cap=0x4cf6340 <MainCapability>) at rts/sm/GC.c:342
#4  0x000000000423c313 in scheduleDoGC (pcap=0x7fffffffc198, task=0x4d2ab60, force_major=rtsFalse)
    at rts/Schedule.c:1650
#5  0x000000000423ac54 in schedule (initialCapability=0x4cf6340 <MainCapability>, task=0x4d2ab60)
    at rts/Schedule.c:553
#6  0x000000000423d2a6 in scheduleWaitThread (tso=0x7ffff6708d60, ret=0x0, pcap=0x7fffffffc2c0) at rts/Schedule.c:2346
#7  0x0000000004235824 in rts_evalLazyIO (cap=0x7fffffffc2c0, p=0x4776850, ret=0x0) at rts/RtsAPI.c:500
#8  0x00000000042384d6 in real_main () at rts/RtsMain.c:63
#9  0x00000000042385c9 in hs_main (argc=237, argv=0x7fffffffc448, main_closure=0x4776850, rts_config=...)
    at rts/RtsMain.c:114
#10 0x0000000000408ea7 in main ()

Backtrace is same. I think it’s interesting that this problem is happening while compiling Types modules in 2/3 of the cases(Foreign.C.Types and System.Posix.Types). As far as I can see this two modules are using INTEGRAL_TYPE extensively, which creates a newtype and implements Read and Show instances. Other problematic module, GHC.Show also implements same instances, but does that manually instead of calling INTEGRAL_TYPE. I wonder if it could be related with that.


EDIT 4: When I run GHCJS using --no-native, it failed with an assertion error rather than a segfault:

[112 of 202] Compiling System.Posix.Types ( System/Posix/Types.hs, dist/build/System/Posix/Types.js_p_o )
ghcjs: internal error: ASSERTION FAILED: file rts/sm/Scav.c, line 1400

    (GHC version 7.8.2 for x86_64_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

Program received signal SIGABRT, Aborted.
0x00007ffff687f849 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff687f849 in raise () from /lib64/libc.so.6
#1  0x00007ffff6880cd8 in abort () from /lib64/libc.so.6
#2  0x0000000004238a27 in rtsFatalInternalErrorFn (s=0x4554e60 "ASSERTION FAILED: file %s, line %u\n", 
    ap=0x7fffffffbe58) at rts/RtsMessages.c:170
#3  0x000000000423865f in barf (s=0x4554e60 "ASSERTION FAILED: file %s, line %u\n") at rts/RtsMessages.c:42
#4  0x00000000042386c2 in _assertFail (filename=0x4559fbd "rts/sm/Scav.c", linenum=1400) at rts/RtsMessages.c:57
#5  0x00000000042565e9 in scavenge_mutable_list1 (bd=0x7fffe7402dc0, gen=0x4d15d88) at rts/sm/Scav.c:1400
#6  0x0000000004256873 in scavenge_capability_mut_Lists1 (cap=0x4cf49c0 <MainCapability>) at rts/sm/Scav.c:1493
#7  0x000000000424d9c6 in GarbageCollect (collect_gen=0, do_heap_census=rtsFalse, gc_type=2, 
    cap=0x4cf49c0 <MainCapability>) at rts/sm/GC.c:342
#8  0x000000000423c303 in scheduleDoGC (pcap=0x7fffffffc188, task=0x4d28ba0, force_major=rtsFalse)
    at rts/Schedule.c:1650
#9  0x000000000423ac44 in schedule (initialCapability=0x4cf49c0 <MainCapability>, task=0x4d28ba0)
    at rts/Schedule.c:553
#10 0x000000000423d296 in scheduleWaitThread (tso=0x7ffff6708d60, ret=0x0, pcap=0x7fffffffc2b0) at rts/Schedule.c:2346
#11 0x0000000004235814 in rts_evalLazyIO (cap=0x7fffffffc2b0, p=0x4776850, ret=0x0) at rts/RtsAPI.c:500
#12 0x00000000042384c6 in real_main () at rts/RtsMain.c:63
#13 0x00000000042385b9 in hs_main (argc=238, argv=0x7fffffffc438, main_closure=0x4776850, rts_config=...)
    at rts/RtsMain.c:114
#14 0x0000000000408ea7 in main ()

Not passing --not-native is resulting with a segfault and same backtrace as before.