May 27, 2014 - Tagged as: haskell, ghc.
I just came across a very serious GHC bug that prevents me from making progress on my GHCJS project. It looks like it’s very hard to come up with a minimal program that reproduces the bug. Here’s a very long way to reproduce it:
Install patched Cabal for GHCJS.
Clone my fork of GHCJS, switch to the work branch and install it.
Delete the ~/.ghcjs folder if you already have one.
Run ghcjs-boot --init --prof -v and wait for it to fail with a segmentation fault while compiling profiling object files for the base library. Cabal reports the segmentation fault as ExitFailure (-11).
Since we run ghcjs-boot in verbose mode (using the -v parameter), at this point it should have printed the command it was running when the error occurred. It should be a very long command, starting with something like /home/omer/.cabal/bin/ghcjs --make -fbuilding-cabal-package -O -prof ... Copy that command, because that’s how we reproduce the error.
Now, to trace the bug, we need to compile GHCJS for debugging. Note that booting GHCJS with a debugging build takes forever, which is why we first installed GHCJS and built some libraries using the fast build. To compile for debugging, first run cabal clean and then cabal install --disable-executable-stripping --ghc-options=-debug.
Move to ~/.ghcjs/<your platform>/ghcjs-boot/boot/base and run the command you copied from the ghcjs-boot verbose output when the error first happened. You should get the same error very quickly.
At that point you can use gdb and friends. On my system, a gdb backtrace gives this:
[ 5 of 202] Compiling GHC.Unicode[boot] ( GHC/Unicode.hs-boot, dist/build/GHC/Unicode.js_p_o-boot )
Detaching after fork from child process 3382.
[ 6 of 202] Compiling GHC.IO[boot] ( GHC/IO.hs-boot, dist/build/GHC/IO.js_p_o-boot )
Detaching after fork from child process 3383.
[ 7 of 202] Compiling GHC.Exception[boot] ( GHC/Exception.lhs-boot, dist/build/GHC/Exception.js_p_o-boot )
Detaching after fork from child process 3384.
[ 51 of 202] Compiling GHC.Fingerprint[boot] ( GHC/Fingerprint.hs-boot, dist/build/GHC/Fingerprint.js_p_o-boot )
Detaching after fork from child process 3385.
[ 55 of 202] Compiling GHC.IO.Exception[boot] ( GHC/IO/Exception.hs-boot, dist/build/GHC/IO/Exception.js_p_o-boot )
Detaching after fork from child process 3386.
[ 75 of 202] Compiling Foreign.C.Types ( Foreign/C/Types.hs, dist/build/Foreign/C/Types.js_p_o )
Program received signal SIGSEGV, Segmentation fault.
0x000000000425d5c4 in LOOKS_LIKE_CLOSURE_PTR (p=0x0) at includes/rts/storage/ClosureMacros.h:258
258 includes/rts/storage/ClosureMacros.h: No such file or directory.
(gdb) bt
#0 0x000000000425d5c4 in LOOKS_LIKE_CLOSURE_PTR (p=0x0) at includes/rts/storage/ClosureMacros.h:258
#1 0x000000000425f776 in scavenge_mutable_list1 (bd=0x7fffe5c02a00, gen=0x4d1fd48) at rts/sm/Scav.c:1400
#2 0x000000000425fa13 in scavenge_capability_mut_Lists1 (cap=0x4cfe5c0 <MainCapability>) at rts/sm/Scav.c:1493
#3 0x0000000004256b66 in GarbageCollect (collect_gen=0, do_heap_census=rtsFalse, gc_type=2,
cap=0x4cfe5c0 <MainCapability>) at rts/sm/GC.c:342
#4 0x00000000042454a3 in scheduleDoGC (pcap=0x7fffffffc198, task=0x4d32b60, force_major=rtsFalse)
at rts/Schedule.c:1650
#5 0x0000000004243de4 in schedule (initialCapability=0x4cfe5c0 <MainCapability>, task=0x4d32b60)
at rts/Schedule.c:553
#6 0x0000000004246436 in scheduleWaitThread (tso=0x7ffff6708d60, ret=0x0, pcap=0x7fffffffc2c0) at rts/Schedule.c:2346
#7 0x000000000423e9b4 in rts_evalLazyIO (cap=0x7fffffffc2c0, p=0x477f850, ret=0x0) at rts/RtsAPI.c:500
#8 0x0000000004241666 in real_main () at rts/RtsMain.c:63
#9 0x0000000004241759 in hs_main (argc=237, argv=0x7fffffffc448, main_closure=0x477f850, rts_config=...)
at rts/RtsMain.c:114
#10 0x0000000000408ea7 in main ()
Thanks to Luite Stegeman for helping me with debugging. We could reproduce this error on 64-bit Linux and 64-bit OS X. I don’t know how to track this bug down, but it has made progress on my project impossible.
EDIT: I just found a workaround: run GHCJS with +RTS -G1. I was reading the code of the functions in the backtrace and noticed that some RTS parameters are used. I found this one by changing them at random. Some details about this parameter are explained in the GC section of the GHC Trac (see the “generational gc” parts). Unfortunately, with this RTS parameter GHCJS runs at least 100x slower (at least in the debug build), so even though I can boot GHCJS with my changes, it’s not very usable, since at this stage of development I sometimes boot it 10 times a day.
EDIT 2: My workaround did not work that well. Compilation now fails at a different stage:
Failed to install base-4.7.0.0
Last 10 lines of the build log ( /home/omer/.cabal/logs/base-4.7.0.0.log ):
[ 4 of 202] Compiling Unsafe.Coerce ( Unsafe/Coerce.hs, dist/build/Unsafe/Coerce.js_p_o )
[ 5 of 202] Compiling GHC.Unicode[boot] ( GHC/Unicode.hs-boot, dist/build/GHC/Unicode.js_p_o-boot )
[ 6 of 202] Compiling GHC.IO[boot] ( GHC/IO.hs-boot, dist/build/GHC/IO.js_p_o-boot )
[ 7 of 202] Compiling GHC.Exception[boot] ( GHC/Exception.lhs-boot, dist/build/GHC/Exception.js_p_o-boot )
[ 8 of 202] Compiling GHC.Err ( GHC/Err.lhs, dist/build/GHC/Err.js_p_o )
[ 9 of 202] Compiling GHC.Base ( GHC/Base.lhs, dist/build/GHC/Base.js_p_o )
[ 10 of 202] Compiling Data.Maybe ( Data/Maybe.hs, dist/build/Data/Maybe.js_p_o )
[ 11 of 202] Compiling GHC.List ( GHC/List.lhs, dist/build/GHC/List.js_p_o )
[ 12 of 202] Compiling GHC.Num ( GHC/Num.lhs, dist/build/GHC/Num.js_p_o )
[ 13 of 202] Compiling GHC.Show ( GHC/Show.lhs, dist/build/GHC/Show.js_p_o )
cabal: Error: some packages failed to install:
base-4.7.0.0 failed during the building phase. The exception was:
ExitFailure (-11)
However, when I ran the command that led to this segfault under gdb to see whether it would give a different backtrace, it worked fine and no segfault happened.
EDIT 3: There have been some changes in the GHCJS codegen, and I wanted to try to reproduce this bug with the new codegen, hoping that it might work around the segfault (I rebased my patches onto the new version). It’s still happening, but now in a different file:
[112 of 202] Compiling System.Posix.Types ( System/Posix/Types.hs, dist/build/System/Posix/Types.js_p_o )
Program received signal SIGSEGV, Segmentation fault.
0x0000000004254434 in LOOKS_LIKE_CLOSURE_PTR (p=0x0) at includes/rts/storage/ClosureMacros.h:258
258 includes/rts/storage/ClosureMacros.h: No such file or directory.
(gdb) bt
#0 0x0000000004254434 in LOOKS_LIKE_CLOSURE_PTR (p=0x0) at includes/rts/storage/ClosureMacros.h:258
#1 0x00000000042565e6 in scavenge_mutable_list1 (bd=0x7fffe5800c40, gen=0x4d17d48) at rts/sm/Scav.c:1400
#2 0x0000000004256883 in scavenge_capability_mut_Lists1 (cap=0x4cf6340 <MainCapability>) at rts/sm/Scav.c:1493
#3 0x000000000424d9d6 in GarbageCollect (collect_gen=0, do_heap_census=rtsFalse, gc_type=2,
cap=0x4cf6340 <MainCapability>) at rts/sm/GC.c:342
#4 0x000000000423c313 in scheduleDoGC (pcap=0x7fffffffc198, task=0x4d2ab60, force_major=rtsFalse)
at rts/Schedule.c:1650
#5 0x000000000423ac54 in schedule (initialCapability=0x4cf6340 <MainCapability>, task=0x4d2ab60)
at rts/Schedule.c:553
#6 0x000000000423d2a6 in scheduleWaitThread (tso=0x7ffff6708d60, ret=0x0, pcap=0x7fffffffc2c0) at rts/Schedule.c:2346
#7 0x0000000004235824 in rts_evalLazyIO (cap=0x7fffffffc2c0, p=0x4776850, ret=0x0) at rts/RtsAPI.c:500
#8 0x00000000042384d6 in real_main () at rts/RtsMain.c:63
#9 0x00000000042385c9 in hs_main (argc=237, argv=0x7fffffffc448, main_closure=0x4776850, rts_config=...)
at rts/RtsMain.c:114
#10 0x0000000000408ea7 in main ()
The backtrace is the same. I think it’s interesting that this problem happens while compiling Types modules in 2 of the 3 cases (Foreign.C.Types and System.Posix.Types). As far as I can see, these two modules use INTEGRAL_TYPE extensively, which creates a newtype and implements Read and Show instances. The other problematic module, GHC.Show, also implements the same instances, but does so manually instead of calling INTEGRAL_TYPE. I wonder if it could be related to that.
EDIT 4: When I ran GHCJS with --no-native, it failed with an assertion error rather than a segfault:
[112 of 202] Compiling System.Posix.Types ( System/Posix/Types.hs, dist/build/System/Posix/Types.js_p_o )
ghcjs: internal error: ASSERTION FAILED: file rts/sm/Scav.c, line 1400
(GHC version 7.8.2 for x86_64_unknown_linux)
Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug
Program received signal SIGABRT, Aborted.
0x00007ffff687f849 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007ffff687f849 in raise () from /lib64/libc.so.6
#1 0x00007ffff6880cd8 in abort () from /lib64/libc.so.6
#2 0x0000000004238a27 in rtsFatalInternalErrorFn (s=0x4554e60 "ASSERTION FAILED: file %s, line %u\n",
ap=0x7fffffffbe58) at rts/RtsMessages.c:170
#3 0x000000000423865f in barf (s=0x4554e60 "ASSERTION FAILED: file %s, line %u\n") at rts/RtsMessages.c:42
#4 0x00000000042386c2 in _assertFail (filename=0x4559fbd "rts/sm/Scav.c", linenum=1400) at rts/RtsMessages.c:57
#5 0x00000000042565e9 in scavenge_mutable_list1 (bd=0x7fffe7402dc0, gen=0x4d15d88) at rts/sm/Scav.c:1400
#6 0x0000000004256873 in scavenge_capability_mut_Lists1 (cap=0x4cf49c0 <MainCapability>) at rts/sm/Scav.c:1493
#7 0x000000000424d9c6 in GarbageCollect (collect_gen=0, do_heap_census=rtsFalse, gc_type=2,
cap=0x4cf49c0 <MainCapability>) at rts/sm/GC.c:342
#8 0x000000000423c303 in scheduleDoGC (pcap=0x7fffffffc188, task=0x4d28ba0, force_major=rtsFalse)
at rts/Schedule.c:1650
#9 0x000000000423ac44 in schedule (initialCapability=0x4cf49c0 <MainCapability>, task=0x4d28ba0)
at rts/Schedule.c:553
#10 0x000000000423d296 in scheduleWaitThread (tso=0x7ffff6708d60, ret=0x0, pcap=0x7fffffffc2b0) at rts/Schedule.c:2346
#11 0x0000000004235814 in rts_evalLazyIO (cap=0x7fffffffc2b0, p=0x4776850, ret=0x0) at rts/RtsAPI.c:500
#12 0x00000000042384c6 in real_main () at rts/RtsMain.c:63
#13 0x00000000042385b9 in hs_main (argc=238, argv=0x7fffffffc438, main_closure=0x4776850, rts_config=...)
at rts/RtsMain.c:114
#14 0x0000000000408ea7 in main ()
Not passing --no-native results in a segfault with the same backtrace as before.
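To make the two failure modes a bit more concrete, here is a toy C model of the kind of check the backtraces point at. It is only a sketch under assumptions: ToyClosure, toy_info, toy_check_entry and toy_first_bad_entry are made-up names, not RTS code, and I am assuming that the check at rts/sm/Scav.c:1400 walks the entries of a mutable list and inspects each closure’s header. In this model a null entry corresponds to the p=0x0 seen in the segfault backtraces, and a non-null entry with a bad header corresponds to a check that fails without crashing, which is one plausible reading of the assertion failure.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; none of these are real RTS types or functions. */
typedef struct {
    const void *info;   /* in this model, a valid closure points at toy_info */
} ToyClosure;

/* The only "valid" header value in this model. */
static const char toy_info = 0;

typedef enum { ENTRY_OK, ENTRY_NULL, ENTRY_BAD_HEADER } EntryStatus;

/* Classify one mutable-list entry. A check that read p->info without the
   NULL test would fault on a null entry instead of returning. */
EntryStatus toy_check_entry(const ToyClosure *p) {
    if (p == NULL) {
        return ENTRY_NULL;
    }
    if (p->info != &toy_info) {
        return ENTRY_BAD_HEADER;
    }
    return ENTRY_OK;
}

/* Walk the list and return the index of the first suspicious entry,
   or -1 if every entry passes. The kind of problem is stored in *status. */
int toy_first_bad_entry(const ToyClosure *const *list, size_t n,
                        EntryStatus *status) {
    assert(status != NULL);
    for (size_t i = 0; i < n; i++) {
        EntryStatus s = toy_check_entry(list[i]);
        if (s != ENTRY_OK) {
            *status = s;
            return (int)i;
        }
    }
    *status = ENTRY_OK;
    return -1;
}
```

Calling toy_first_bad_entry on a list that holds a valid closure, a null pointer and a closure with a bad header reports index 1 with ENTRY_NULL. Judging by the backtrace, the real check dereferences the pointer without such a guard, which would explain a crash on the dereference rather than a clean report.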