https://bugzilla.redhat.com/show_bug.cgi?id=1240487
Bug ID: 1240487
Summary: erl segfault on fedora-23-i686 (autoconf testsuite)
Product: Fedora
Version: rawhide
Component: erlang
Assignee: lemenkov@gmail.com
Reporter: praiskup@redhat.com
QA Contact: extras-qa@fedoraproject.org
CC: erlang@lists.fedoraproject.org, lemenkov@gmail.com, rhbugs@n-dimensional.de, s@shk.io
We observe autoconf FTBFS on rawhide (testsuite failures). One of the testsuite failures is related to Erlang & autoconf, but it appears only on i686. I tried to cut the related test case out into a reproducer, segfault-i686.tar.gz:

$ tar -xf segfault-i686.tar.gz
$ cd segfault-i686
$ make && make run
erlc -b beam my_testsuite.erl
cd lib && ./compile
erl -pa ./lib -s my_testsuite test
Erlang/OTP 17 [erts-6.3] [source] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V6.3 (abort with ^G)
1> All 3 tests passed.
Makefile:6: recipe for target 'run' failed
make: *** [run] Segmentation fault (core dumped)

The segfault ^^ breaks the autoconf testsuite, but I'm not able to diagnose it properly. Any help is appreciated; let me know if you need any other info.
FTBFS: https://kojipkgs.fedoraproject.org//work/tasks/7404/10217404/build.log
Pavel
Pavel Raiskup praiskup@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |1236072
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1236072 [Bug 1236072] FTBFS - two test cases failed on i686
--- Comment #1 from Pavel Raiskup praiskup@redhat.com ---
Created attachment 1049095
  --> https://bugzilla.redhat.com/attachment.cgi?id=1049095&action=edit
Reproducer
Randy Barlow rbarlow@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rbarlow@redhat.com
--- Comment #3 from Randy Barlow rbarlow@redhat.com --- I have also been experiencing this bug, and am unfortunately unable to run the test suites on my packages for i686. I see this issue on Rawhide.
Peter Lemenkov lemenkov@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
--- Comment #4 from Peter Lemenkov lemenkov@gmail.com --- Ok, now I've got the same issues both in F-23 and in Rawhide. The latest failure is here:
http://koji.fedoraproject.org/koji/taskinfo?taskID=12592117
Surprisingly, I can see them only in Koji. If I run the build manually (with rpmbuild), everything is fine.
--- Comment #5 from Peter Lemenkov lemenkov@gmail.com --- Unfortunately this reproducer doesn't reproduce the issue on my machine. Everything is fine:
[petro@fedora32i686 segfault-i686]$ make run
erl -pa ./lib -s my_testsuite test
Erlang/OTP 18 [erts-7.2.1] [source] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V7.2.1 (abort with ^G)
1> All 3 tests passed.
[petro@fedora32i686 segfault-i686]$
Peter Lemenkov lemenkov@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |praiskup@redhat.com
              Flags|                            |needinfo?(praiskup@redhat.com)
--- Comment #6 from Peter Lemenkov lemenkov@gmail.com --- (In reply to Pavel Raiskup from comment #0)
> We observe autoconf FTBFS on rawhide (testsuite failures). One of the testsuite failures is related to Erlang & autoconf, but it appears only on i686. I tried to cut the related test case out into a reproducer, segfault-i686.tar.gz:
>
> $ tar -xf segfault-i686.tar.gz
> $ cd segfault-i686
> $ make && make run
> erlc -b beam my_testsuite.erl
> cd lib && ./compile
> erl -pa ./lib -s my_testsuite test
> Erlang/OTP 17 [erts-6.3] [source] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]
>
> Eshell V6.3 (abort with ^G)
> 1> All 3 tests passed.
> Makefile:6: recipe for target 'run' failed
> make: *** [run] Segmentation fault (core dumped)
>
> The segfault ^^ breaks the autoconf testsuite, but I'm not able to diagnose it properly. Any help is appreciated; let me know if you need any other info.
>
> FTBFS: https://kojipkgs.fedoraproject.org//work/tasks/7404/10217404/build.log
>
> Pavel
Pavel, I've just checked, and the issue is still there. Unfortunately I can't reproduce it on my hardware (a KVM VM), only in Fedora Koji. Do you have access to a machine where the issue can be reproduced?
I really don't have any clue what's going on there.
Pavel Raiskup praiskup@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|needinfo?(praiskup@redhat.com)|
--- Comment #7 from Pavel Raiskup praiskup@redhat.com --- I'm not able to reproduce this now.
--- Comment #8 from Randy Barlow rbarlow@redhat.com --- Hello Peter!
Is it possible that the recent update to Erlang 18 fixed this issue?
P.S. Now we really have to get ejabberd updated, as it doesn't seem to work with Erlang 18 ☺ If you have some time, jcline and I have a few package review requests waiting. We CAN review each other's if necessary, but we'd rather that someone with more Erlang experience than we have review them if you or anyone else has the time. Oh, if we only had more time, right?
--- Comment #9 from Peter Lemenkov lemenkov@gmail.com --- (In reply to Randy Barlow from comment #8)
> Hello Peter!
>
> Is it possible that the recent update to Erlang 18 fixed this issue?
Randy, it's certainly not fixed yet. And I'm afraid this issue has something to do with the Koji build system itself (hardware + software + configuration) rather than with Erlang.
I'm still trying to find an Erlang-related issue, but I have failed to reproduce it anywhere (with Erlang on a native i686 Rawhide, with mock builds for Rawhide on RHEL6/RHEL7) on the machines available to me.
The only place where I can reproduce this issue reliably, 100% of the time, is the Fedora Koji build system. This makes me very suspicious.
--- Comment #10 from Randy Barlow rbarlow@redhat.com --- Hi Peter!
Interesting, working on problems that are hard to reproduce is tricky. I am sad to say that I am out of ideas. If you can think of a way I can assist, I am happy to!
Peter Lemenkov lemenkov@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |needinfo?(filip@andresovi.net)
--- Comment #11 from Peter Lemenkov lemenkov@gmail.com --- Filip, you mentioned in bug 1221824#c20 that you have a reproducer. Could you please run it again with strace or gdb attached? We really need your help here. :)
Filip Andres filip@andresovi.net changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|needinfo?(filip@andresovi.net)|
--- Comment #12 from Filip Andres filip@andresovi.net --- Hi Peter, sure, will do in the evening, when I get to my fedora box.
f.
--- Comment #13 from Filip Andres filip@andresovi.net --- Hi, I have been commenting on the other issue (https://bugzilla.redhat.com/show_bug.cgi?id=1221824), sorry :-) Copying the most important parts here:
* strace -- useless, the VM crashes in userspace (https://bugzilla.redhat.com/attachment.cgi?id=1116279)
* gdb stacktrace

(gdb) bt
#0  0x56798d70 in ethr_dw_atomic_cmpxchg () at ../include/internal/i386/atomic.h:177
#1  0x566103ce in ethr_dw_atomic_cmpxchg_nob (xchg=0xf4e0609c, new=0xf4e060a4, var=0x568688f0 <erts_proc+48>) at beam/erl_threads.h:1456
#2  erts_atomic64_inc_read_nob (var=0x568688f0 <erts_proc+48>) at beam/erl_threads.h:1646
#3  step_interval_nob (icp=0x568688f0 <erts_proc+48>) at beam/utils.c:4954
#4  erts_smp_step_interval_nob (icp=icp@entry=0x568688f0 <erts_proc+48>) at beam/utils.c:5004
#5  0x5671572b in ptab_list_bif_engine (c_p=c_p@entry=0xf6dc0218, res_accp=res_accp@entry=0xf4e06178, mbp=mbp@entry=0xf1f80a88) at beam/erl_ptab.c:927
#6  0x56716a5d in erts_ptab_list (c_p=c_p@entry=0xf6dc0218, ptab=0x568688c0 <erts_proc>) at beam/erl_ptab.c:766
#7  0x5661be76 in processes_0 (A__p=0xf6dc0218, BIF__ARGS=0xf7483100) at beam/bif.c:3841
#8  0x5659978b in process_main () at beam/beam_emu.c:3690
#9  0x56638784 in sched_thread_func (vesdp=0xf6087dc0) at beam/erl_process.c:8021
#10 0x567a19cc in thr_wrapper (vtwd=0xffffd1b4) at pthread/ethread.c:114
#11 0xf7f164be in start_thread (arg=0xf4e06b40) at pthread_create.c:333
#12 0xf7e2a3fe in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:114
* the problem seems to be triggered by the i686 build using the -mtune=atom flag. I tried the following change, and the resulting binary doesn't have the problem:
%ifarch %{ix86}
%global optflags -mtune=generic
%endif
Build: http://koji.fedoraproject.org/koji/taskinfo?taskID=12621253
Now the erlang:processes() command executes successfully:
$ mock -r fedora-rawhide-i386 --no-clean --shell
INFO: mock.py version 1.2.14 starting (python version = 3.4.2)...
Start: init plugins
INFO: selinux enabled
Finish: init plugins
Start: run
Start: chroot init
INFO: calling preinit hooks
INFO: enabled root cache
INFO: enabled dnf cache
Start: cleaning dnf metadata
Finish: cleaning dnf metadata
INFO: enabled ccache
Finish: chroot init
Start: shell
<mock-chroot>sh-4.3# erl
Erlang/OTP 18 [erts-7.2.1] [source] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V7.2.1 (abort with ^G)
1> erlang:processes().
[<0.0.0>,<0.3.0>,<0.6.0>,<0.7.0>,<0.9.0>,<0.10.0>,<0.11.0>,
 <0.12.0>,<0.14.0>,<0.15.0>,<0.16.0>,<0.17.0>,<0.18.0>,
 <0.20.0>,<0.21.0>,<0.22.0>,<0.23.0>,<0.24.0>,<0.25.0>,
 <0.26.0>,<0.27.0>,<0.28.0>,<0.29.0>,<0.30.0>,<0.34.0>]
2>
Summary: There seems to be an error in the fallback implementation of ethr_dw_atomic_cmpxchg. I'm not sure whether these binaries would run on an Atom processor, though (and I don't have the means to test it). I guess I could ask on the erlang-bugs mailing list, but I'll leave it to you to decide whether building for a generic processor (instead of Atom) is a viable workaround.
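For readers unfamiliar with the operation named in the backtrace, here is a minimal sketch in C of a 64-bit increment-and-read built on a compare-and-exchange retry loop, which is the shape the frames erts_atomic64_inc_read_nob and ethr_dw_atomic_cmpxchg_nob suggest. This is not the ERTS code: counter_inc_read is a hypothetical name, and the sketch relies on the GCC/Clang __atomic builtins instead of the hand-written inline assembly in the ERTS i386 headers. On 32-bit x86 a 64-bit compare-and-exchange is typically compiled to the cmpxchg8b instruction, which is why a "double word" primitive is involved at all.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, NOT an ERTS function: atomically increment a
 * 64-bit counter and return the new value. */
static uint64_t counter_inc_read(uint64_t *var)
{
    /* Start from a snapshot of the current value. */
    uint64_t expected = __atomic_load_n(var, __ATOMIC_RELAXED);
    uint64_t desired;

    do {
        desired = expected + 1;
        /* If *var still equals expected, store desired and stop.
         * Otherwise the builtin writes the current value of *var
         * into expected and we retry with the fresh snapshot. */
    } while (!__atomic_compare_exchange_n(var, &expected, desired,
                                          false, /* strong CAS */
                                          __ATOMIC_RELAXED,
                                          __ATOMIC_RELAXED));
    return desired;
}
```

Relaxed ordering is used here only as an assumption, on the guess that the _nob suffix in the backtrace means "no barrier". Comparing the code the compiler emits for such a loop under -mtune=atom and -mtune=generic might be one way to cross-check the hand-written variant.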
--- Comment #14 from Peter Lemenkov lemenkov@gmail.com --- Found a way to see the actual stacktrace.
Run erl in GDB as shown above. When you get the SIGSEGV, the stack will be corrupted. First we need to recover it by adding values to, or subtracting them from, the $esp register (stack pointer). I believe those who know Intel assembly already know which values to try first. I tried stepping by 4 in each direction until I realized that I had to add 32. So, please, do:
(gdb) set $pc = *(void **)$esp
(gdb) set $esp = $esp + 32
(gdb) bt
#0  0x568688f0 in erts_proc ()
#1  0x566103ce in ethr_dw_atomic_cmpxchg_nob (xchg=0xf461609c, new=0xf46160a4, var=0x568688f0 <erts_proc+48>) at beam/erl_threads.h:1456
#2  erts_atomic64_inc_read_nob (var=0x568688f0 <erts_proc+48>) at beam/erl_threads.h:1646
#3  step_interval_nob (icp=0x568688f0 <erts_proc+48>) at beam/utils.c:4954
#4  erts_smp_step_interval_nob (icp=icp@entry=0x568688f0 <erts_proc+48>) at beam/utils.c:5004
#5  0x5671572b in ptab_list_bif_engine (c_p=c_p@entry=0xf6d80218, res_accp=res_accp@entry=0xf4616178, mbp=mbp@entry=0xf1f816a0) at beam/erl_ptab.c:927
#6  0x56716a5d in erts_ptab_list (c_p=c_p@entry=0xf6d80218, ptab=0x568688c0 <erts_proc>) at beam/erl_ptab.c:766
#7  0x5661be76 in processes_0 (A__p=0xf6d80218, BIF__ARGS=0xf74861c0) at beam/bif.c:3841
#8  0x5659978b in process_main () at beam/beam_emu.c:3690
#9  0x56638784 in sched_thread_func (vesdp=0xf608e000) at beam/erl_process.c:8021
#10 0x567a19cc in thr_wrapper (vtwd=0xffffd184) at pthread/ethread.c:114
#11 0xf7f184be in start_thread (arg=0xf4616b40) at pthread_create.c:333
#12 0xf7e2c3fe in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:114
(gdb)
See - a cool nice stacktrace! erts_proc is a bogus value. It's a stack corruption after calling ethr_dw_atomic_cmpxchg_nob.
That's all I've got for today.
Peter Lemenkov lemenkov@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                URL|                            |http://bugs.erlang.org/browse/ERL-80
--- Comment #15 from Peter Lemenkov lemenkov@gmail.com --- Possible workaround:
https://github.com/erlang/otp/commit/fd7fa46
--- Comment #16 from Peter Lemenkov lemenkov@gmail.com --- Fixed in Rawhide already. Will do builds for (both affected) F22 and F23 shortly.
--- Comment #17 from Fedora Update System updates@fedoraproject.org --- erlang-17.4-6.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-a79a47efb0
Fedora Update System updates@fedoraproject.org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |MODIFIED
--- Comment #18 from Fedora Update System updates@fedoraproject.org --- erlang-17.4-6.fc22 has been submitted as an update to Fedora 22. https://bodhi.fedoraproject.org/updates/FEDORA-2016-18e2827992
--- Comment #19 from Randy Barlow rbarlow@redhat.com --- Peter, thanks so much for looking into this difficult issue. You are the man!
Fedora Update System updates@fedoraproject.org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|MODIFIED                    |ON_QA
--- Comment #20 from Fedora Update System updates@fedoraproject.org --- erlang-17.4-6.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-18e2827992
--- Comment #21 from Fedora Update System updates@fedoraproject.org --- erlang-17.4-6.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-a79a47efb0
--- Comment #22 from Fedora Update System updates@fedoraproject.org --- erlang-17.4-6.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.
--- Comment #23 from Fedora Update System updates@fedoraproject.org --- erlang-17.4-6.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.
Fedora Update System updates@fedoraproject.org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ON_QA                       |CLOSED
   Fixed In Version|                            |erlang-17.4-6.fc22
         Resolution|---                         |ERRATA
        Last Closed|                            |2016-02-20 21:21:33
Fedora Update System updates@fedoraproject.org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
   Fixed In Version|erlang-17.4-6.fc22          |erlang-17.4-6.fc22
                   |                            |erlang-17.4-6.fc23