Hi, I tried to investigate a rawhide bug: https://bugzilla.redhat.com/show_bug.cgi?id=2253099 which is about Evolution being killed "by something". The thing is, I do not know what killed it, and thus why it was killed. It is not even killed after any particular steps; it is killed "randomly", on various occasions.
I searched the internet, but the answers usually assume the killer is the OOM service, which logs the action either in dmesg or in the journal; in this case there is no sign of who killed it in either of these logs.
The evolution terminal just says:
Killed
and the `dmesg | grep -i kill` only shows:
[ 3.435200] systemd[1]: Listening on systemd-oomd.socket - Userspace Out-Of-Memory (OOM) Killer Socket.
[ 6.051242] rfkill: input handler disabled
[ 22.579276] rfkill: input handler enabled
[ 23.539350] rfkill: input handler disabled
and `journalctl -xb | grep -i kill` doesn't reveal anything useful. The `journalctl -xb | grep -i evolution` has as its last line:
Jan 31 10:49:22 localhost.localdomain rtkit-daemon[826]: Successfully made thread 4820 of process 4640 (/usr/bin/evolution) owned by '1000' RT at priority 5.
which does not explain anything. Grepping for 4640 (evo's PID) didn't show anything either.
That being said, is there a way to figure out what killed the app, please?
Thanks and bye, Milan
P.S.: the above-pasted journalctl line is followed by these three, which look suspicious, but maybe they are unrelated. I do not even know whether they were logged immediately after the app was killed or before it. The terminal itself should not matter, because evo is killed when started from the GUI too. The three log lines follow:
Jan 31 10:49:23 localhost.localdomain xdg-desktop-por[2697]: Realtime error: Could not map pid: Could not determine pid namespace: Could not find instance-id in process's /.flatpak-info
Jan 31 10:50:46 localhost.localdomain gnome-terminal-[2939]: void terminal_screen_shell_preexec(VteTerminal*): assertion '!priv->between_preexec_and_precmd' failed
Jan 31 10:50:49 localhost.localdomain gnome-terminal-[2939]: void terminal_screen_shell_preexec(VteTerminal*): assertion '!priv->between_preexec_and_precmd' failed
On 31/01/2024 10:08, Milan Crha wrote:
I tried to investigate a rawhide bug: https://bugzilla.redhat.com/show_bug.cgi?id=2253099 which is about Evolution being killed "by something". The thing is, I do not know what killed it, and thus why it was killed. It is not even killed after any particular steps; it is killed "randomly", on various occasions.
I searched the internet, but the answers usually assume the killer is the OOM service, which logs the action either in dmesg or in the journal; in this case there is no sign of who killed it in either of these logs.
The evolution terminal just says:
Killed
At the end of the day it means a SIGKILL was sent to the process, and that's not something that is logged anywhere as a matter of course, so you're reliant on whatever sends it saying so.
You're right that OOM is the usual cause so if you've ruled that out you need to think about other things.
The problem is that SIGKILL is deliberately a very hard stop that nothing can trap, so normal things like using strace or gdb to catch who sent it aren't going to work.
Tom
On Wed, Jan 31, 2024 at 11:06:02AM +0000, Tom Hughes via devel wrote:
On 31/01/2024 10:08, Milan Crha wrote:
I tried to investigate a rawhide bug: https://bugzilla.redhat.com/show_bug.cgi?id=2253099 which is about Evolution being killed "by something". The thing is, I do not know what killed it, and thus why it was killed. It is not even killed after any particular steps; it is killed "randomly", on various occasions.
I searched the internet, but the answers usually assume the killer is the OOM service, which logs the action either in dmesg or in the journal; in this case there is no sign of who killed it in either of these logs.
The evolution terminal just says:
Killed
At the end of the day it means a SIGKILL was sent to the process, and that's not something that is logged anywhere as a matter of course, so you're reliant on whatever sends it saying so.
You're right that OOM is the usual cause so if you've ruled that out you need to think about other things.
The problem is that SIGKILL is deliberately a very hard stop that nothing can trap, so normal things like using strace or gdb to catch who sent it aren't going to work.
The audit subsystem is probably the first choice to find out what's killing it. Other than that, systemtap or eBPF scripts can be written to trace this.
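As a sketch of the audit route (rule syntax per auditctl(8); the key name is arbitrary, and note this only sees SIGKILLs sent via the kill() syscall from user space, not kernel-generated ones):

```shell
# Log every kill() syscall whose signal argument (a1) is 9 (SIGKILL); run as root.
# (tgkill/tkill carry the signal in a different argument slot and would need their own rules.)
auditctl -a always,exit -F arch=b64 -S kill -F a1=9 -k sigkill-hunt

# Later, list the matching events, including the sender's pid and executable.
ausearch -k sigkill-hunt -i
```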
With regards, Daniel
V Wed, Jan 31, 2024 at 11:08:16AM +0100, Milan Crha napsal(a):
Hi, I tried to investigate a rawhide bug: https://bugzilla.redhat.com/show_bug.cgi?id=2253099 which is about Evolution being killed "by something". That's the thing, I do not know what killed it, thus even why it had been killed. It's even not killed after certain steps, it's killed "randomly", on various occasions.
I did search the internet, but they usually expect the killer is the OOM service, which logs about the action either in the dmesg or in the journal, but in this case there is no sign about whom killed it in either of these logs.
This procedure works for me: https://stackoverflow.com/questions/2827877/tracing-unix-signal-origins. The tracefs file system has a nice log. An example of /usr/bin/sleep receiving SIGKILL from bash:
root@fedora-40:/sys/kernel/debug/tracing # cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 1/1   #P:4
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
            bash-4867    [002] d..1. 10571.864270: signal_generate: sig=9 errno=0 code=0 comm=sleep pid=4866 grp=1 res=0
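For completeness, the tracefs knobs used to get such a log look roughly like this (run as root; the mount point is /sys/kernel/tracing on current kernels, /sys/kernel/debug/tracing on older ones):

```shell
cd /sys/kernel/tracing
# Record only SIGKILL generation events.
echo 'sig == 9' > events/signal/signal_generate/filter
echo 1 > events/signal/signal_generate/enable
cat trace        # or trace_pipe to stream events as they arrive
```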
-- Petr
On Wed, 2024-01-31 at 12:19 +0100, Petr Pisar wrote:
This procedure works for me https://stackoverflow.com/questions/2827877/tracing-unix-signal-origins. The tracefs file system has a nice log.
Hi, that works pretty well, thank you. Funnily enough, if I read it correctly, it says evo killed itself (the first three bwrap-s are probably unrelated; it could be when I opened and closed a composer, which let WebKitGTK create a new WebKitWebProcess; the last line might be when WebKitGTK cleans up after itself, or maybe all the bwraps are in the end created by WebKitGTK):
bwrap-3756 [002] d..2. 341.677197: signal_generate: sig=9 errno=0 code=0 comm=bwrap pid=3757 grp=1 res=1
bwrap-3753 [000] d..2. 341.677296: signal_generate: sig=9 errno=0 code=0 comm=bwrap pid=3754 grp=1 res=1
bwrap-3856 [002] d..2. 345.231590: signal_generate: sig=9 errno=0 code=0 comm=bwrap pid=3857 grp=1 res=1
evolution-3211 [003] d..1. 355.904404: signal_generate: sig=9 errno=0 code=128 comm=evolution pid=3211 grp=1 res=0
evolution-3211 [003] d..2. 355.904450: signal_generate: sig=9 errno=0 code=0 comm=bwrap pid=3257 grp=1 res=0
In such case, I might be able to catch this in gdb, right? Maybe with a breakpoint on the `kill` function, and which others? These things are very low-level for me, I'm sorry.
Bye, Milan
On Wed, 2024-01-31 at 13:07 +0100, Milan Crha wrote:
In such case, I might be able to catch this in gdb, right? Maybe with a breakpoint in the `kill` function, and any other?
Okay, I tried with the following (more variants, just in case):
gdb evolution \
   --ex "b kill" \
   --ex "b exit" \
   --ex "b __kill" \
   --ex "b _kill" \
   --ex "b _exit" \
   --ex "b pidfd_send_signal" \
   --ex "b tkill if sig==9" \
   --ex r
and it did not catch any of these. The last lines are:
[Thread 0x7fffd92006c0 (LWP 3086) exited]
[New process 3080]
Program terminated with signal SIGKILL, Killed.
The program no longer exists.
(gdb)
which may or may not mean it is sent by a different function, I guess. I cannot tell how much time passed between the kill signal and the "[New process 3080]" line.
Bye, Milan
V Wed, Jan 31, 2024 at 01:07:42PM +0100, Milan Crha napsal(a):
evolution-3211 [003] d..1. 355.904404: signal_generate: sig=9 errno=0 code=128 comm=evolution pid=3211 grp=1 res=0
evolution-3211 [003] d..2. 355.904450: signal_generate: sig=9 errno=0 code=0 comm=bwrap pid=3257 grp=1 res=0
Key information is code=128. That code is probably the si_code member described in sigaction(2). The manual lists a lot of values as constants. The value 128 is SI_KERNEL according to /usr/include/asm-generic/siginfo.h. It is documented as "sent by the kernel from somewhere". It means the first signal does not come from user space. It's genuinely generated by the kernel, e.g. when the process is killed for an out-of-memory reason. However, in that case I would expect the kernel to log the event into dmesg; e.g. the out-of-memory killer is very verbose in dmesg.
The strange thing is the second line. It reads that process 3211 (code=0 means SI_USER, "sent by kill, sigsend, raise") sent SIGKILL to 3257. It's questionable how a process 3211 killed by the first signal can still emit signals. A possible explanation is that the signals are processed asynchronously and evolution manages to dispatch the signal to bwrap before the kernel terminates evolution because of the first signal. Or I misinterpret the log.
-- Petr
Hey,
On 31. Jan 2024, at 16:24, Petr Pisar ppisar@redhat.com wrote:
Key information is code=128. That code is probably the si_code member described in sigaction(2). The manual lists a lot of values as constants. The value 128 is SI_KERNEL according to /usr/include/asm-generic/siginfo.h. It is documented as "sent by the kernel from somewhere". It means the first signal does not come from user space. It's genuinely generated by the kernel, e.g. when the process is killed for an out-of-memory reason. However, in that case I would expect the kernel to log the event into dmesg; e.g. the out-of-memory killer is very verbose in dmesg.
The strange thing is the second line. It reads that process 3211 (code=0 means SI_USER, "sent by kill, sigsend, raise") sent SIGKILL to 3257. It's questionable how a process 3211 killed by the first signal can still emit signals. A possible explanation is that the signals are processed asynchronously and evolution manages to dispatch the signal to bwrap before the kernel terminates evolution because of the first signal. Or I misinterpret the log.
Throwing some ideas out there, is it possible that evolution runs with a seccomp filter or other BPF program configured to kill the process on violation, and that’s what’s happening here?
For software that regularly deals with untrusted input, it doesn’t seem unreasonable that the developers might have implemented something like that, and a seccomp kill might actually look like that.
On Wed, Jan 31 2024 at 04:42:08 PM +01:00:00, Clemens Lang cllang@redhat.com wrote:
Throwing some ideas out there, is it possible that evolution runs with a seccomp filter or other BPF program configured to kill the process on violation, and that’s what’s happening here?
I don't think so. flatpak does use seccomp filters [1], but on violations the syscalls should just fail with EPERM or ENOSYS, not kill the process.
[1] https://github.com/flatpak/flatpak/blob/d5f891e0035e50b24211688a7fa5f61924a2...
On Wed, 31 Jan 2024 at 12:00, Michael Catanzaro mcatanzaro@redhat.com wrote:
On Wed, Jan 31 2024 at 04:42:08 PM +01:00:00, Clemens Lang cllang@redhat.com wrote:
Throwing some ideas out there, is it possible that evolution runs with a seccomp filter or other BPF program configured to kill the process on violation, and that’s what’s happening here?
I don't think so. flatpak does use seccomp filters [1], but on violations the syscalls should just fail with EPERM or ENOSYS, not kill the process.
Could the problem with evolution really be with something else and getting masked? I saw there was another thread about bogofilter killing evolution due to the database needing to change from Berkeley DB to SQLite by running a helper. If the helper is being run but it runs out of memory, would that cause the whole chain to OOM?
[1]
https://github.com/flatpak/flatpak/blob/d5f891e0035e50b24211688a7fa5f61924a2...
--
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue
On Wed, 2024-01-31 at 16:24 +0100, Petr Pisar wrote:
A possible explanation is that the signals are procecessed asychnously and evolution manages to dispatch the signal to bwrap before kernel termites evolution because of the first signal. Or I misinterpret the log.
Hi, there were more lines involving bwrap and WebKitWebProcess, if I recall correctly (I shut the machine down already and did not save the full log, I'm sorry). The above lines were those directly involving the evolution process. I did not want to spam the list with too long a log. Bye, Milan
SIGKILL is almost always sent by systemd-oomd (or the kernel OOM killer). That's the most likely explanation. Theoretically it could also be sent by systemd if a service didn't quit quickly enough following a SIGTERM. Maybe it could also be sent by mutter if a program is unresponsive?
WebKitGTK doesn't use SIGKILL to clean up after itself. The parent process just closes its IPC socket to the child process, then the child should either (a) exit normally, or (b) crash itself, if its watchdog thread detects the main thread has hung.
On Wed, Jan 31 2024 at 11:08:16 AM +01:00:00, Milan Crha mcrha@redhat.com wrote:
Jan 31 10:49:23 localhost.localdomain xdg-desktop-por[2697]: Realtime error: Could not map pid: Could not determine pid namespace: Could not find instance-id in process's /.flatpak-info
This one is https://bugs.webkit.org/show_bug.cgi?id=238403 (fixed recently)
On Wed, 2024-01-31 at 08:31 -0600, Michael Catanzaro wrote:
Maybe it could also be sent by mutter if a program is unresponsive?
Hi, the app is perfectly responsive. I click on a widget and the app is killed immediately. There is no freeze of the app.
WebKitGTK doesn't use SIGKILL to clean up after itself.
Evo itself doesn't use any seccomp or such; these things can be used by WebKitGTK. A quick grep revealed:
https://github.com/WebKit/WebKit/blob/main/Source/WebKit/UIProcess/Launcher/...
but that code does not seem to be called at this time (or my debug code was wrong, it's possible).
There was an update on bug #2253099 that liferea is also affected, thus at least evo is not alone. There is a suspicion that either a glib2 or kernel update caused it, but who knows. Bye, Milan
On Wed, Jan 31 2024 at 06:53:25 PM +01:00:00, Milan Crha mcrha@redhat.com wrote:
Evo itself doesn't use any seccomp or such, these things can be used by the WebKitGTK. A quick grep revealed:
https://github.com/WebKit/WebKit/blob/main/Source/WebKit/UIProcess/Launcher/...
but that code does not seem to be called at this time (or my debug code was wrong, it's possible).
That code is for terminating child processes. It's certainly not supposed to terminate itself.
I think you have two options:
1) Write a plugin for the oom service to capture the state (and log it) before it kills the process. Here's the source code to learn more about that:
https://github.com/facebookincubator/oomd/
2) Or you can completely disable it:
https://www.cjjackson.dev/posts/what-is-systemd-oomd-how-to-disable-it/
The man page which other folks referred to is here (in case you want to tweak its configuration):
https://man7.org/linux/man-pages/man8/systemd-oomd.service.8.html
Hope it helps, ~p
On Wed, Jan 31, 2024 at 1:04 PM Michael Catanzaro mcatanzaro@redhat.com wrote:
On Wed, Jan 31 2024 at 06:53:25 PM +01:00:00, Milan Crha mcrha@redhat.com wrote:
Evo itself doesn't use any seccomp or such, these things can be used by the WebKitGTK. A quick grep revealed:
https://github.com/WebKit/WebKit/blob/main/Source/WebKit/UIProcess/Launcher/...
but that code does not seem to be called at this time (or my debug code was wrong, it's possible).
That code is for terminating child processes. It's certainly not supposed to terminate itself.
On Wed, Jan 31, 2024, at 11:40 AM, Paul Grosu wrote:
https://github.com/facebookincubator/oomd/
- Or you can completely disable it:
https://www.cjjackson.dev/posts/what-is-systemd-oomd-how-to-disable-it/
We need bug reports to get it fixed rather than disabling it, if it's causing things to get clobbered that shouldn't be.
The kernel OOM killer is strictly concerned about kernel survival in dramatically low memory situations. It is unconcerned about user space making progress. Hence user space oom managers.
I admit this is probably suboptimal in some cases. Full resource control is a better way of handling the ambiguity whether a program is incorrectly hogging resources versus simply doing its job. Resource control may not obviate a user space oom killer but I think what we really care about is giving a program all the resources it needs, up to the point that the user wants an interactive desktop - and then the kernel needs to (in effect) preempt the resource hungry processes in favor of user-desktop interactivity. Resource control can do this but we don't have everything wired up yet, more work is needed.
-- Chris Murphy
On Wed, 2024-01-31 at 13:40 -0500, Paul Grosu wrote:
- Or you can completely disable it:
Hi, I do not want to disable the oom service. Remember, it does that on user machines, not only on mine. Telling people "if you want to use apps a, b, c, then disable oom" as a new Fedora 40 feature is not great ;)
Nonetheless, I thought I'd give it a try and trigger the OOM, to see what it does. I wrote a little app to allocate gigabytes of memory (until malloc() returns NULL); one process was not enough, thus I ran a second and then a third, and then I was notified by a gnome-shell bubble telling me "Virtual Terminal Stopped" with a reason that memory is almost full. Looking at the terminals, the second process I ran had been killed, and `dmesg` shows the reason:
[ 1712.960372] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-7f4b1469-38d1-498f-b363-1c8311d2fcff.scope,task=alloc-too-much,pid=3323,uid=1000
[ 1712.960389] Out of memory: Killed process 3323 (alloc-too-much) total-vm:137300925912kB, anon-rss:128kB, file-rss:1280kB, shmem-rss:0kB, UID:1000 pgtables:1544608kB oom_score_adj:200
The journal contains similar information, plus a lot of debugging information from kernel, starting with:
Feb 01 08:45:47 localhost.localdomain kernel: alloc-too-much invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0,.......
The kernel tracing log for sig==9 shows:
gnome-terminal--2924 [002] dN.2. 2520.462889: signal_generate: sig=9 errno=0 code=128 comm=alloc-too-much pid=3502 grp=1 res=0
There is no such thing (apart from the tracing log) when Evolution is suddenly killed; the logs are silent. That makes me believe it's not the OOM killer that kills the application. I'm back at square 1; or maybe square 2, as I know the kernel possibly sends it, but not why.
Bye, Milan
On Thu, 1 Feb 2024 at 09:13, Milan Crha mcrha@redhat.com wrote:
The kernel tracing log for sig==9 shows:
gnome-terminal--2924 [002] dN.2. 2520.462889: signal_generate: sig=9 errno=0 code=128 comm=alloc-too-much pid=3502 grp=1 res=0
There is no such thing (apart from the tracing log) when Evolution is suddenly killed; the logs are silent. That makes me believe it's not the OOM killer that kills the application. I'm back at square 1; or maybe square 2, as I know the kernel possibly sends it, but not why.
Try running `echo stacktrace >/sys/kernel/tracing/trace_options` (as root) and then collect the kernel trace again. That should give you a backtrace of kernel functions from the signal generation, which could help you/us to figure out the reason the process was killed.
On Thu, 2024-02-01 at 09:44 +0100, Ondrej Mosnáček wrote:
On Thu, 1 Feb 2024 at 09:13, Milan Crha mcrha@redhat.com wrote:
The kernel tracing log for sig==9 shows:
gnome-terminal--2924 [002] dN.2. 2520.462889: signal_generate: sig=9 errno=0 code=128 comm=alloc-too-much pid=3502 grp=1 res=0
There is no such thing (apart from the tracing log) when Evolution is suddenly killed; the logs are silent. That makes me believe it's not the OOM killer that kills the application. I'm back at square 1; or maybe square 2, as I know the kernel possibly sends it, but not why.
Try running `echo stacktrace >/sys/kernel/tracing/trace_options` (as root) and then collect the kernel trace again. That should give you a backtrace of kernel functions from the signal generation, which could help you/us to figure out the reason the process was killed.
So, I figured the easiest way to help trigger the kill here is to put load on the machine.
$ stress-ng --cpu -1 --cpu-method all -t 5m --cpu-load 95
then starting evolution seems to do it almost every time shortly after start (I have around 200k messages in the folder it's trying to display).
I've enabled the stacktrace tracing option and, like Milan, set a sig==9 filter. Here is what I got in the trace buffer after it was killed:
# tracer: nop
#
# entries-in-buffer/entries-written: 34/34   #P:16
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
       evolution-9096    [002] d..1.  1207.016489: signal_generate: sig=9 errno=0 code=128 comm=evolution pid=9096 grp=1 res=0
       evolution-9096    [002] d..1.  1207.016495: <stack trace>
 => trace_event_raw_event_signal_generate
 => __send_signal_locked
 => posix_cpu_timers_work
 => task_work_run
 => irqentry_exit_to_user_mode
 => asm_sysvec_apic_timer_interrupt
       evolution-9096    [002] d..2.  1207.016564: signal_generate: sig=9 errno=0 code=0 comm=bwrap pid=9145 grp=1 res=0
       evolution-9096    [002] d..2.  1207.016568: <stack trace>
 => trace_event_raw_event_signal_generate
 => __send_signal_locked
 => do_send_sig_info
 => do_exit
 => do_group_exit
 => get_signal
 => arch_do_signal_or_restart
 => irqentry_exit_to_user_mode
 => asm_sysvec_apic_timer_interrupt
... and 32 other events of bwrap cleaning up.
Does that shed any light? Other than that, it seems to be sending the signal to itself.
-Yanko
On Fri, 2 Feb 2024 at 17:52, Yanko Kaneti yaneti@declera.com wrote:
On Thu, 2024-02-01 at 09:44 +0100, Ondrej Mosnáček wrote:
On Thu, 1 Feb 2024 at 09:13, Milan Crha mcrha@redhat.com wrote:
The kernel tracing log for sig==9 shows:
gnome-terminal--2924 [002] dN.2. 2520.462889: signal_generate: sig=9 errno=0 code=128 comm=alloc-too-much pid=3502 grp=1 res=0
There is no such thing (apart from the tracing log) when Evolution is suddenly killed; the logs are silent. That makes me believe it's not the OOM killer that kills the application. I'm back at square 1; or maybe square 2, as I know the kernel possibly sends it, but not why.
Try running `echo stacktrace >/sys/kernel/tracing/trace_options` (as root) and then collect the kernel trace again. That should give you a backtrace of kernel functions from the signal generation, which could help you/us to figure out the reason the process was killed.
So, I figured the easiest way to help trigger the kill here is to put load on the machine.
$ stress-ng --cpu -1 --cpu-method all -t 5m --cpu-load 95
then starting evolution seems to do it almost every time shortly after start (I have around 200k messages in the folder it's trying to display).
I've enabled the stacktrace tracing option and, like Milan, set a sig==9 filter. Here is what I got in the trace buffer after it was killed:
# tracer: nop
#
# entries-in-buffer/entries-written: 34/34   #P:16
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
       evolution-9096    [002] d..1.  1207.016489: signal_generate: sig=9 errno=0 code=128 comm=evolution pid=9096 grp=1 res=0
       evolution-9096    [002] d..1.  1207.016495: <stack trace>
 => trace_event_raw_event_signal_generate
 => __send_signal_locked
 => posix_cpu_timers_work
 => task_work_run
 => irqentry_exit_to_user_mode
 => asm_sysvec_apic_timer_interrupt
So, browsing through the relevant kernel code, it seems the only cases which could have led to this backtrace are:
1. When a task's RT timeout goes over the RLIMIT_RTTIME hard limit (see function check_thread_timers() in kernel/time/posix-cpu-timers.c).
2. When a task's CPU time goes over the RLIMIT_CPU hard limit (see function check_process_timers() in kernel/time/posix-cpu-timers.c).
I may have missed some code path, but these resource limits should be the next thing to check.
evolution-9096 [002] d..2. 1207.016564: signal_generate: sig=9 errno=0 code=0 comm=bwrap pid=9145 grp=1 res=0
evolution-9096 [002] d..2. 1207.016568: <stack trace>
 => trace_event_raw_event_signal_generate
 => __send_signal_locked
 => do_send_sig_info
 => do_exit
 => do_group_exit
 => get_signal
 => arch_do_signal_or_restart
 => irqentry_exit_to_user_mode
 => asm_sysvec_apic_timer_interrupt
... and 32 other events of bwrap cleaning up.
Does that shed any light? Other than that, it seems to be sending the signal to itself.
-Yanko
On Sat, 2024-02-03 at 13:29 +0100, Ondrej Mosnáček wrote:
On Fri, 2 Feb 2024 at 17:52, Yanko Kaneti yaneti@declera.com wrote:
On Thu, 2024-02-01 at 09:44 +0100, Ondrej Mosnáček wrote:
On Thu, 1 Feb 2024 at 09:13, Milan Crha mcrha@redhat.com wrote:
The kernel tracing log for sig==9 shows:
gnome-terminal--2924 [002] dN.2. 2520.462889: signal_generate: sig=9 errno=0 code=128 comm=alloc-too-much pid=3502 grp=1 res=0
There is no such thing (apart from the tracing log) when Evolution is suddenly killed; the logs are silent. That makes me believe it's not the OOM killer that kills the application. I'm back at square 1; or maybe square 2, as I know the kernel possibly sends it, but not why.
Try running `echo stacktrace >/sys/kernel/tracing/trace_options` (as root) and then collect the kernel trace again. That should give you a backtrace of kernel functions from the signal generation, which could help you/us to figure out the reason the process was killed.
So, I figured the easiest way to help trigger the kill here is to put load on the machine.
$ stress-ng --cpu -1 --cpu-method all -t 5m --cpu-load 95
then starting evolution seems to do it almost every time shortly after start (I have around 200k messages in the folder it's trying to display).
I've enabled the stacktrace tracing option and, like Milan, set a sig==9 filter. Here is what I got in the trace buffer after it was killed:
# tracer: nop
#
# entries-in-buffer/entries-written: 34/34   #P:16
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
       evolution-9096    [002] d..1.  1207.016489: signal_generate: sig=9 errno=0 code=128 comm=evolution pid=9096 grp=1 res=0
       evolution-9096    [002] d..1.  1207.016495: <stack trace>
 => trace_event_raw_event_signal_generate
 => __send_signal_locked
 => posix_cpu_timers_work
 => task_work_run
 => irqentry_exit_to_user_mode
 => asm_sysvec_apic_timer_interrupt
So, browsing through the relevant kernel code, it seems the only cases which could have led to this backtrace are:
1. When a task's RT timeout goes over the RLIMIT_RTTIME hard limit (see function check_thread_timers() in kernel/time/posix-cpu-timers.c).
2. When a task's CPU time goes over the RLIMIT_CPU hard limit (see function check_process_timers() in kernel/time/posix-cpu-timers.c).
I may have missed some code path, but these resource limits should be the next thing to check.
Thanks a lot for diving into it. This is probably it: the kernel killing the main evolution process, which was set to RT priority by some WebKit coincidence.
Running with rtkit-daemon stopped and masked prevents the main evo process being prioritized, and consequently being killed under higher load.
The fact that the kernel kills the (main) thread without a peep anywhere in the logs is strange to me.
- Yanko
On Sun, 2024-02-04 at 14:03 +0200, Yanko Kaneti wrote:
The kernel killing the main evolution process which was set as having rt priority by some webkit coincidence.
Hi, there it is then. Lines like this one:
Jan 31 10:49:22 localhost.localdomain rtkit-daemon[826]: Successfully made thread 4820 of process 4640 (/usr/bin/evolution) owned by '1000' RT at priority 5.
in the journal make more sense now.
Thank you all for the help on this.
Bye, Milan
* Yanko Kaneti:
On Sat, 2024-02-03 at 13:29 +0100, Ondrej Mosnáček wrote:
On Fri, 2 Feb 2024 at 17:52, Yanko Kaneti yaneti@declera.com wrote:
On Thu, 2024-02-01 at 09:44 +0100, Ondrej Mosnáček wrote:
On Thu, 1 Feb 2024 at 09:13, Milan Crha mcrha@redhat.com wrote:
The kernel tracing log for sig==9 shows:
gnome-terminal--2924 [002] dN.2. 2520.462889: signal_generate: sig=9 errno=0 code=128 comm=alloc-too-much pid=3502 grp=1 res=0
There is no such thing (apart from the tracing log) when Evolution is suddenly killed; the logs are silent. That makes me believe it's not the OOM killer that kills the application. I'm back at square 1; or maybe square 2, as I know the kernel possibly sends it, but not why.
Try running `echo stacktrace >/sys/kernel/tracing/trace_options` (as root) and then collect the kernel trace again. That should give you a backtrace of kernel functions from the signal generation, which could help you/us to figure out the reason the process was killed.
So, I figured the easiest way to help trigger the kill here is to put load on the machine.
$ stress-ng --cpu -1 --cpu-method all -t 5m --cpu-load 95
then starting evolution seems to do it almost every time shortly after start (I have around 200k messages in the folder it's trying to display).
I've enabled the stacktrace tracing option and, like Milan, set a sig==9 filter. Here is what I got in the trace buffer after it was killed:
# tracer: nop
#
# entries-in-buffer/entries-written: 34/34   #P:16
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
       evolution-9096    [002] d..1.  1207.016489: signal_generate: sig=9 errno=0 code=128 comm=evolution pid=9096 grp=1 res=0
       evolution-9096    [002] d..1.  1207.016495: <stack trace>
 => trace_event_raw_event_signal_generate
 => __send_signal_locked
 => posix_cpu_timers_work
 => task_work_run
 => irqentry_exit_to_user_mode
 => asm_sysvec_apic_timer_interrupt
So, browsing through the relevant kernel code, it seems the only cases which could have led to this backtrace are:
1. When a task's RT timeout goes over the RLIMIT_RTTIME hard limit (see function check_thread_timers() in kernel/time/posix-cpu-timers.c).
2. When a task's CPU time goes over the RLIMIT_CPU hard limit (see function check_process_timers() in kernel/time/posix-cpu-timers.c).
I may have missed some code path, but these resource limits should be the next thing to check.
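The second case is easy to demonstrate from a shell with the ordinary RLIMIT_CPU limit (a sketch, not the exact rtkit scenario, which involves RLIMIT_RTTIME and needs RT scheduling): give a CPU burner a 1-second soft / 2-second hard CPU limit and ignore SIGXCPU, and the kernel delivers an unlogged SIGKILL at the hard limit, through the same posix-cpu-timers path seen in the backtrace above:

```shell
# Burn CPU under RLIMIT_CPU soft=1s / hard=2s, with SIGXCPU ignored.
# The soft limit's SIGXCPU is swallowed; the hard limit is a plain SIGKILL.
bash -c 'ulimit -H -t 2; ulimit -S -t 1; trap "" XCPU; while :; do :; done'
echo "exit status: $?"    # 137 = 128 + 9, i.e. killed by SIGKILL
```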
Thanks a lot for diving into it. This is probably it: the kernel killing the main Evolution process, which was given RT priority through some WebKit coincidence.
Running with rtkit-daemon stopped and masked prevents the main Evolution process from being prioritized, and consequently from being killed under higher load.
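A quick way to check whether rtkit-daemon has put a process under realtime limits is to read /proc/<pid>/limits; "Max realtime priority" is RLIMIT_RTPRIO and "Max realtime timeout" is RLIMIT_RTTIME in microseconds, with "unlimited" meaning no RT watchdog kill. Shown here against the current shell; substitute Evolution's PID for self:

```shell
# Print the RT-related resource limits in effect for this process.
grep -i realtime /proc/self/limits
```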
The fact that the kernel kills the (main) thread without a peep anywhere in the logs is strange to me.
Interesting analysis, thanks!
I think it's not so unusual for the kernel not to log the SIGKILL. This does not seem to be purely an emergency measure like the OOM killer—it's more like something that an application (set of processes) told the kernel to do to itself. As such, it's not unreasonable for the application to do the logging. On the other hand, I think it's difficult for rtkit-daemon to arrange for that logging because it can't monitor the target process directly.
Any idea why the signal is SIGKILL and not SIGXCPU? To ensure that the process actually terminates promptly, without running signal handlers or generating a coredump? SIGXCPU would point more directly to the cause.
Thanks, Florian
On Mon, 2024-02-05 at 14:31 +0100, Florian Weimer wrote:
Interesting analysis, thanks!
I think it's not so unusual for the kernel not to log the SIGKILL. This does not seem to be purely an emergency measure like the OOM killer—it's more like something that an application (set of processes) told the kernel to do to itself. As such, it's not unreasonable for the application to do the logging. On the other hand, I think it's difficult for rtkit-daemon to arrange for that logging because it can't monitor the target process directly.
Logical, but it leaves users like me scratching their heads at mishaps like these, given that the print-fatal-signals kernel parameter (which I just learned about) is off by default.
Any idea why the signal is SIGKILL and not SIGXCPU? To ensure that the process actually terminates promptly, without running signal handlers or generating a coredump? SIGXCPU would point more directly to the cause.
No idea. I just looked at kernel/time/posix-cpu-timers.c after Ondrej mentioned it. SIGXCPU for soft limits and SIGKILL for hard limits, it seems.
- Yanko
On Mon, 2024-02-05 at 16:20 +0200, Yanko Kaneti wrote:
Logical, but it leaves users like me scratching their heads at mishaps like these, given that the print-fatal-signals kernel parameter (which I just learned about) is off by default.
FWIW, I've been running for the last couple of days with print-fatal-signals=1, and the only additional logs I get are exactly the ones that would've helped in this case:
kernel: RT Watchdog Timeout (hard): evolution[4604]
kernel: RT Watchdog Timeout (hard): liferea[5045]
i.e. RT watchdog kills do seem like out-of-the-ordinary events, and there is a case to be made that logging them should be on by default.
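For anyone wanting the same visibility, the knob can be flipped at runtime and persisted with a sysctl drop-in (the drop-in file name below is an arbitrary choice):

```shell
# Check the current setting (0 = off, the default):
cat /proc/sys/kernel/print-fatal-signals
# Enable until the next boot (as root):
#   sysctl -w kernel.print-fatal-signals=1
# Persist across boots:
#   echo 'kernel.print-fatal-signals = 1' > /etc/sysctl.d/90-print-fatal-signals.conf
```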
-Yanko