https://fedoraproject.org/wiki/Changes/EnableEarlyoom
== Summary == Install the earlyoom package, and enable it by default. This will cause the kernel OOM killer to trigger sooner, but will not affect which process it chooses to kill. The idea is to recover from out-of-memory situations sooner, rather than the typical complete system hang in which the user has no choice but to force a power-off.
== Owner == * Name: [[User:chrismurphy| Chris Murphy]] * Email: bugzilla@colorremedies.com
== Detailed Description == The Workstation working group has discussed "better interactivity in low-memory situations" for some months. In certain use cases, typically compiling, if all RAM and swap are completely consumed, system responsiveness becomes so abysmal that a reasonable user can consider the system "lost" and resorts to forcing a power-off. This is objectively a very bad UX. The broad discussion of this problem, and some ideas for near-term and long-term solutions, is located here:
Recent long discussions on "Better interactivity in low-memory situations"<br> https://pagure.io/fedora-workstation/issue/98<br> https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/...<br>
All Fedora editions and spins have the in-kernel OOM (out-of-memory) killer enabled. Its concern is keeping the kernel itself functioning; it has no concern for user-space function or interactivity. This proposed change attempts to improve the user experience, in the short term, by triggering the in-kernel process-killing mechanism sooner. Instead of the system becoming completely unresponsive for tens of minutes, hours, or days, the expectation is that an offending process (determined by oom_score, same as now) will be killed off within seconds or a few minutes. This is an incremental improvement in user experience, but admittedly still suboptimal. Additional work is ongoing to improve the user experience further.
Workstation working group discussion specific to enabling earlyoom by default https://pagure.io/fedora-workstation/issue/119
Other in-progress solutions:<br> https://gitlab.freedesktop.org/hadess/low-memory-monitor<br>
Background information on this complicated problem:<br> https://www.kernel.org/doc/gorman/html/understand/understand016.html<br> https://lwn.net/Articles/317814/<br>
== Benefit to Fedora ==
There are two major benefits to Fedora:
* improved user experience: the user more quickly regains control over the system, rather than having to force a power-off in low-memory situations where there's aggressive swapping. Once a system becomes unresponsive, it's completely reasonable for the user to assume the system is lost, and forcing power off carries a high potential for data loss.
* reducing forced power-off as the main workaround will increase data collection, improving our understanding of low-memory situations and how to handle them better
== Scope == * Proposal owners:<br> a. Modify {{code|https://pagure.io/fedora-comps/blob/master/f/comps-f32.xml.in}} to include the earlyoom package for Workstation.<br> b. Modify {{code|https://src.fedoraproject.org/rpms/fedora-release/blob/master/f/80-workstati...}} to include:
<pre>
# enable earlyoom by default on workstation
enable earlyoom.service
</pre>
* Other developers: Restricted to Workstation edition, unless other editions/spins want to opt-in.
* Release engineering: [https://pagure.io/releng/issues #9141] (a check of the impact with Release Engineering is needed)
* Policies and guidelines: N/A * Trademark approval: N/A
== Upgrade/compatibility impact == earlyoom.service will be enabled on upgrade. An upgraded system should exhibit the same behavior as a cleanly installed system.
== How To Test == * Fedora 30/31 users can test today, on any edition or spin:<br> {{code|sudo dnf install earlyoom}}<br> {{code|sudo systemctl enable --now earlyoom}}
And then attempt to cause an out-of-memory situation. Examples:<br> {{code|tail /dev/zero}}<br> {{code|https://lkml.org/lkml/2019/8/4/15}}
* Fedora Workstation 32 (and Rawhide) users will see this service is already enabled. It can be toggled with {{code|sudo systemctl start/stop earlyoom}} where start means earlyoom is running, and stop means earlyoom is not running.
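Once enabled, it may help to confirm the service is active and to watch its periodic reports while applying memory pressure in another terminal; a sketch using standard systemd tooling:

```shell
# Confirm earlyoom is running
systemctl status earlyoom.service

# Follow earlyoom's memory reports in the journal (the report interval
# comes from the -r flag in /etc/default/earlyoom)
journalctl -u earlyoom.service -f
```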
== User Experience == The most egregious instances this change is trying to mitigate:
a. RAM is completely used
b. Swap is completely used
c. System becomes unresponsive to the user as swap thrashing has ensued

With earlyoom disabled, the user often gives up and forces a power-off (in my own testing this condition lasts >30 minutes with no kernel-triggered OOM killer and no recovery). With earlyoom enabled, the system likely still becomes unresponsive, but the OOM killer is triggered in much less time (seconds or a few minutes in my testing, once less than 10% RAM and 10% swap remain).
earlyoom starts sending SIGTERM once both memory and swap are below their respective PERCENT setting, default 10%. It sends SIGKILL once both are below their respective KILL_PERCENT setting, default 5%.
The package includes the configuration file /etc/default/earlyoom, which sets option {{code|-r 60}}, causing a memory report to be entered into the journal every minute.
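For reference, a sketch of what such a configuration file might contain. The EARLYOOM_ARGS variable name and the combined PERCENT,KILL_PERCENT flag syntax are assumptions based on earlyoom's documented options; check {{code|man earlyoom}} against the shipped file:

```shell
# /etc/default/earlyoom (sketch, not the verbatim shipped file)
#
# -r 60    log a memory report to the journal every 60 seconds
# -m 10,5  send SIGTERM below 10% available RAM, SIGKILL below 5%
# -s 10,5  same thresholds for available swap
EARLYOOM_ARGS="-r 60 -m 10,5 -s 10,5"
```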
== Dependencies == The earlyoom package has no dependencies.
== Contingency Plan == * Contingency mechanism: Owner will revert all changes * Contingency deadline: Final freeze * Blocks release? No * Blocks product? No
== Documentation == {{code|man earlyoom}}<br><br> https://www.kernel.org/doc/gorman/html/understand/understand016.html
== Release Notes == The earlyoom service is enabled by default, which will cause the kernel oom-killer to trigger sooner. To revert to previous behavior:<br> {{code|sudo systemctl disable earlyoom.service}}
To customize, see {{code|man earlyoom}}.
Ben Cotton bcotton@redhat.com writes:
== Summary == Install earlyoom package, and enable it by default. This will cause the kernel oomkiller to trigger sooner, but will not affect which process it chooses to kill off. The idea is to recover from out of memory situations sooner, rather than the typical complete system hang in which the user has no other choice but to force power off.
The OOM killer is a kernel function. I have no opinion on this proposal as it stands, but I would like it to include an explanation of why this requires a service in userspace to fix.
Thanks, --Robbie
Robbie Harwood rharwood@redhat.com writes:
The OOM killer is a kernel function. I have no opinion on this proposal as it stands, but I would like it to include an explanation of why this requires a service in userspace to fix.
Another thought. Wouldn't some of the pain here be alleviated by setting vm.swappiness=0? Currently it seems to be 60, which results in somewhat aggressive swap use; 1 seems better (minimal swapping without disabling), while 0 will disable it for general use (while preserving it for hibernation). This would at least improve the disk thrashing during OOM situations.
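For anyone who wants to experiment with this suggestion, a sketch using standard sysctl tooling (the drop-in filename is arbitrary):

```shell
# Inspect the current value
sysctl vm.swappiness

# Change it for the running system only (root required)
sudo sysctl vm.swappiness=1

# Persist the change across reboots via a sysctl drop-in
echo 'vm.swappiness = 1' | sudo tee /etc/sysctl.d/99-swappiness.conf
```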
Thanks, --Robbie
On Friday, January 3, 2020 1:51:00 PM MST Robbie Harwood wrote:
Another thought. Wouldn't some of the pain here be alleviated by setting vm.swappiness=0? Currently it seems to be 60, which results in somewhat aggressive swap use; 1 seems better (minimal swapping without disabling), while 0 will disable it for general use (while preserving it for hibernation). This would at least improve the disk thrashing during OOM situations.
To clarify, according to the Workstation group, hibernation isn't even supported.
Regardless, if this Change is accepted, it should probably be done on a per-spin basis. If the GNOME Spin wants this, that's one thing, but I don't believe this would be a good idea on servers.
On Fri, Jan 3, 2020 at 10:14 PM John M. Harris Jr johnmh@splentity.com wrote:
Regardless, if this Change is accepted, it should probably be done on a per-spin basis. If the GNOME Spin wants this, that's one thing, but I don't believe this would be a good idea on servers.
Yes, and if you read the change wiki page mentioned in the announcement email, it's meant only for Workstation: https://fedoraproject.org/wiki/Changes/EnableEarlyoom#Scope
Restricted to Workstation edition, unless other editions/spins want to opt-in.
And to be precise, there is Fedora Workstation (Edition of Fedora) and then there are other Spins. Workstation is not a spin :)
"John M. Harris Jr" johnmh@splentity.com writes:
To clarify, according to the Workstation group, hibernation isn't even supported.
If that's true - and I don't know how I'd check it, so I didn't - we should revisit enabling swap in the default install, and *definitely* should remove the warning for not having it from anaconda.
Thanks, --Robbie
On Mon, Jan 6, 2020 at 12:11 PM Robbie Harwood rharwood@redhat.com wrote:
"John M. Harris Jr" johnmh@splentity.com writes:
On Friday, January 3, 2020 1:51:00 PM MST Robbie Harwood wrote:
Robbie Harwood rharwood@redhat.com writes:
Ben Cotton bcotton@redhat.com writes:
https://fedoraproject.org/wiki/Changes/EnableEarlyoom
== Summary == Install earlyoom package, and enable it by default. This will cause the kernel oomkiller to trigger sooner, but will not affect which process it chooses to kill off. The idea is to recover from out of memory situations sooner, rather than the typical complete system hang in which the user has no other choice but to force power off.
# enable earlyoom by default on workstation enable earlyoom.service </pre>
The OOM killer is a kernel function. I have no opinion on this proposal as it stands, but I would like it to include an explanation of why this requires a service in userspace to fix.
Another thought. Wouldn't some of the pain here be alleviated by setting vm.swappiness=0? Currently it seems to be 60, which results in somewhat aggressive swap use; 1 seems better (minimal swapping without disabling), while 0 will disable it for general use (while preserving it for hibernation). This would at least improve the disk thrashing during OOM situations.
To clarify, according to the Workstation group, hibernation isn't even supported.
If that's true - and I don't know how I'd check it, so I didn't - we should revisit enabling swap in the default install, and *definitely* should remove the warning for not having it from anaconda.
It's not correct that the Workstation working group doesn't want to see it supported; it's a question of whether and to what degree it can be supported, and of making sure users have expectations properly set. I wouldn't want users thinking it'll work because we advertise that it does, and then it eats their data.
Does the hardware support it? Does the hardware properly advertise what it does support? What mechanisms are needed in the kernel and systemd to support it, and what to do when there are bugs that break it? It's not practical for the Fedora kernel team to become responsible for supporting it when it breaks, nor is it practical to block the release on such bugs. The most recent topic I found on this:
Disabling kernel's hibernate support by default, allow re-enabling it with a kernel cmdline option https://lists.fedoraproject.org/archives/list/kernel@lists.fedoraproject.org...
As for swap size options including no swap, and maybe swap-on-ZRAM: https://pagure.io/fedora-workstation/issue/120 https://bugzilla.redhat.com/show_bug.cgi?id=1731978
There are all kinds of useful and necessary discussions to have there (rather than here).
-- Chris Murphy
Chris Murphy lists@colorremedies.com writes:
It's not correct that the Workstation working group doesn't want to see it supported; it's a question of whether and to what degree it can be supported, and of making sure users have expectations properly set. I wouldn't want users thinking it'll work because we advertise that it does, and then it eats their data.
I think enabling it by default very strongly suggests it's supported, regardless of what the intentions are. I have no quarrel with the kernel team in either direction they wish to decide (supported or non), but if it's non-supported, it shouldn't look like it's supported.
As for swap size options including no swap, and maybe swap-on-ZRAM: https://pagure.io/fedora-workstation/issue/120 https://bugzilla.redhat.com/show_bug.cgi?id=1731978
There are all kinds of useful and necessary discussions to have there (rather than here).
The links are appreciated; I was not aware of these discussions and will follow them. However, since we're discussing behavior of the system under heavy load, I think how we handle swap (the thing that makes it slow down when you're low on memory...) is extremely relevant.
Thanks, --Robbie
On Mon, Jan 6, 2020 at 1:14 PM Robbie Harwood rharwood@redhat.com wrote:
The links are appreciated; I was not aware of these discussions and will follow them. However, since we're discussing behavior of the system under heavy load, I think how we handle swap (the thing that makes it slow down when you're low on memory...) is extremely relevant.
It's perhaps the most relevant thing; it's exactly what's to be avoided, because it causes the responsiveness problem in the first place. I just meant that in terms of this feature proposal, there are no swap changes. What to change is elsewhere, and elsewhen. :D
Chris Murphy wrote:
Does the hardware support it? Does the hardware properly advertise what it does support? What mechanisms are needed in the kernel and systemd to support it, and what to do when there are bugs that break it? It's not practical for the Fedora kernel team to become responsible for supporting it when it breaks, nor is it practical to block the release on such bugs.
The biggest issue is that "Secure Boot" (Restricted Boot) mode disables hibernation.
Kevin Kofler
On Fri, Jan 3, 2020 at 1:51 PM Robbie Harwood rharwood@redhat.com wrote:
Another thought. Wouldn't some of the pain here be alleviated by setting vm.swappiness=0?
My sample size is not scientific. But in my testing, I can't tell any difference for the swap-under-pressure case we're testing against. The system is lost in the same amount of time, and still does not recover on its own. Perhaps swappiness matters for the less extreme case of incidental swap usage, which is probably what swap was originally intended for (?), but that's speculation on my part.
The central problem, as I see it: unprivileged applications are *by default* given any memory allocation they request, without any consideration for the health of other processes, even privileged ones.
The kernel oom-killer (with or without earlyoom running), often clobbers things like sshd, systemd-journald, sssd, and even user space programs (Maps, TextEdit, Terminal) that have nothing to do with what's actually eating up CPU and memory. It's pretty atrocious, but I've editorialized plenty about that in the cited threads already.
People smarter than I am are working on more long term solutions for this. This is a bit of a hack, intended to return some sense of control back to the user sooner, so they can save their state (however they define that) and reboot normally, rather than having to force power off. It's unsophisticated and therefore mismatched with a complex problem, but elegant because it's simple, easy to test, easy to remove or disable.
Another plus, only briefly mentioned in the proposal: because the user is far less likely to hard power off, the system journal will have recorded earlyoom's memory report leading up to the OOM, as well as the complete kernel oom-killer output. In many of my tests without earlyoom, forced power-off caused much of this information to be lost, especially what action oom-killer took, assuming it triggered at all; often it didn't, and the system remained wedged and unresponsive for 30+ minutes.
Currently it seems to be 60, which results in somewhat aggressive swap use; 1 seems better (minimal swapping without disabling), while 0 will disable it for general use (while preserving it for hibernation). This would at least improve the disk thrashing during OOM situations.
Sounds right, but in practice I'm not observing this. The two observation perspectives I'm using:
a) GUI responsiveness: ability to drag windows around, scroll in Firefox, type text in textedit, open/save files.
b) remote (ssh) observation with vmstat, iotop, and top
On Friday, January 3, 2020 3:48:50 PM MST Chris Murphy wrote:
There is NO scenario in which hard shutdowns should occur, except battery failure on mobile devices. The state of the system on boot will vary wildly from what you may expect when it is hard powered off. I would suggest using SysRq in such events.
John M. Harris Jr wrote:
There is NO scenario in which hard shutdowns should occur, except battery failure on mobile devices. The state of the system on boot will vary wildly from what you may expect when it is hard powered off. I would suggest using SysRq in such events.
Unfortunately, SysRq does nothing by default on Fedora, for security reasons.
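For completeness, a sketch of how SysRq can be re-enabled for testing. kernel.sysrq is a bitmask, and 1 allows all functions; how far a given distro restricts it by default is a policy choice:

```shell
# See which SysRq functions are currently allowed
cat /proc/sys/kernel/sysrq

# Allow all SysRq functions for this boot (root required)
sudo sysctl kernel.sysrq=1

# Equivalent of Alt+SysRq+s (sync) then Alt+SysRq+b (immediate reboot).
# WARNING: the second line reboots the machine without a clean shutdown.
echo s | sudo tee /proc/sysrq-trigger
echo b | sudo tee /proc/sysrq-trigger
```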
Kevin Kofler
On Fri, Jan 3, 2020 at 3:57 PM John M. Harris Jr johnmh@splentity.com wrote:
There is NO scenario in which hard shutdowns should occur, except battery failure on mobile devices. The state of the system on boot will vary wildly from what you may expect when it is hard powered off. I would suggest using SysRq in such events.
Yes, I know all about sysrq. Everyday ordinary users do not, and I'm not going to teach them about it, because:
a) I already know the outcome: eyes glaze over, then they say "yeah, whatever, I'll just force power off, works fine - except for the data loss"
b) much of the time, I couldn't get to a VT, and sshd was hung or even got killed by oom-killer, so I couldn't use sysrq anyway
c) in the cases where I could issue sysrq+b, responsiveness was so bad it'd take upwards of 15 minutes just to type out the command
So yeah, screw it, I'm pressing the power button
These aren't contrived cases. They are real world. Baremetal and VM reproducible. And unprivileged processes. This is fully discussed in detail in the devel@ thread I reference in the proposal, and I'm not going to repeat myself in this thread.
On Friday, January 3, 2020 4:25:20 PM MST Chris Murphy wrote:
in the cases where I could issue sysrq+b, responsiveness was so bad it'd take upwards of 15 minutes just to type out the command
In that case, I'd suggest waiting the 15 minutes, and then not bogging down your system that badly the next time. This is, really, the best option.
Den lör 4 jan. 2020 kl 01:53 skrev John M. Harris Jr johnmh@splentity.com:
On Friday, January 3, 2020 4:25:20 PM MST Chris Murphy wrote:
in the cases where I could issue sysrq+b, responsiveness was so bad it'd take upwards of 15 minutes just to type out the command
In that case, I'd suggest waiting the 15 minutes, and then not bogging down your system that badly the next time. This is, really, the best option.
*Remembers to be excellent to each other.* Or maybe we should try to make operating systems that actually work under heavy load.
/Andreas
-- John M. Harris, Jr. Splentity
devel mailing list -- devel@lists.fedoraproject.org To unsubscribe send an email to devel-leave@lists.fedoraproject.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
On Friday, January 3, 2020 11:34:13 PM MST Andreas Tunek wrote:
*Remembers to be excellent to each other.* Or maybe we should try to make operating systems that actually work under heavy load.
If we had something that would "actually work under heavy load" (we do, but it doesn't work for some people), then my advice wouldn't be necessary. However, what I've said is for the safety of your installed system. It should only be followed if the integrity of your data is important to you.
On Fri, Jan 3, 2020 at 5:52 pm, John M. Harris Jr johnmh@splentity.com wrote:
In that case, I'd suggest waiting the 15 minutes, and then not bogging down your system that badly the next time. This is, really, the best option.
I'm going to suggest you stop replying in this thread if you're not interested in responding with productive comments.
The user experience requirement here is "desktop should not hang for 15 minutes when under memory pressure." Your comment indicates that it *should* hang, presumably to punish users for using too much memory. This is so absurd that I don't think you're engaging in good-faith discussion anymore.
On Saturday, January 4, 2020 11:16:24 AM MST Michael Catanzaro wrote:
The user experience requirement here is "desktop should not hang for 15 minutes when under memory pressure." Your comment indicates that it *should* hang, presumably to punish users for using too much memory. This is so absurd that I don't think you're engaging in good-faith discussion anymore.
Whether or not it should or should not is irrelevant. I don't see much of an alternative than what seems to be a "hang", honestly. It has nothing to do with something to "punish" users, it's to get the system to a state where you can `sync` and reboot.
On Sat, Jan 4, 2020 at 11:20 AM John M. Harris Jr johnmh@splentity.com wrote:
Whether or not it should or should not is irrelevant. I don't see much of an alternative than what seems to be a "hang", honestly. It has nothing to do with something to "punish" users, it's to get the system to a state where you can `sync` and reboot.
The point of this feature proposal is precisely to get the system into a state where they can save their work and do a proper reboot. It's safer, less esoteric, and more reliable than sysrq+b.
It cannot become the user's burden to know the kernel is still doing something when there's zero feedback and zero control. When will the system recover on its own? An hour? A day? A week? I can tell you that in my test case it was consistently stuck for >30 minutes. I let it go that long, many times, only to demonstrate that it's not a temporary hang, and that users are acting rationally when they force power off.
On Fri, Jan 3, 2020 at 1:35 PM Robbie Harwood rharwood@redhat.com wrote:
The OOM killer is a kernel function.
Yes, this is indicated in the summary.
I have no opinion on this proposal as it stands, but I would like it to include an explanation of why this requires a service in userspace to fix.
The 2nd full paragraph of the "Detailed Description" section describes the intended kernel function. What I don't say there is that kernel folks don't consider the oom-killer behavior broken; it's working exactly as intended. It keeps the kernel alive. Its concern has nothing to do with keeping user space responsive.
On Fri, Jan 3, 2020 at 2:19 PM Ben Cotton bcotton@redhat.com wrote:
== Summary == Install earlyoom package, and enable it by default. This will cause the kernel oomkiller to trigger sooner, but will not affect which process it chooses to kill off. The idea is to recover from out of memory situations sooner, rather than the typical complete system hang in which the user has no other choice but to force power off.
I'd like to see this enabled on all Fedora variants by default. This seems to be generally useful for workstations and servers...
-- 真実はいつも一つ!/ Always, there's only one truth!
On Friday, January 3, 2020, Neal Gompa ngompa13@gmail.com wrote:
On Fri, Jan 3, 2020 at 2:19 PM Ben Cotton bcotton@redhat.com wrote:
https://fedoraproject.org/wiki/Changes/EnableEarlyoom
== Summary == Install earlyoom package, and enable it by default. This will cause the kernel oomkiller to trigger sooner, but will not affect which process it chooses to kill off. The idea is to recover from out of memory situations sooner, rather than the typical complete system hang in which the user has no other choice but to force power off.
I'd like to see this enabled on all Fedora variants by default. This seems to be generally useful for workstations and servers.
The idea might be, but the implementation is not. Using a percentage to decide "almost out of memory" is going to hurt on systems with large amounts of memory, be it a 32 GB desktop or a 2 TB server. You'd have plenty of memory left and it starts killing processes ...
On Fri, Jan 3, 2020 at 3:41 PM drago01 drago01@gmail.com wrote:
On Friday, January 3, 2020, Neal Gompa ngompa13@gmail.com wrote:
On Fri, Jan 3, 2020 at 2:19 PM Ben Cotton bcotton@redhat.com wrote:
https://fedoraproject.org/wiki/Changes/EnableEarlyoom
== Summary == Install earlyoom package, and enable it by default. This will cause the kernel oomkiller to trigger sooner, but will not affect which process it chooses to kill off. The idea is to recover from out of memory situations sooner, rather than the typical complete system hang in which the user has no other choice but to force power off.
I'd like to see this enabled on all Fedora variants by default. This seems to be generally useful for workstations and servers.
The idea might be, but the implementation is not. Using a percentage to decide "almost out of memory" is going to hurt on systems with large amounts of memory, be it a 32 GB desktop or a 2 TB server. You'd have plenty of memory left and it starts killing processes ...
I agree this is the most significant liability of the proposal right now. I mention it in: https://pagure.io/fedora-workstation/issue/119#comment-618480
And I will add this caveat to the proposal, because at the moment I think we need a satisfactory work around in order to proceed with enabling this feature.
drago01 wrote:
The idea might be, but the implementation is not. Using a percentage to decide "almost out of memory" is going to hurt on systems with large amounts of memory, be it a 32 GB desktop or a 2 TB server. You'd have plenty of memory left and it starts killing processes ...
And it will lead to the opposite problem on systems with lots of swap (e.g., my desktop PC that still follows the swap = 2 * RAM recommendation and hence has 16 GiB of RAM and 32 GiB of swap): processes can do a lot of swap thrashing in 32 * (1-10%) = 28.8 GiB of swap and still ruin the system interactivity.
OOM handling really needs to use some interactivity metric, and I think this can almost certainly be done reliably only in the kernel, because userspace processes do not even get to run if the interactivity is already ruined.
Kevin Kofler
On Sat, Jan 4, 2020 at 2:33 AM Vitaly Zaitsev via devel devel@lists.fedoraproject.org wrote:
On 03.01.2020 22:27, Neal Gompa wrote:
and servers...
Admins will be very happy when such user-space killer will kill for example PgSQL database server and cause DB corruption or loss of banking transactions.
This is already happening anyway. The idea is that earlyoom will just do it slightly earlier so we have a responsive system when the failures happen. Unlike a lot of the other options, earlyoom is just doing what the kernel does, just slightly earlier so that the system doesn't become unresponsive. That is *hugely* valuable for sysadmins to be able to recover the systems without power cycling. As a sysadmin myself, I *hate* power cycling servers because it takes forever and it's a lot bigger loss of productivity (and potentially money!).
On Saturday, January 4, 2020, Neal Gompa ngompa13@gmail.com wrote:
On Sat, Jan 4, 2020 at 2:33 AM Vitaly Zaitsev via devel devel@lists.fedoraproject.org wrote:
On 03.01.2020 22:27, Neal Gompa wrote:
and servers...
Admins will be very happy when such user-space killer will kill for example PgSQL database server and cause DB corruption or loss of banking transactions.
This is already happening anyway. The idea is that earlyoom will just do it slightly earlier so we have a responsive system when the failures happen. Unlike a lot of the other options, earlyoom is just doing what the kernel does, just slightly earlier so that the system doesn't become unresponsive.
That is *hugely* valuable for sysadmins to be able to recover the systems without power cycling. As a sysadmin myself, I *hate* power cycling servers because it takes forever and it's a lot bigger loss of productivity (and potentially money!).
Except that slightly earlier is way too early on systems which have lots of memory (see mails from before).
And if a server runs into an OOM situation, your software is either broken (leaking) or you didn't allocate enough resources for your use case.
So the fix is not OOM killing nor power cycling, but to either allocate more memory if it is a VM, or buy more if it is a hardware server (or fix the memory leak in your software).
As for the desktop case, running web browsers in a cgroup to keep them in check would solve most real-world problems - other common desktop apps don't use enough memory to cause such issues (unless your system is really memory constrained, but then the "buy more memory" solution is the better fix).
And btw we should really update the minimum memory requirements in our documentation; the current ones have nothing to do with reality (if you want a pleasant user experience).
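Concretely, the cgroup idea above can be expressed with stock systemd resource control. This is only a sketch: MemoryMax= and MemorySwapMax= are real systemd resource-control settings, but the unit name, path, and limits below are illustrative, not a tested recipe:

```
# Hypothetical drop-in: ~/.config/systemd/user/app-firefox@.service.d/memory.conf
# Caps the browser's cgroup memory; values here are examples only.
[Service]
MemoryMax=4G
MemorySwapMax=2G
```

For a quick one-off test, `systemd-run --user --scope -p MemoryMax=4G firefox` achieves roughly the same thing without a drop-in file.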
Let's keep this desktop-focused, since the proposal does not affect Server edition.
On Sat, Jan 4, 2020 at 12:48 pm, drago01 drago01@gmail.com wrote:
As for the desktop case, running web browsers in a cgroup to keep them in check would solve most real-world problems - other common desktop apps don't use enough memory to cause such issues (unless your system is really memory constrained, but then the "buy more memory" solution is the better fix).
The last time I saw my desktop hang due to a web browser using too much memory was 2015.
The freezes I've encountered in the past five years were all related to software development:
* When compiling large software projects, it's possible to run out of RAM either when building lots of files in parallel, or when linking
* GNOME Builder runs ctags, and ctags likes to use dozens of GB of RAM to index large software projects. I think it sometimes gets into a loop where it just allocates more and more RAM until the desktop dies
On Sat, Jan 04, 2020 at 12:15:20PM -0600, Michael Catanzaro wrote:
Let's keep this desktop-focused, since the proposal does not affect Server edition.
On Sat, Jan 4, 2020 at 12:48 pm, drago01 drago01@gmail.com wrote:
As for the desktop case, running web browsers in a cgroup to keep them in check would solve most real-world problems - other common desktop apps don't use enough memory to cause such issues (unless your system is really memory constrained, but then the "buy more memory" solution is the better fix).
The last time I saw my desktop hang due to a web browser using too much memory was 2015.
just FTR, this happens relatively frequently for me. Some websites seem to cause Firefox to swap itself into nirvana. Sometimes within a short time but sometimes it takes longer. I've come back from a lunch break a few times to a desktop swapping itself to death.
Not yet fully identified but it does happen.
Cheers, Peter
The freezes I've encountered in the past five years were all related to software development:
- When compiling large software projects, it's possible to run out of RAM
either when building lots of files in parallel, or when linking
- GNOME Builder runs ctags, and ctags likes to use dozens of GB of RAM to
index large software projects. I think it sometimes gets into a loop where it just allocates more and more RAM until the desktop dies
devel mailing list -- devel@lists.fedoraproject.org To unsubscribe send an email to devel-leave@lists.fedoraproject.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
I am not using a swap partition at all; the system always hangs when OOM, but sometimes also at just less than 20%
On Tue, Jan 7, 2020 at 8:43 AM Peter Hutterer peter.hutterer@who-t.net wrote:
On Sat, Jan 04, 2020 at 12:15:20PM -0600, Michael Catanzaro wrote:
Let's keep this desktop-focused, since the proposal does not affect Server edition.
On Sat, Jan 4, 2020 at 12:48 pm, drago01 drago01@gmail.com wrote:
As for the desktop case, running web browsers in a cgroup to keep them in check would solve most real-world problems - other common desktop apps don't use enough memory to cause such issues (unless your system is really memory constrained, but then the "buy more memory" solution is the better fix).
The last time I saw my desktop hang due to a web browser using too much memory was 2015.
just FTR, this happens relatively frequently for me. Some websites seem to cause Firefox to swap itself into nirvana. Sometimes within a short time but sometimes it takes longer. I've come back from a lunch break a few times to a desktop swapping itself to death.
Not yet fully identified but it does happen.
Cheers, Peter
The freezes I've encountered in the past five years were all related to software development:
- When compiling large software projects, it's possible to run out of RAM
either when building lots of files in parallel, or when linking
- GNOME Builder runs ctags, and ctags likes to use dozens of GB of RAM to
index large software projects. I think it sometimes gets into a loop where it just allocates more and more RAM until the desktop dies
On Tue, Jan 07, 2020 at 08:57:04AM +0200, Damian Ivanov wrote:
I am not using a swap partition at all; the system always hangs when OOM, but sometimes also at just less than 20%
This might be https://gitlab.gnome.org/GNOME/gnome-shell/issues/1981 or one of the duplicates.
Zbyszek
On Saturday, January 4, 2020 4:48:04 AM MST drago01 wrote:
And btw we should really update the minimum memory requirements in our documentation; the current ones have nothing to do with reality (if you want a pleasant user experience).
That is not necessary, at all. I'm running Fedora on a 2009 Core 2 Duo system with 2 GiB of RAM, and have not had any issues after disabling the compositor in Plasma. Similarly, my daily driver is an X200 Tablet with 4 GiB of RAM. These amounts are more than sufficient for most users, and Firefox has never led my system to an OOM event. The only time I've ever run into an OOM on this system was while compiling some poorly written software (that I wrote, years ago).
On Sat, Jan 4, 2020 at 4:48 AM drago01 drago01@gmail.com wrote:
On Saturday, January 4, 2020, Neal Gompa ngompa13@gmail.com wrote:
On Sat, Jan 4, 2020 at 2:33 AM Vitaly Zaitsev via devel devel@lists.fedoraproject.org wrote:
On 03.01.2020 22:27, Neal Gompa wrote:
and servers...
Admins will be very happy when such user-space killer will kill for example PgSQL database server and cause DB corruption or loss of banking transactions.
This is already happening anyway. The idea is that earlyoom will just do it slightly earlier so we have a responsive system when the failures happen. Unlike a lot of the other options, earlyoom is just doing what the kernel does, just slightly earlier so that the system doesn't become unresponsive.
That is *hugely* valuable for sysadmins to be able to recover the systems without power cycling. As a sysadmin myself, I *hate* power cycling servers because it takes forever and it's a lot bigger loss of productivity (and potentially money!).
Except that slightly earlier is way too early on systems which have lots of memory (see mails from before).
It might be. And it might need to be tweaked. Perhaps 6% for SIGTERM and 3% for SIGKILL. Or even 5% and 2.5%. For sure using a percentage of RAM and swap is too simplistic. But it's easy for users to understand. Something more sophisticated, based on kernel pressure stall information would likely be better, and folks are working on that.
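For clarity, the percentage scheme being debated amounts to something like the following simplified sketch. This is an illustration only: the real earlyoom polls /proc/meminfo in a loop, the 10%/5% SIGTERM/SIGKILL thresholds are assumed defaults as discussed in this thread, and both RAM and swap must be low before it acts:

```python
# Simplified sketch of earlyoom-style percentage thresholds.
# Assumptions: SIGTERM at <= 10% available, SIGKILL at <= 5%,
# and a signal is sent only when BOTH RAM and swap are below the threshold.

SIGTERM_PCT = 10
SIGKILL_PCT = 5

def action(mem_avail, mem_total, swap_free, swap_total):
    """Return the signal to send ("SIGTERM"/"SIGKILL") or None.

    Sizes may be in any consistent unit (e.g. KiB from /proc/meminfo).
    """
    mem_pct = 100 * mem_avail / mem_total
    # No swap device: treat swap as 0% free, i.e. already exhausted.
    swap_pct = 100 * swap_free / swap_total if swap_total else 0
    if mem_pct <= SIGKILL_PCT and swap_pct <= SIGKILL_PCT:
        return "SIGKILL"
    if mem_pct <= SIGTERM_PCT and swap_pct <= SIGTERM_PCT:
        return "SIGTERM"
    return None
```

Note the consequence visible in the sketch: while a swap device still has plenty of free space, nothing is killed even when RAM is nearly exhausted, which matches the no-swap vs. swap observations elsewhere in this thread.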
And if a server runs into an OOM situation, your software is either broken (leaking) or you didn't allocate enough resources for your use case.
So the fix is not OOM killing nor power cycling, but to either allocate more memory if it is a VM, or buy more if it is a hardware server (or fix the memory leak in your software).
That's not a fix either; it's a workaround that papers over the problem. Same as earlyoom, except RAM costs money, and may not be an option due to hardware limitations. A modern operating system needs to know better than to allow unprivileged processes to take down the whole system.
And btw we should really update the minimum memory requirements in our documentation; the current ones have nothing to do with reality (if you want a pleasant user experience).
Can you be more specific?
On getfedora.org it reads: Fedora requires a minimum of 20GB disk, 2GB RAM, to install and run successfully. Double those amounts is recommended.
On Saturday, January 4, 2020 11:31:49 AM MST Chris Murphy wrote:
A modern operating system needs to know better than to allow unprivileged processes to take down the whole system.
Well, you can configure quotas if you really want, but the idea is that it's YOUR COMPUTER, and you should be able to use it however you like. If you want to run software that requires more RAM than your system has, you can do that, and it will run, just not well.
On Sat, Jan 4, 2020 at 7:32 PM Chris Murphy lists@colorremedies.com wrote:
On Sat, Jan 4, 2020 at 4:48 AM drago01 drago01@gmail.com wrote:
On Saturday, January 4, 2020, Neal Gompa ngompa13@gmail.com wrote:
On Sat, Jan 4, 2020 at 2:33 AM Vitaly Zaitsev via devel devel@lists.fedoraproject.org wrote:
On 03.01.2020 22:27, Neal Gompa wrote:
and servers...
Admins will be very happy when such user-space killer will kill for example PgSQL database server and cause DB corruption or loss of banking transactions.
This is already happening anyway. The idea is that earlyoom will just do it slightly earlier so we have a responsive system when the failures happen. Unlike a lot of the other options, earlyoom is just doing what the kernel does, just slightly earlier so that the system doesn't become unresponsive.
That is *hugely* valuable for sysadmins to be able to recover the systems without power cycling. As a sysadmin myself, I *hate* power cycling servers because it takes forever and it's a lot bigger loss of productivity (and potentially money!).
Except that slightly earlier is way too early on systems which have lots of memory (see mails from before).
It might be. And it might need to be tweaked. Perhaps 6% for SIGTERM and 3% for SIGKILL. Or even 5% and 2.5%. For sure using a percentage of RAM and swap is too simplistic. But it's easy for users to understand. Something more sophisticated, based on kernel pressure stall information would likely be better, and folks are working on that.
Yes, that would be a way better metric than a percent value, which is either too close to full RAM or too early if you have lots of RAM. 6% of 4 GB is ~246 MB, while for 32 GB it's almost 2 GB - killing processes while you have 2 GB left is just wasteful.
And if a server runs into an OOM situation, your software is either broken (leaking) or you didn't allocate enough resources for your use case.
So the fix is not OOM killing nor power cycling, but to either allocate more memory if it is a VM, or buy more if it is a hardware server (or fix the memory leak in your software).
That's not a fix either; it's a workaround that papers over the problem. Same as earlyoom, except RAM costs money, and may not be an option due to hardware limitations. A modern operating system needs to know better than to allow unprivileged processes to take down the whole system.
I think you misunderstood me. Yes, the OS should behave better than this, but if you are running a server you don't want your DB or web server to be unreachable because the system ran out of memory - the only way to "fix" that is to provide enough resources. No amount of OOM killing would help you here. The system may be up, but not the server process the machine is running for ...
And btw we should really update the minimum memory requirements in our documentation; the current ones have nothing to do with reality (if you want a pleasant user experience).
Can you be more specific?
On getfedora.org it reads: Fedora requires a minimum of 20GB disk, 2GB RAM, to install and run successfully. Double those amounts is recommended.
I simply do not think 2 GB is sufficient; the "recommended double", i.e. 4 GB, should be the "required", and drop the double part altogether. A modern desktop with apps on top will not run well enough on 2 GB; let's stop pretending it does. But anyway, that's off topic as it is not part of the proposal.
On Sat, Jan 4, 2020 at 2:30 PM drago01 drago01@gmail.com wrote:
On Sat, Jan 4, 2020 at 7:32 PM Chris Murphy lists@colorremedies.com wrote:
It might be. And it might need to be tweaked. Perhaps 6% for SIGTERM and 3% for SIGKILL. Or even 5% and 2.5%. For sure using a percentage of RAM and swap is too simplistic. But it's easy for users to understand. Something more sophisticated, based on kernel pressure stall information would likely be better, and folks are working on that.
Yes, that would be a way better metric than a percent value, which is either too close to full RAM or too early if you have lots of RAM. 6% of 4 GB is ~246 MB, while for 32 GB it's almost 2 GB - killing processes while you have 2 GB left is just wasteful.
If there's a swap device, that won't happen. The case where SIGTERM really happens at 10% RAM free is when there's no swap device. And even though the no-swap configuration is not a default, and the installer explicitly recommends against it right now (as in, if you try to do such an installation, it warns you), it is a configuration we allow, and I happen to know it's somewhat common among developers with lots of RAM, expressly because swap thrashing, even to SSD, results in such poor UX.
Consider the following 'vmstat 10' while doing a compile:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b    swpd    free  buff  cache    si    so    bi    bo    in    cs us sy id wa st
6 11 4168060 1821580 40 736604 30234 10841 46533 13805 19230 29799 74 12 1 13 0
At this time, the GUI was completely unresponsive, not even the mouse arrow moves, for about 1 minute. Seemingly plenty of RAM and swap, and idle CPU. But rather heavy swap in and out.
10  9 4459648  200912    40 569260 11218 18856 28846 19997 15164 35256 28  9  9 53  0
 6  8 4207328  807092    40 636156 26205 16744 35472 18287 20179 34087 62 12  3 23  0
At these two lines, the mouse arrow is stuttering, the GUI is very sluggish, even unresponsive much of the time.
Jan 04 15:37:18 fmac.local earlyoom[4896]: mem avail: 1212 of 7865 MiB (15 %), swap free: 4807 of 8195 MiB (58 %)
Near the same time. The system is nowhere near either RAM or swap exhaustion. But swap si/so are high. This is an SSD, BTW.
Can I get to the compile and force quit? Eventually, it would take a couple minutes. But good progress is being made with the compile during this whole time.
earlyoom doesn't SIGTERM this compile until 20 minutes of this behavior. With default settings. So it really isn't solving the sluggish, stuttering problem. But what does happen, is it SIGTERMs the compile before the system gets to a state where essentially all of the work is only swap in and swap out, and no other work is being done.
Here is the output (2 week expiration) https://pastebin.com/0iZHNjg7
Retest with no swap at all, and yes, compile gets a SIGTERM when free memory gets to 10% (because swap is already considered to be 0% free, since it doesn't exist). But also? The system isn't under any swap io duress. The system is completely responsive throughout.
This is why we see developers giving up on swap partitions entirely. swap-on-ZRAM might be a compromise. That's related issue #120.
That's not a fix either; it's a workaround that papers over the problem. Same as earlyoom, except RAM costs money, and may not be an option due to hardware limitations. A modern operating system needs to know better than to allow unprivileged processes to take down the whole system.
I think you misunderstood me. Yes, the OS should behave better than this, but if you are running a server you don't want your DB or web server to be unreachable because the system ran out of memory - the only way to "fix" that is to provide enough resources. No amount of OOM killing would help you here. The system may be up, but not the server process the machine is running for ...
Perhaps, but two points:
a. this feature is for Workstation. If the Server working group wants to give it a go, that's up to them. But they may prefer experimenting with more server-oriented user-space oom daemons like recent versions of oomd. And for that use case, Facebook (and others) have investigated this and find that avoiding OOM, even by process killing, is far less bad than the system hanging itself. As in, better for recovery and better for limited sysadmin resources. There's a video about it from the recent All Systems Go conference.
b. earlyoom does SIGTERM first. I have yet to see a single process (in hundreds of tests, which is really nothing, and also not a scientific sample) that doesn't respond to SIGTERM, such that SIGKILL is needed.
And btw we should really update the minimum memory requirements in our documentation; the current ones have nothing to do with reality (if you want a pleasant user experience).
Can you be more specific?
On getfedora.org it reads: Fedora requires a minimum of 20GB disk, 2GB RAM, to install and run successfully. Double those amounts is recommended.
I simply do not think 2 GB is sufficient; the "recommended double", i.e. 4 GB, should be the "required", and drop the double part altogether. A modern desktop with apps on top will not run well enough on 2 GB; let's stop pretending it does. But anyway, that's off topic as it is not part of the proposal.
Workstation working group recently bumped this from 1 GB minimum, 2 GB recommended. We're considering VMs with these numbers. As a comparative point of reference, Windows 10 64-bit is also 2 GB minimum.
On Saturday, January 4, 2020 2:29:11 PM MST drago01 wrote:
A modern desktop with apps on top will not run well enough on 2 GB; let's stop pretending it does.
This is simply not the case. It may be for GNOME, but I haven't tested that. It definitely is not the case for Plasma.
John M. Harris Jr wrote:
This is simply not the case. It may be for GNOME, but I haven't tested that. It definitely is not the case for Plasma.
… unless you want to run KMail/Akonadi on it. :-)
But yes, Plasma itself works fine with 2 GiB (I haven't actually tested with less than 4 GiB, but you wrote you did and I believe you there), most applications should work too, and if you need an e-mail client, you can run a lightweight one such as Trojitá (Qt-based, fast, and requires less than 50 MiB exclusive memory here, with over 50000 mails in my IMAP inbox – if I start scrolling through the inbox, it loads old metadata on demand, growing the memory usage to still less than 100 MiB).
And my Core 2 Duo with 4 GiB RAM definitely works fine with Plasma (desktop environment), Falkon (web browser), Trojitá (e-mail client), Krusader (file manager), etc.
Kevin Kofler
I think this would be a really big improvement for workstation and other desktop spins; the handling of out-of-memory situations has been a consistent pain point on Linux. However, may I ask why EarlyOOM was chosen over something like NoHang [1]? I am a bit concerned that EarlyOOM's heuristics may be too coarse, as it does not take into account the newly-added PSI metrics [2][3] that other projects like NoHang, oomd, and low-memory-monitor utilize. For example, if the system is thrashing, but swap is not full, to my knowledge EarlyOOM will not see a problem; however, it would be visible via PSI.
To be clear, I'd rather have something in time for 32 to improve OOM handling than wait several release cycles for the ideal solution to be ready. I'm simply curious about what problems, if any, were encountered with the other potential candidates.
[1] https://github.com/hakavlad/nohang [2] https://facebookmicrosites.github.io/psi/docs/overview [3] https://www.kernel.org/doc/html/latest/accounting/psi.html
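For anyone curious what those PSI metrics actually look like, here is a minimal sketch of reading them. It assumes Linux >= 4.20 with PSI enabled; the parsing follows the documented /proc/pressure/memory format ("some" = at least one task stalled on memory, "full" = all non-idle tasks stalled):

```python
# Sketch: parse the PSI memory metrics from /proc/pressure/memory,
# whose lines look like:
#   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
#   full avg10=0.00 avg60=0.00 avg300=0.00 total=0

def parse_psi(text):
    """Return {"some": {...}, "full": {...}} mapping metric names to floats."""
    metrics = {}
    for line in text.splitlines():
        kind, *fields = line.split()
        metrics[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return metrics

def memory_pressure():
    """Read the live metrics; returns None if PSI is unavailable on this kernel."""
    try:
        with open("/proc/pressure/memory") as f:
            return parse_psi(f.read())
    except OSError:
        return None
```

A monitor like nohang or oomd can then act when, say, the "full" avg10 value stays above some threshold, rather than waiting for free-memory percentages to hit bottom.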
On Fri, Jan 3, 2020 at 4:13 PM Tom Seewald tseewald@gmail.com wrote:
I think this would be a really big improvement for workstation and other desktop spins; the handling of out-of-memory situations has been a consistent pain point on Linux. However, may I ask why EarlyOOM was chosen over something like NoHang [1]?
The developer/maintainer of nohang (hakavlad) recommended earlyoom.
I am a bit concerned that EarlyOOM's heuristics may be too coarse, as it does not take into account the newly-added PSI metrics [2][3] that other projects like NoHang, oomd, and low-memory-monitor utilize. For example, if the system is thrashing, but swap is not full, to my knowledge EarlyOOM will not see a problem, however it would be visible via PSI.
Yep, I wonder about this myself and mention it in the two issues we're tracking. It's likely true that there will be workloads where earlyoom doesn't help, but equally important is it doesn't make things worse (or really weird). It's a complicated problem, and part of the challenge is how to go about making an incremental improvement while avoiding regressions.
We also have another issue, #120, to look at making swap smaller, since oversized swap absolutely exacerbates the swap thrashing problem. Once a system is swap thrashing, responsiveness is already cratering.
To be clear, I'd rather have something in time for 32 to improve OOM handling than wait several release cycles for the ideal solution to be ready. I'm simply curious about what problems, if any, were encountered with the other potential candidates.
Those discussions you'll find in #98, including oomd and low-memory-monitor. https://pagure.io/fedora-workstation/issue/98#comment-590690
On Fri, Jan 3, 2020 at 11:12 pm, Tom Seewald tseewald@gmail.com wrote:
I think this would be a really big improvement for workstation and other desktop spins; the handling of out-of-memory situations has been a consistent pain point on Linux. However, may I ask why EarlyOOM was chosen over something like NoHang [1]? I am a bit concerned that EarlyOOM's heuristics may be too coarse, as it does not take into account the newly-added PSI metrics [2][3] that other projects like NoHang, oomd, and low-memory-monitor utilize. For example, if the system is thrashing, but swap is not full, to my knowledge EarlyOOM will not see a problem; however, it would be visible via PSI.
We've been working closely with Alexey, the maintainer of nohang, on this proposal. He has recommended using either earlyoom or nohang as the two best choices over other available options (e.g. oomd, or low-memory-monitor). I'm not completely certain why earlyoom was chosen over nohang, but I think simplicity and code maturity were likely important considerations in the final choice.
nohang has experimented with PSI, but it actually isn't using PSI metrics by default because they've proven to be less effective than hoped for. In theory, using an interactivity measure like PSI should provide for the best results, but in practice it just hasn't worked out well.
In our experiments, low-memory-monitor is currently significantly worse at handling OOM conditions (as has been noted elsewhere in this thread). Although we're likely to enable low-memory-monitor in Workstation, we would use it only for advisory memory pressure notifications (GMemoryMonitor).
Michael
Michael Catanzaro wrote:
nohang has experimented with PSI, but it actually isn't using PSI metrics by default because they've proven to be less effective than hoped for. In theory, using an interactivity measure like PSI should provide for the best results, but in practice it just hasn't worked out well.
I think this really needs to be handled entirely in the kernel to be effective, because if the interactivity is already down the drain, your userspace PSI monitor will not get to run at all in a reasonable timeframe.
I think that to ensure interactivity, the kernel needs to synchronously check the interactivity metrics each and every time it gets a swap-in request, and fail the request (and kill the process, most likely) if the requesting process is known to hurt interactivity too much with its previous requests. Anything asynchronous will just not work, because asynchronous event handlers stop working when the interactivity is too poor.
Kevin Kofler
On 03.01.2020 20:18, Ben Cotton wrote:
Workstation working group has discussed "better interactivity in low-memory situations" for some months. In certain use cases, typically compiling, if all RAM and swap are completely consumed, system responsiveness becomes so abysmal that a reasonable user can consider the system "lost", and resorts to forcing a power off. This is objectively a very bad UX.
I'm strongly against adding of any user-space OOM killers to Fedora default images. Users should explicitly enable them only when needed.
1. Such applications run with super-user privileges and have full access to all private memory of all processes and sensitive user data. This is a huge security risk.
2. It can easily kill KDE/Gnome shell or VM hypervisor, which can cause data loss.
3. Some implementations are killing all processes with the same name[1] and their developers think that this is a feature.
4. Super-user daemons should not touch user space at all. A real user-space OOM killer should run with the privileges of the same user, using systemd user units.
[1]: https://gitlab.freedesktop.org/hadess/low-memory-monitor/issues/8
On Sat, 04/01/2020 at 08:27 +0100, Vitaly Zaitsev via devel wrote:
I'm strongly against adding of any user-space OOM killers to Fedora default images. Users should explicitly enable them only when needed.
Just my 2 cents: I tested early versions of earlyoom and had a weird experience with it: it was killing not Chromium or Chromium's processes, but tiny processes which it probably shouldn't have. I guess it could easily kill the dnf process as well.
I am skeptical too about enabling such things by default, but at the same time it would be nice to test this massively.
On Sat, Jan 4, 2020 at 12:45 AM ego.cordatus@gmail.com wrote:
On Sat, 04/01/2020 at 08:27 +0100, Vitaly Zaitsev via devel wrote:
I'm strongly against adding of any user-space OOM killers to Fedora default images. Users should explicitly enable them only when needed.
Just my 2 cents: I tested early versions of earlyoom and had a weird experience with it: it was killing not Chromium or Chromium's processes, but tiny processes which it probably shouldn't have. I guess it could easily kill the dnf process as well.
I am skeptical too about enabling such things by default, but at the same time it would be nice to test this massively.
earlyoom uses oom_score to determine the victim process to SIGTERM and SIGKILL, the same metric used by the kernel oom-killer. I too have seen the kernel oom-killer inexplicably invoked on processes that should not be targets: sssd, sshd, and even once systemd-journald. This is very weird, and I don't have an explanation for why any process with a score of 0 is getting killed before the dozens of processes with a much higher score, and yet I've seen it. It's suspicious.
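As a sketch of that selection rule (not earlyoom's actual source): both the kernel and earlyoom rank candidates by oom_score, which the kernel exposes per process at /proc/&lt;pid&gt;/oom_score.

```python
# Sketch of the victim-selection rule shared by the kernel oom-killer
# and earlyoom: pick the process with the highest oom_score.
import os

def read_oom_scores(proc="/proc"):
    """Map pid -> oom_score for all running processes (Linux only)."""
    scores = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(f"{proc}/{entry}/oom_score") as f:
                scores[int(entry)] = int(f.read())
        except OSError:
            continue  # process exited between listdir() and open()
    return scores

def pick_victim(scores):
    """Return the pid with the highest oom_score, or None if empty."""
    return max(scores, key=scores.get) if scores else None

# e.g. pick_victim({27421: 305, 1: 0, 4896: 2}) -> 27421
```

This is only the ranking step; the real earlyoom also applies its --avoid/--prefer adjustments before choosing.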
The nice thing about earlyoom, even though it's a hammer? It's a small hammer. It's not going to go on a wrecking ball spree. It can, and likely will, be backed out as other solutions become more useful. And the documentation reflects its oversimplification of a complex problem.
Seems like this bug:
**Kills multiple processes at once** https://github.com/rfjakob/earlyoom/issues/121
but according to github it's fixed now.
On Fri, 3 Jan 2020, 20:19 Ben Cotton, bcotton@redhat.com wrote:
https://fedoraproject.org/wiki/Changes/EnableEarlyoom
== Summary == Install earlyoom package, and enable it by default. This will cause the kernel oomkiller to trigger sooner, but will not affect which process it chooses to kill off. The idea is to recover from out of memory situations sooner, rather than the typical complete system hang in which the user has no other choice but to force power off.
Since in the Change we are not introducing just the earlyoom tool but enabling it with a specific profile, I would add those details here. Something like:
"The earlyoom service will choose the offending process based on the same oom_score the kernel uses. It will send a SIGTERM signal at 10% of RAM left, and SIGKILL at 5%."
== Owner ==
- Name: [[User:chrismurphy| Chris Murphy]]
- Email: bugzilla@colorremedies.com
== Detailed Description == The Workstation working group has discussed "better interactivity in low-memory situations" for some months. In certain use cases, typically compiling, if all RAM and swap are completely consumed, system responsiveness becomes so abysmal that a reasonable user can consider the system "lost" and resorts to forcing a power off. This is objectively a very bad UX. The broad discussion of this problem, and some ideas for near term and long term solutions, is located here:
Recent long discussions on "Better interactivity in low-memory situations"<br> https://pagure.io/fedora-workstation/issue/98<br> https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/...<br>
Fedora editions and spins have the in-kernel OOM (out-of-memory) manager enabled. The manager's concern is keeping the kernel itself functioning. It has no concern for user space function or interactivity. This proposed change attempts to improve the user experience, in the short term, by triggering the in-kernel process killing mechanism sooner. Instead of the system becoming completely unresponsive for tens of minutes, hours, or days, the expectation is that an offending process (determined by oom_score, same as now) will be killed off within seconds or a few minutes. This is an incremental improvement in user experience, but admittedly still suboptimal. There is additional work ongoing to improve the user experience further.
Workstation working group discussion specific to enabling earlyoom by default https://pagure.io/fedora-workstation/issue/119
Other in-progress solutions:<br> https://gitlab.freedesktop.org/hadess/low-memory-monitor<br>
Background information on this complicated problem:<br> https://www.kernel.org/doc/gorman/html/understand/understand016.html<br> https://lwn.net/Articles/317814/<br>
== Benefit to Fedora ==
There are two major benefits to Fedora:
- improved user experience by more quickly regaining control over one's system, rather than having to force power off in low-memory situations where there's aggressive swapping. Once a system becomes unresponsive, it's completely reasonable for the user to assume the system is lost, but that includes a high potential for data loss.
- reducing forced poweroff as the main workaround will increase data collection, improving understanding of low-memory situations and how to handle them better
As I understand it, in the current setup we are looking more for a controlled failure scenario than for a solution.
Can we get a specific manual on what users are supposed to do once they trigger earlyoom? Does earlyoom help in reporting? Which logs do we need to look at?
Maybe add a section in the UX part of the change, or set up a dedicated wiki page?
== Scope ==
- Proposal owners:
a. Modify {{code|https://pagure.io/fedora-comps/blob/master/f/comps-f32.xml.in}} to include the earlyoom package for Workstation.<br> b. Modify {{code| https://src.fedoraproject.org/rpms/fedora-release/blob/master/f/80-workstati... }} to include:
<pre>
# enable earlyoom by default on workstation
enable earlyoom.service
</pre>
- Other developers:
Restricted to Workstation edition, unless other editions/spins want to opt-in.
- Release engineering: [https://pagure.io/releng/issues #9141] (an impact check with Release Engineering is needed)
- Policies and guidelines: N/A
- Trademark approval: N/A
== Upgrade/compatibility impact == earlyoom.service will be enabled on upgrade. An upgraded system should exhibit the same behaviors as a clean installed system.
== How To Test ==
- Fedora 30/31 users can test today, any edition or spin:<br>
{{code|sudo dnf install earlyoom}}<br> {{code|sudo systemctl enable --now earlyoom}}
And then attempt to cause an out of memory situation. Examples:<br> {{code|tail /dev/zero}}<br> {{code|https://lkml.org/lkml/2019/8/4/15}}
- Fedora Workstation 32 (and Rawhide) users will see this service is already enabled. It can be toggled with {{code|sudo systemctl start/stop earlyoom}}, where start means earlyoom is running and stop means earlyoom is not running.
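As a gentler alternative to {{code|tail /dev/zero}}, a small bounded allocator (a hypothetical helper, not part of the earlyoom package) makes it possible to approach the thresholds deliberately. Raise the cap toward your RAM size only on a machine you are willing to thrash:

```python
# Allocate memory in fixed-size chunks up to a hard cap, then hold it.
# With a cap near your RAM size this approximates the low-memory state
# described in the test instructions; the small default keeps an
# accidental run harmless.
import time

def hog(cap_mib=64, chunk_mib=16, hold_seconds=0):
    """Allocate up to cap_mib MiB in chunk_mib pieces; return the chunks."""
    chunks = []
    while len(chunks) * chunk_mib < cap_mib:
        # bytearray gives a writable buffer of really allocated bytes
        chunks.append(bytearray(chunk_mib * 1024 * 1024))
    time.sleep(hold_seconds)
    return chunks

if __name__ == "__main__":
    held = hog(cap_mib=64, chunk_mib=16)
    print(f"holding {len(held) * 16} MiB")
```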
== User Experience == The most egregious instance this change is trying to mitigate:
a. RAM is completely used
b. Swap is completely used
c. System becomes unresponsive to the user as swap thrashing has ensued
With earlyoom disabled, the user often gives up and forces power off (in my own testing this condition lasts >30 minutes, with no kernel-triggered oom killer and no recovery).
With earlyoom enabled, the system likely still becomes unresponsive, but the oom killer is triggered in much less time (seconds or a few minutes in my testing, once less than 10% RAM and 10% swap remain).
earlyoom starts sending SIGTERM once both memory and swap are below their respective PERCENT setting, default 10%. It sends SIGKILL once both are below their respective KILL_PERCENT setting, default 5%.
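The rule above can be sketched as a tiny function. This is a simplification assuming one shared threshold pair for memory and swap; the real earlyoom reads /proc/meminfo and allows the memory and swap percentages to be set separately (see man earlyoom):

```python
# Simplified sketch of earlyoom's trigger rule (not its real source):
# act only when BOTH free memory and free swap fall below a threshold.
SIGTERM_PERCENT = 10.0   # package default
SIGKILL_PERCENT = 5.0    # package default

def action(mem_free_pct, swap_free_pct):
    """Return 'SIGKILL', 'SIGTERM', or None for the given free percentages."""
    if mem_free_pct <= SIGKILL_PERCENT and swap_free_pct <= SIGKILL_PERCENT:
        return "SIGKILL"
    if mem_free_pct <= SIGTERM_PERCENT and swap_free_pct <= SIGTERM_PERCENT:
        return "SIGTERM"
    return None
```

Note the "both" condition: with plenty of swap still free, nothing happens even when RAM is nearly full. On a system with no swap device the swap side of the condition is trivially satisfied, which is why the thresholds may need tuning there, as discussed elsewhere in this thread.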
The package includes configuration file /etc/default/earlyoom which sets option {{code|-r 60}} causing a memory report to be entered into the journal every minute.
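For reference, the packaged {{code|/etc/default/earlyoom}} is a short environment file along these lines (contents hedged from memory of the Fedora package; check the installed file, as exact contents may differ between versions):

```
# /etc/default/earlyoom
# -r 60: log a memory report to the journal every 60 seconds
EARLYOOM_ARGS="-r 60"
```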
== Dependencies == The earlyoom package has no dependencies.
== Contingency Plan ==
- Contingency mechanism: Owner will revert all changes
- Contingency deadline: Final freeze
- Blocks release? No
- Blocks product? No
== Documentation == {{code|man earlyoom}}<br><br> https://www.kernel.org/doc/gorman/html/understand/understand016.html
== Release Notes == Earlyoom service is enabled by default, which will cause kernel oom-killer to trigger sooner. To revert to previous behavior:<br> {{code|sudo systemctl disable earlyoom.service}}
And to customize see {{code|man earlyoom}}.
Additionally, there was a question during the chat discussion: how will the earlyoom setup work together with OOMPolicy and any other related options of systemd units? Will systemd recognize the OOM event?
On Sat, Jan 4, 2020 at 2:51 AM Aleksandra Fedorova alpha@bookwar.info wrote:
Since in the Change we are not introducing just the earlyoom tool but enable it with a specific profile I would add those details here. Smth like:
"earlyoom service will choose the offending process based on the same oom_score as kernel uses. It will send a SIGTERM signal on 10% of RAM left, and SIGKILL on 5%"
I added this information to the summary. Also, I think these numbers may need to change to avoid prematurely sending SIGTERM when the system has no swap device.
As I understand in the current setup we are looking more for a controlled failure scenario rather than for a solution.
Yes, it's fair to say this proposal is to make things "less bad". It doesn't improve system responsiveness. Once heavy swap starts, the system is sluggish, stutters, and briefly stalls. This proposal doesn't fix that. There is a lot of room for improvement.
Can we get a specific manual, what users supposed to do, once they trigger the earlyoom? Does earlyoom help in reporting? Which logs we need to look at?
Maybe add a section in UX part of the change, or setup a dedicated wiki page?
The user shouldn't need to do anything differently than if the kernel oom-killer had triggered. The system journal will contain messages showing what was killed and why:
Jan 04 16:05:42 fmac.local earlyoom[4896]: low memory! at or below SIGTERM limits: mem 10 %, swap 10 %
Jan 04 16:05:42 fmac.local earlyoom[4896]: sending SIGTERM to process 27421 "chrome": badness 305, VmRSS 42 MiB
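For anyone scripting around these messages, the victim can be pulled out of a journal line with a regex. This is a sketch keyed to the sample above; the message format belongs to earlyoom and may change between versions:

```python
# Extract the victim details from an earlyoom "sending SIG..." journal
# line (format taken from the sample messages quoted in this thread).
import re

LINE = ('Jan 04 16:05:42 fmac.local earlyoom[4896]: sending SIGTERM to '
        'process 27421 "chrome": badness 305, VmRSS 42 MiB')

PATTERN = re.compile(
    r'sending (SIG\w+) to process (\d+) "([^"]+)": '
    r'badness (\d+), VmRSS (\d+) MiB')

def parse_kill_line(line):
    """Return a dict describing the kill, or None if the line doesn't match."""
    m = PATTERN.search(line)
    if not m:
        return None
    sig, pid, name, badness, rss = m.groups()
    return {"signal": sig, "pid": int(pid), "name": name,
            "badness": int(badness), "vmrss_mib": int(rss)}
```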
Additionally, there was a question during the chat discussion: how the earlyoom setup will work together with OOMPolicy and any other related options of systemd units? Will systemd recognize the OOM event?
My understanding of systemd OOMPolicy= behavior is that it looks for the kernel's oom-killer messages and acts upon those. While earlyoom uses the same metric (oom_score) as the oom-killer, it does not invoke the oom-killer. Therefore systemd probably does not get the proper hint to implement OOMPolicy=.
Fedora needs to discuss how big of a problem that is, whether there's any way to mitigate it or tolerate it, weighing the pros of earlyoom for a short period versus the cons of punting this problem for another release. This proposal does not intend to step on other superseding work in this area, but if it does, it'll be withdrawn.
On Sat, Jan 04, 2020 at 04:38:19PM -0700, Chris Murphy wrote:
On Sat, Jan 4, 2020 at 2:51 AM Aleksandra Fedorova alpha@bookwar.info wrote:
Since in the Change we are not introducing just the earlyoom tool but enable it with a specific profile I would add those details here. Smth like:
"earlyoom service will choose the offending process based on the same oom_score as kernel uses. It will send a SIGTERM signal on 10% of RAM left, and SIGKILL on 5%"
I add this information to the summary. Also, I think these numbers may need to change to avoid prematurely sending SIGTERM when the system has no swap device.
As I understand in the current setup we are looking more for a controlled failure scenario rather than for a solution.
Yes, it's fair to say this proposal is to make things "less bad". It doesn't improve system responsiveness. Once heavy swap starts, the system is sluggish, stutters, and briefly stalls. This proposal doesn't fix that. There is a lot of room for improvement.
Can we get a specific manual, what users supposed to do, once they trigger the earlyoom? Does earlyoom help in reporting? Which logs we need to look at?
Maybe add a section in UX part of the change, or setup a dedicated wiki page?
The user shouldn't need to do anything differently than if the kernel oom-killer had triggered. The system journal will contain messages showing what was killed and why:
Jan 04 16:05:42 fmac.local earlyoom[4896]: low memory! at or below SIGTERM limits: mem 10 %, swap 10 % Jan 04 16:05:42 fmac.local earlyoom[4896]: sending SIGTERM to process 27421 "chrome": badness 305, VmRSS 42 MiB
Additionally, there was a question during the chat discussion: how the earlyoom setup will work together with OOMPolicy and any other related options of systemd units? Will systemd recognize the OOM event?
My understanding of systemd OOMPolicy= behavior, is it looks for the kernel's oom-killer messages and acts upon those. Whereas earlyoom uses the same metric (oom_score) as the oom-killer, it does not invoke the oom-killer. Therefore systemd probably does not get the proper hint to implement OOMPolicy=
Yes. The kernel reports oom events in the cgroup file memory.events, and systemd waits for an inotify event on that file; OOMPolicy=stop is implemented that way. And the OOMPolicy=kill option is "implemented" by setting memory.oom.group=1 in the kernel [1] and having the kernel kill all the processes. So systemd is providing a thin wrapper around the kernel functionality.
If processes are not killed by the kernel but through a signal from userspace, all of this will not work.
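The detection path described above can be sketched: the kernel bumps counters in the cgroup's memory.events file, systemd watches that file with inotify, and parsing it is trivial. The sample content below is synthetic; real files live under /sys/fs/cgroup:

```python
# Parse cgroup v2 memory.events content into a dict of counters.
def parse_memory_events(text):
    events = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            events[key] = int(value)
    return events

# Synthetic sample of /sys/fs/cgroup/<unit>/memory.events:
sample = "low 0\nhigh 12\nmax 3\noom 1\noom_kill 1\n"
counters = parse_memory_events(sample)
# A nonzero oom_kill counter means the kernel killed something in this
# cgroup. A SIGKILL sent from userspace by earlyoom never increments
# these counters, which is why OOMPolicy= cannot observe earlyoom's actions.
```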
[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-int...
Zbyszek
Fedora need to discuss how big of a problem that is, if there's anyway to mitigate it, or tolerate it, weighing the pros of earlyoom for a short period, versus the cons of punting this problem for another release. This proposal does not intend to step on other superseding work in this area, but if it does, it'll be withdrawn.
On Sun, Jan 5, 2020 at 10:18 AM Zbigniew Jędrzejewski-Szmek zbyszek@in.waw.pl wrote:
On Sat, Jan 04, 2020 at 04:38:19PM -0700, Chris Murphy wrote:
On Sat, Jan 4, 2020 at 2:51 AM Aleksandra Fedorova alpha@bookwar.info wrote:
Since in the Change we are not introducing just the earlyoom tool but enable it with a specific profile I would add those details here. Smth like:
"earlyoom service will choose the offending process based on the same oom_score as kernel uses. It will send a SIGTERM signal on 10% of RAM left, and SIGKILL on 5%"
I add this information to the summary. Also, I think these numbers may need to change to avoid prematurely sending SIGTERM when the system has no swap device.
As I understand in the current setup we are looking more for a controlled failure scenario rather than for a solution.
Yes, it's fair to say this proposal is to make things "less bad". It doesn't improve system responsiveness. Once heavy swap starts, the system is sluggish, stutters, and briefly stalls. This proposal doesn't fix that. There is a lot of room for improvement.
Can we get a specific manual, what users supposed to do, once they trigger the earlyoom? Does earlyoom help in reporting? Which logs we need to look at?
Maybe add a section in UX part of the change, or setup a dedicated wiki page?
The user shouldn't need to do anything differently than if the kernel oom-killer had triggered. The system journal will contain messages showing what was killed and why:
Jan 04 16:05:42 fmac.local earlyoom[4896]: low memory! at or below SIGTERM limits: mem 10 %, swap 10 % Jan 04 16:05:42 fmac.local earlyoom[4896]: sending SIGTERM to process 27421 "chrome": badness 305, VmRSS 42 MiB
Additionally, there was a question during the chat discussion: how the earlyoom setup will work together with OOMPolicy and any other related options of systemd units? Will systemd recognize the OOM event?
My understanding of systemd OOMPolicy= behavior, is it looks for the kernel's oom-killer messages and acts upon those. Whereas earlyoom uses the same metric (oom_score) as the oom-killer, it does not invoke the oom-killer. Therefore systemd probably does not get the proper hint to implement OOMPolicy=
Yes. The kernel reports oom events in the cgroup file memory.events, and systemd waits for an inotify event on that file; OOMPolicy=stop is implemented that way. And the OOMPolicy=kill option is "implemented" by setting memory.oom.group=1 in the kernel [1] and having the kernel kill all the processes. So systemd is providing a thin wrapper around the kernel functionality.
If processes are not killed by the kernel but through a signal from userspace, all of this will not work.
I grepped /usr/lib/systemd and /etc/systemd for "OOM" on my workstation, and it seems that only the OOMScoreAdjust option is used in the installed systemd units. And this option will be respected by earlyoom.
Since on workstation we don't tweak OOMPolicy on the unit level, I'd say we can leave the tweaking to the system administrators: when there is a need to adjust the OOMPolicy of a service, administrators would need to tweak or disable the earlyoom service as well.
But I'd like to understand better the difference between _default_ OOM-event and _default_ earlyoom-event:
Afaik DefaultOOMPolicy is set to "stop", which means if one of the processes in the service is killed by OOM, other processes from the same service are gracefully stopped by systemd.
What is the default behavior of the systemd service on external SIGTERM/SIGKILL signal sent to the process by earlyoom?
[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-int...
Zbyszek
On Sun, Jan 05, 2020 at 12:29:40PM +0100, Aleksandra Fedorova wrote:
On Sun, Jan 5, 2020 at 10:18 AM Zbigniew Jędrzejewski-Szmek zbyszek@in.waw.pl wrote:
On Sat, Jan 04, 2020 at 04:38:19PM -0700, Chris Murphy wrote:
On Sat, Jan 4, 2020 at 2:51 AM Aleksandra Fedorova alpha@bookwar.info wrote:
Since in the Change we are not introducing just the earlyoom tool but enable it with a specific profile I would add those details here. Smth like:
"earlyoom service will choose the offending process based on the same oom_score as kernel uses. It will send a SIGTERM signal on 10% of RAM left, and SIGKILL on 5%"
I add this information to the summary. Also, I think these numbers may need to change to avoid prematurely sending SIGTERM when the system has no swap device.
As I understand in the current setup we are looking more for a controlled failure scenario rather than for a solution.
Yes, it's fair to say this proposal is to make things "less bad". It doesn't improve system responsiveness. Once heavy swap starts, the system is sluggish, stutters, and briefly stalls. This proposal doesn't fix that. There is a lot of room for improvement.
Can we get a specific manual, what users supposed to do, once they trigger the earlyoom? Does earlyoom help in reporting? Which logs we need to look at?
Maybe add a section in UX part of the change, or setup a dedicated wiki page?
The user shouldn't need to do anything differently than if the kernel oom-killer had triggered. The system journal will contain messages showing what was killed and why:
Jan 04 16:05:42 fmac.local earlyoom[4896]: low memory! at or below SIGTERM limits: mem 10 %, swap 10 % Jan 04 16:05:42 fmac.local earlyoom[4896]: sending SIGTERM to process 27421 "chrome": badness 305, VmRSS 42 MiB
Additionally, there was a question during the chat discussion: how the earlyoom setup will work together with OOMPolicy and any other related options of systemd units? Will systemd recognize the OOM event?
My understanding of systemd OOMPolicy= behavior, is it looks for the kernel's oom-killer messages and acts upon those. Whereas earlyoom uses the same metric (oom_score) as the oom-killer, it does not invoke the oom-killer. Therefore systemd probably does not get the proper hint to implement OOMPolicy=
Yes. The kernel reports oom events in the cgroup file memory.events, and systemd waits for an inotify event on that file; OOMPolicy=stop is implemented that way. And the OOMPolicy=kill option is "implemented" by setting memory.oom.group=1 in the kernel [1] and having the kernel kill all the processes. So systemd is providing a thin wrapper around the kernel functionality.
If processes are not killed by the kernel but through a signal from userspace, all of this will not work.
I grepped /usr/lib/systemd and /etc/systemd for "OOM" on my workstation and it seems that we have only OOMScoreAdjust option used in the installed systemd units. And this option will be respected by earlyoom.
Since on workstation we don't use tweaking of the OOMPolicy on the unit level, I'd say we can leave the tweaking to the system administrators: when there is need to adjust OOMPolicy of a service, administrators would need to tweak or disable earlyoom service as well.
Having "conflicts" between things, in the sense that using one feature means that another feature needs to be disabled, is always an option. But it's never a very good option. I think that it isn't too important to keep OOMPolicy= working, since it's a new and relatively unused thing. Nevertheless, it would be nice to find a way to avoid this and support both features at the same time. This thread 'til now is mostly about establishing whether there really is a conflict (it seems yes) and whether there is some easy way to avoid it (not sure yet...). I think we should explore that before settling on the easy but suboptimal answer.
But I'd like to understand better the difference between _default_ OOM-event and _default_ earlyoom-event:
Afaik DefaultOOMPolicy is set to "stop", which means if one of the processes in the service is killed by OOM, other processes from the same service are gracefully stopped by systemd.
What is the default behavior of the systemd service on external SIGTERM/SIGKILL signal sent to the process by earlyoom?
It depends on which of the processes is killed. If the main process is killed with SIGTERM, systemd will consider this a normal successful termination. If the main process is killed with SIGKILL, systemd will consider this a failure. (Both of those cases can be modified by SuccessExitStatus=.) If some random subprocess is killed, systemd will not care at all. So in general, just killing the main process with SIGTERM results at least in systemd reporting successful termination when it shouldn't.
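One practical consequence of the above: a service whose main process might be SIGKILLed by earlyoom can opt into being brought back, since SIGKILL counts as a failed exit. An illustrative drop-in (a hypothetical example, not part of this proposal):

```
# /etc/systemd/system/example.service.d/restart.conf (illustrative only)
[Service]
# Restart after a failed exit such as SIGKILL of the main process;
# a clean SIGTERM termination does not trigger a restart.
Restart=on-failure
```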
Zbyszek
On Sun, Jan 5, 2020 at 2:18 AM Zbigniew Jędrzejewski-Szmek zbyszek@in.waw.pl wrote:
On Sat, Jan 04, 2020 at 04:38:19PM -0700, Chris Murphy wrote:
My understanding of systemd OOMPolicy= behavior, is it looks for the kernel's oom-killer messages and acts upon those. Whereas earlyoom uses the same metric (oom_score) as the oom-killer, it does not invoke the oom-killer. Therefore systemd probably does not get the proper hint to implement OOMPolicy=
Yes. The kernel reports oom events in the cgroup file memory.events, and systemd waits for an inotify event on that file; OOMPolicy=stop is implemented that way. And the OOMPolicy=kill option is "implemented" by setting memory.oom.group=1 in the kernel [1] and having the kernel kill all the processes. So systemd is providing a thin wrapper around the kernel functionality.
If processes are not killed by the kernel but through a signal from userspace, all of this will not work.
The gotcha on the desktop with the kernel oom-killer is that by the time it's needed, it's way past too late. And it may never trigger.
The central problem to be solved isn't even what does the OOM killing or when: it's the ridiculously bad system responsiveness during heavy swap usage.
My top criticism of the feature proposal is that it doesn't address the responsiveness problem head on. It just reduces the duration of badness. And the reduction isn't nearly enough.
One thing that helps the heavy swap problem, today? A much smaller swap partition. In fact, no swap partition alleviates the problem entirely, but of course that has other consequences (that the working group is discussing in #120).
On Sun, Jan 5, 2020 at 12:39 AM Chris Murphy lists@colorremedies.com wrote:
On Sat, Jan 4, 2020 at 2:51 AM Aleksandra Fedorova alpha@bookwar.info wrote:
Since in the Change we are not introducing just the earlyoom tool but enable it with a specific profile I would add those details here. Smth like:
"earlyoom service will choose the offending process based on the same oom_score as kernel uses. It will send a SIGTERM signal on 10% of RAM left, and SIGKILL on 5%"
I add this information to the summary. Also, I think these numbers may need to change to avoid prematurely sending SIGTERM when the system has no swap device.
As I understand in the current setup we are looking more for a controlled failure scenario rather than for a solution.
Yes, it's fair to say this proposal is to make things "less bad". It doesn't improve system responsiveness. Once heavy swap starts, the system is sluggish, stutters, and briefly stalls. This proposal doesn't fix that. There is a lot of room for improvement.
Can we get a specific manual, what users supposed to do, once they trigger the earlyoom? Does earlyoom help in reporting? Which logs we need to look at?
Maybe add a section in UX part of the change, or setup a dedicated wiki page?
The user shouldn't need to do anything differently than if the kernel oom-killer had triggered. The system journal will contain messages showing what was killed and why:
Jan 04 16:05:42 fmac.local earlyoom[4896]: low memory! at or below SIGTERM limits: mem 10 %, swap 10 % Jan 04 16:05:42 fmac.local earlyoom[4896]: sending SIGTERM to process 27421 "chrome": badness 305, VmRSS 42 MiB
I wonder, how am I as a user going to be informed about the earlyoom event? I assume abrt will recognize the crash? Will it be easily visible from the abrt report that it was the OOM?
The concern is: if we enable such a service, will we get a large amount of vague bug reports from users who don't understand what has happened? Can we make it somehow easier to debug?
Additionally, there was a question during the chat discussion: how the earlyoom setup will work together with OOMPolicy and any other related options of systemd units? Will systemd recognize the OOM event?
My understanding of systemd OOMPolicy= behavior, is it looks for the kernel's oom-killer messages and acts upon those. Whereas earlyoom uses the same metric (oom_score) as the oom-killer, it does not invoke the oom-killer. Therefore systemd probably does not get the proper hint to implement OOMPolicy=
Fedora need to discuss how big of a problem that is, if there's anyway to mitigate it, or tolerate it, weighing the pros of earlyoom for a short period, versus the cons of punting this problem for another release. This proposal does not intend to step on other superseding work in this area, but if it does, it'll be withdrawn.
-- Chris Murphy
On Sun, Jan 5, 2020 at 4:43 AM Aleksandra Fedorova alpha@bookwar.info wrote:
I wonder, how I as a user going to be informed about the earlyoom-event?
Same as a kernel oom-killer event. Primary source is the journal.
But well before either earlyoom sends SIGTERM or the kernel oom-killer kills something, the user will know something is wrong, because system responsiveness will be stuttering or even already intermittently hanging. Earlyoom is not aggressively clobbering things, except on system configurations that have no swap device. That configuration probably needs some earlyoom tweaking, and we're looking at that; but then those folks also aren't experiencing much reduced system responsiveness in these cases, because their system can't heavily swap.
I assume abrt will recognize the crash? Will it be easily visible from the abrt report that it was the OOM?
No. It's not a crash. Earlyoom sends SIGTERM first, and only sends SIGKILL if the process doesn't respond to SIGTERM in time. And the kernel oom-killer doesn't result in an abrt report either.
The concern is: if we enable such a service, will we get large amount of vague bug reports from users who don't understand what has happened. Can we make it somehow easier to debug?
Unless further real world testing uncovers something very new and different from my testing (entirely possible, but I can't estimate that probability), there won't be a measurable increase in bug reports related to this.
Based on my limited testing (I've done around 200+ tests of oom-killer, earlyoom SIGTERM (never have seen a SIGKILL), and nohang; and perhaps 80 of those tests involved forced power off during heavy swap, compile and system use) there really isn't anything that requires the user to get involved.
Also, there isn't a bug here per se. It's a series of intentional designs colliding into a deeply problematic user experience for the desktop: the "operating system", i.e. Fedora Workstation providing the kernel, systemd, a bunch of services, libraries, and policies, permits an unprivileged process to ask for essentially unlimited resources and overcommit the system, *and* then heavy swap use results in compromised system responsiveness and control.
Earlyoom doesn't change any of that; it just selects a process for SIGTERM much sooner than the kernel oom-killer would. And that only stops the bad experience, by stopping the resource-hogging process. It isn't actually fixing anything. It is in some sense an act of desperation that's been a long time coming. Arguably, earlyoom isn't aggressive enough and doesn't stop the badness soon enough.
On Sun, Jan 5, 2020 at 12:43 PM Aleksandra Fedorova alpha@bookwar.info wrote:
I wonder, how I as a user going to be informed about the earlyoom-event? I assume abrt will recognize the crash? Will it be easily visible from the abrt report that it was the OOM?
The concern is: if we enable such a service, will we get large amount of vague bug reports from users who don't understand what has happened. Can we make it somehow easier to debug?
FWIW, the behavior on Android is very close to what is proposed here. If your application exceeds the amount of available memory, it simply closes right in front of your eyes. No explanation, nothing, it's just gone (might be different on latest Android versions). The same thing would happen with EarlyOOM - some application would disappear.
I agree it would be nice to inform the user before or at least after. Windows can do it - they show a notification roughly saying "Your system is running out of memory and some application might get closed". (At least they used to in the old days, I haven't run out of memory for a long time, and I don't know whether Windows 10 behaves the same way). But I think it should not be a stopper for the proposal as it is. Even without the notification the user experience is improved over the default behavior.
On 2020-01-06 18:31, Kamil Paral wrote:
FWIW, the behavior on Android is very close to what is proposed here. If your application exceeds the amount of available memory, it simply closes right in front of your eyes. No explanation, nothing, it's just gone (might be different on latest Android versions). The same thing would happen with EarlyOOM - some application would disappear.
The analogy is not completely fair. On Android, applications are designed to be started and stopped by the system, and they are supposed to save their entire state so that when restarted nothing has apparently happened, from the point of view of the user. (Many applications are badly written, but that's another story...) And we are talking about background applications, on a system where only one application is in the foreground (only very recently can you have two applications in the foreground). Finally, it is the applications that are stopped (by asking them nicely through an event), not general system processes; Android would never kill a wpa_supplicant process, for example.
Android has a concept of "cache" of background applications, they are there, if possible, just to have them back very quickly; it is similar to how Linux keeps dirty disk content in RAM and pushes it to disk when RAM must be freed.
Regards.
On Mon, Jan 6, 2020 at 8:52 PM Roberto Ragusa mail@robertoragusa.it wrote:
On 2020-01-06 18:31, Kamil Paral wrote:
FWIW, the behavior on Android is very close to what is proposed here. If
your application exceeds the amount of available memory, it simply closes right in front of your eyes. No explanation, nothing, it's just gone (might be different on latest Android versions). The same thing would happen with EarlyOOM - some application would disappear.
The analogy is not completely fair. On Android, applications are designed to be started and stopped by the system, and they are supposed to save their entire state so that when restarted, from the user's point of view, nothing has apparently happened. (Many applications are badly written, but that's another story...) And we are talking about background applications, on a system where only one application is in the foreground (only very recently can you have two applications in the foreground). Finally, it is the applications that are stopped (by asking them nicely through an event), not general system processes; Android would never kill a wpa_supplicant process, for example.
Android has a concept of a "cache" of background applications: they are kept around, if possible, just to bring them back very quickly; it is similar to how Linux keeps dirty disk content in RAM and pushes it to disk when RAM must be freed.
Sure, Android is quite a different world, I'm not saying the comparison applies 1:1. But the end effect is similar - a window just disappears. And it's even less obvious why it happened, because you don't have any swap and therefore you haven't gone through any performance degradation. That's all I wanted to note. (Also, I'm skeptical about the app saving state before being killed, because it has already run out of memory and can't function properly. But let's not go off-topic.)
On Mon, 6 Jan 2020, 18:32 Kamil Paral, kparal@redhat.com wrote:
On Sun, Jan 5, 2020 at 12:43 PM Aleksandra Fedorova alpha@bookwar.info wrote:
I wonder: how am I as a user going to be informed about the earlyoom event? I assume abrt will recognize the crash? Will it be easily visible from the abrt report that it was the OOM?
The concern is: if we enable such a service, will we get a large number of vague bug reports from users who don't understand what has happened? Can we make it somehow easier to debug?
FWIW, the behavior on Android is very close to what is proposed here. If your application exceeds the amount of available memory, it simply closes right in front of your eyes. No explanation, nothing, it's just gone (might be different on latest Android versions). The same thing would happen with EarlyOOM - some application would disappear.
I agree it would be nice to inform the user before or at least after. Windows can do it - they show a notification roughly saying "Your system is running out of memory and some application might get closed". (At least they used to in the old days, I haven't run out of memory for a long time, and I don't know whether Windows 10 behaves the same way). But I think it should not be a stopper for the proposal as it is. Even without the notification the user experience is improved over the default behavior.
I am not convinced that it is an improvement to be honest.
UX before: system works, I run a heavy application, the system starts to hang, I understand that there is an issue, I can kill the app or reboot, which gives me a clean and working system.
UX after: system works, no visible problems. Suddenly a random app disappears, no errors or crashes are reported to me. It might be that my active app is killed, then I know that something happened, but what if a background process is killed? Maybe my messenger app?
I am going to keep working in my main app without noticing that I lost something, not knowing that I need to take action. And my system now runs in a weird state, and can stay there for days, which will lead to more weird and nonreproducible errors later.
The "hang" of a system was the feedback user got from the system that there is something wrong. Not ideal, but at least there was something. With this feature we don't solve the issue, we remove the "bad" feedback, and we don't provide any replacement for it making memory problem completely invisible.
Is it really a good UX?
_______________________________________________
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
On Tue, Jan 7, 2020 at 12:08 AM Aleksandra Fedorova alpha@bookwar.info wrote:
On Mon, 6 Jan 2020, 18:32 Kamil Paral, kparal@redhat.com wrote:
On Sun, Jan 5, 2020 at 12:43 PM Aleksandra Fedorova alpha@bookwar.info wrote:
I wonder: how am I as a user going to be informed about the earlyoom event? I assume abrt will recognize the crash? Will it be easily visible from the abrt report that it was the OOM?
The concern is: if we enable such a service, will we get a large number of vague bug reports from users who don't understand what has happened? Can we make it somehow easier to debug?
FWIW, the behavior on Android is very close to what is proposed here. If your application exceeds the amount of available memory, it simply closes right in front of your eyes. No explanation, nothing, it's just gone (might be different on latest Android versions). The same thing would happen with EarlyOOM - some application would disappear.
I agree it would be nice to inform the user before or at least after. Windows can do it - they show a notification roughly saying "Your system is running out of memory and some application might get closed". (At least they used to in the old days, I haven't run out of memory for a long time, and I don't know whether Windows 10 behaves the same way). But I think it should not be a stopper for the proposal as it is. Even without the notification the user experience is improved over the default behavior.
I am not convinced that it is an improvement to be honest.
UX before: system works, I run a heavy application, the system starts to hang, I understand that there is an issue, I can kill the app or reboot, which gives me a clean and working system.
UX after: system works, no visible problems. Suddenly a random app disappears, no errors or crashes are reported to me. It might be that my active app is killed, then I know that something happened, but what if a background process is killed? Maybe my messenger app?
This comparison is not accurate.
1. In the "UX before" case, it's unfair that you're comparing against user intervention to kill the app, rather than against the oom-killer.
2. The oom-killer reports to the journal. earlyoom reports to the journal. They're the same.
3. Quite a lot of errors and crashes are only ever reported to the journal.
4. In the "UX after" case (i.e. with earlyoom running), the system starts to hang, you should understand there's an issue, and recovery shouldn't take quite as long - but you'll still wish the system hadn't become hung up in the first place.
5. The app is quit with SIGTERM, not killed.
6. The kernel oom-killer can kill background processes too.
I am going to keep working in my main app without noticing that I lost something, not knowing that I need to take action. And my system now runs in a weird state, and can stay there for days, which will lead to more weird and nonreproducible errors later.
No different than with the oom-killer, assuming you're willing to wait for it to take action. If you force power off instead, there's some chance you're still going to do that with earlyoom, because the responsivity problem has more to do with congestion as a result of heavy swapping.
The "hang" of a system was the feedback user got from the system that there is something wrong. Not ideal, but at least there was something. With this feature we don't solve the issue, we remove the "bad" feedback, and we don't provide any replacement for it making memory problem completely invisible.
Is it really a good UX?
Insofar as aggravation is definitely not good UX, I'd say for sure it's better to reduce user aggravation. They will still experience the hang. It just won't last quite as long, and yet it still will be too long.
On Tue, Jan 7, 2020 at 8:09 AM Aleksandra Fedorova alpha@bookwar.info wrote:
UX before: system works, I run a heavy application, the system starts to hang, I understand that there is an issue, I can kill the app or reboot, which gives me a clean and working system.
UX after: system works, no visible problems. Suddenly a random app disappears, no errors or crashes are reported to me. It might be that my active app is killed, then I know that something happened, but what if a background process is killed? Maybe my messenger app?
Or actually:
UX before: system works, I run a heavy application, system starts to hang, I can't even move my mouse, the application doesn't respond to Alt+F4, I wait patiently for a few minutes then give up and hard-reboot
UX after: system works, I run a heavy application, system starts to hang, I can't even move my mouse, the application doesn't respond to Alt+F4, I wait patiently for a few minutes then the application disappears and I have a functional system again
In your example you forget that swap needs to be filled almost to full before earlyoom starts reacting. That takes time, during which the system responsiveness is abysmal. The UX difference happens only after you've already suffered through a serious responsiveness degradation, and the only difference is the end state, *if* you've managed to wait long enough for earlyoom to kick in (which happens earlier than the kernel OOM killer and with better results regarding which process gets killed, according to Chris).
On Tue, Jan 7, 2020 at 11:23 am, Kamil Paral kparal@redhat.com wrote:
In your example you forget that swap needs to be filled almost to full before earlyoom starts reacting. That takes time, during which the system responsiveness is abysmal. The UX difference happens only after you've already suffered through a serious responsiveness degradation, and the only difference is the end state, *if* you've managed to wait long enough for earlyoom to kick in (which happens earlier than the kernel OOM killer and with better results regarding which process gets killed, according to Chris).
Right, we understand this. earlyoom (or a systemd-level OOM solution) is only half the solution. The other half will be fixing swap. That will probably require (a) reducing the amount of swap created by anaconda, and/or (b) swap on zram.
On 1/5/20 12:38 AM, Chris Murphy wrote:
On Sat, Jan 4, 2020 at 2:51 AM Aleksandra Fedorova alpha@bookwar.info wrote:
Since in the Change we are not just introducing the earlyoom tool but enabling it with a specific profile, I would add those details here. Something like:
"The earlyoom service will choose the offending process based on the same oom_score the kernel uses. It will send a SIGTERM signal when 10% of RAM is left, and a SIGKILL when 5% is left."
I added this information to the summary. Also, I think these numbers may need to change to avoid prematurely sending SIGTERM when the system has no swap device.
I read that sentence in a different way: "earlyoom will make only 90% of your RAM available, so it is effectively using 10% of your RAM".
On my 32GB laptop that means 3.2GB of RAM becomes unusable. And on my 64GB machine I'm being robbed of 6.4GB. Wow.
How low can these numbers be pushed? Even 3% would be 1GB out of 32GB.
Regards.
On Mon, Jan 6, 2020 at 4:57 AM Roberto Ragusa mail@robertoragusa.it wrote:
On 1/5/20 12:38 AM, Chris Murphy wrote:
On Sat, Jan 4, 2020 at 2:51 AM Aleksandra Fedorova alpha@bookwar.info wrote:
Since in the Change we are not just introducing the earlyoom tool but enabling it with a specific profile, I would add those details here. Something like:
"The earlyoom service will choose the offending process based on the same oom_score the kernel uses. It will send a SIGTERM signal when 10% of RAM is left, and a SIGKILL when 5% is left."
I added this information to the summary. Also, I think these numbers may need to change to avoid prematurely sending SIGTERM when the system has no swap device.
I read that sentence in a different way: "earlyoom will make only 90% of your RAM available, so it is effectively using 10% of your RAM".
On my 32GB laptop that means 3.2GB of RAM becomes unusable. And on my 64GB machine I'm being robbed of 6.4GB. Wow.
How low can these numbers be pushed? Even 3% would be 1GB out of 32GB.
What you say is only true in the case of systems with no swap. That's mentioned in the proposal. If swap is being used, for sure essentially all of your RAM is being used, so it's swap that's the determining factor. If you don't have swap, yes RAM becomes the determining factor and I agree that on systems with a lot of RAM, 10% is too high.
The ideal scenario is to not run earlyoom at all on systems that do not have a swap device. They're not going to run into the responsivity problem anyway, which is a direct consequence of heavy swapping.
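The threshold behavior discussed in this subthread (SIGTERM when both available RAM and free swap fall below 10%, SIGKILL below 5%, and RAM alone deciding when there is no swap device) can be sketched roughly as follows. This is an illustration of the behavior as described in the thread, not earlyoom's actual C code:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key: value kB' lines into a dict of ints (kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        info[key.strip()] = int(rest.split()[0])
    return info

def oom_action(meminfo, term_pct=10, kill_pct=5):
    """Return 'SIGKILL', 'SIGTERM', or None, per the thresholds discussed above."""
    mem_pct = 100 * meminfo["MemAvailable"] / meminfo["MemTotal"]
    swap_total = meminfo.get("SwapTotal", 0)
    if swap_total:
        swap_pct = 100 * meminfo["SwapFree"] / swap_total
        gate = max(mem_pct, swap_pct)   # both RAM and swap must be low to act
    else:
        gate = mem_pct                  # no swap device: RAM alone decides
    if gate <= kill_pct:
        return "SIGKILL"
    if gate <= term_pct:
        return "SIGTERM"
    return None
```

With swap present, nothing happens until swap is nearly full too, which is why the responsiveness degradation precedes any action.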
On Fri, Jan 03, 2020 at 02:18:40PM -0500, Ben Cotton wrote:
https://fedoraproject.org/wiki/Changes/EnableEarlyoom
== Summary == Install earlyoom package, and enable it by default. This will cause the kernel oomkiller to trigger sooner, but will not affect which process it chooses to kill off. The idea is to recover from out of memory situations sooner, rather than the typical complete system hang in which the user has no other choice but to force power off.
Hi, I'll throw another idea out here, in the hope that people can provide insight. It's something I wanted to look into for a while, but I admit to not having done any research myself, so the approach might be totally useless...
What about using the memory controller for user units to allocate memory resources between the processes in the user session? Thanks to recent developments, the gnome session uses separate systemd units (and thus separate cgroups) for various services. We could set attributes like memory.low for "the basic components of the user session", and on the other hand, memory.swap.max for "the payload", i.e. various user processes on top.
Doing something like this effectively would most likely require some changes to how we assign processes to cgroups. I still get some processes in "wrong" cgroups:
│ ├─gnome-shell-wayland.service
│ │ ├─1501571 /usr/bin/gnome-shell
│ │ ├─1501606 /usr/bin/Xwayland :0 -rootless -noreset -accessx -core -auth /run/user/1000/.mutter-Xwaylandauth.SCXID0 -listen 4 -listen 5 -displayfd 6
│ │ ├─1501713 ibus-daemon --panel disable -r --xim
│ │ ├─1501718 /usr/libexec/ibus-dconf
│ │ ├─1501719 /usr/libexec/ibus-extension-gtk3
│ │ ├─1501724 /usr/libexec/ibus-x11 --kill-daemon
│ │ ├─1501980 /usr/libexec/ibus-engine-simple
│ │ ├─1503586 /usr/lib64/firefox/firefox
│ │ ├─1503691 /usr/lib64/firefox/firefox -contentproc -childID 2 -isForBrowser ...
│ │ ├─1503701 /usr/lib64/firefox/firefox -contentproc -childID 3 -isForBrowser ...
│ │ ├─1503747 /usr/lib64/firefox/firefox -contentproc -childID 4 -isForBrowser ...
│ │ ├─1520219 bwrap --args 35 telegram-desktop --
│ │ ├─1520229 bwrap --args 35 xdg-dbus-proxy --args=37
│ │ ├─1520230 xdg-dbus-proxy --args=37
│ │ ├─1520232 bwrap --args 35 telegram-desktop --
│ │ ├─1520233 /app/bin/Telegram --
│ │ ├─1540753 pavucontrol
...
(and firefox and anything-running-as-flatpak would be the prime candidates to split out into their own cgroups and build resource limits around...)
The cgroup hierarchy is mostly flat (most user services get cgroups directly in the root of the user tree under /sys/fs/cgroup/user.slice/user-nnn.slice/user@nnn.service/). To make resource assignment effective, I would like to see a "basic.slice" (name TBD) that would gather various "core" stuff like gnome-shell-wayland.service, dbus-broker.service, and whatever other services the graphical session depends on. This would get minimum memory protections and such.
Then there should be a "payload.slice", and underneath that all the non-essential services and everything the user starts from the terminal.
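A hypothetical sketch of what such slice units might look like. The slice names come from the message above (and are explicitly TBD); the values are illustrative, not recommendations. MemoryLow= and MemorySwapMax= are the systemd settings backing the cgroup memory.low and memory.swap.max attributes mentioned earlier:

```ini
# basic.slice - core session components ("basic.slice" is a placeholder name)
[Slice]
# Protect this slice from memory reclaim up to the given amount.
MemoryLow=512M

# payload.slice - non-essential user processes (placeholder name)
[Slice]
# Cap how much swap the user "payload" may consume.
MemorySwapMax=2G
```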
What I *don't know* is: how much overhead enabling the memory controller has, and whether those resource limits would actually have the desired effect (and more generally, how they would best be set).
Zbyszek
On Sat, Jan 4, 2020 at 11:38 am, Zbigniew Jędrzejewski-Szmek zbyszek@in.waw.pl wrote:
What about using the memory controller for user units to allocate memory resources between the processes in the user session? Thanks to recent developments, the gnome session uses separate systemd units (and thus separate cgroups) for various services. We could set attributes like memory.low for "the basic components of the user session", and on the other hand, memory.swap.max for "the payload", i.e. various user processes on top.
This looks interesting. I'd love to see more serious discussion of this proposal. Carving out dedicated memory for essential desktop processes seems like something we should be able to do in 2020.
Michael
On Sat, 2020-01-04 at 12:17 -0600, Michael Catanzaro wrote:
On Sat, Jan 4, 2020 at 11:38 am, Zbigniew Jędrzejewski-Szmek zbyszek@in.waw.pl wrote:
What about using the memory controller for user units to allocate memory resources between the processes in the user session? Thanks to recent developments, the gnome session uses separate systemd units (and thus separate cgroups) for various services. We could set attributes like memory.low for "the basic components of the user session", and on the other hand, memory.swap.max for "the payload", i.e. various user processes on top.
This looks interesting. I'd love to see more serious discussion of this proposal. Carving out dedicated memory for essential desktop processes seems like something we should be able to do in 2020.
And it seems like it is: in the issue about this whole topic, some implemented solutions were mentioned: https://github.com/Nefelim4ag/Ananicy
But not further commented at least on pagure. https://pagure.io/fedora-workstation/issue/98#comment-615424
Which I think is quite sad, as those seem to be a much better way to handle these things. Having a daemon that assigns cgroups to processes lets the kernel do its thing, keeps us all sane, and keeps the system reasonably responsive.
I guess the important question here is: does it really prevent hanging, and what's the origin of the hanging? Is it that the kernel starts to swap and therefore eats up all CPU time, or is it the programs in the foreground that suddenly all try to get their piece of memory back, forcing kswapd onto the CPU?
My guess would be the latter, but I'm sure the group who did the research on this topic has a better insight into this.
On Fr, 03.01.20 14:18, Ben Cotton (bcotton@redhat.com) wrote:
https://fedoraproject.org/wiki/Changes/EnableEarlyoom
== Summary == Install earlyoom package, and enable it by default. This will cause the kernel oomkiller to trigger sooner, but will not affect which process it chooses to kill off. The idea is to recover from out of memory situations sooner, rather than the typical complete system hang in which the user has no other choice but to force power off.
Hmm, are we sure this is something we want to have in the default install? Is the code really good enough for that?
Looking at the sources very superficially I see a couple of problems:
1. Waking up all the time in 100ms intervals? We generally try to avoid waking the CPU up all the time if nothing happens. Saving power and things.
2. New code using system() in the year 2020? Really?
3. Fixed size buffers and implicit, undetected, truncation of strings at various places (for example, when formatting the shell string to pass to system()).
But more importantly: are we sure this actually operates the way we should? i.e. PSI is really what should be watched. It is not interesting who uses how much memory and triggering kills on that. What matters is to detect when the system becomes slow due to that, i.e. *latencies* introduced due to memory pressure and that's what PSI is about, and hence what should be used.
But even if we'd ignore that: in order to fight latencies, one should watch latencies. OOM killing per process is just not appropriate on a systemd system: all our system services (and a good chunk of our user services too) are sorted neatly into cgroups, and we really should kill them as a whole, not just individual processes inside them. systemd manages that today, and makes exceptions configurable via OOMPolicy=, and with your earlyoom stuff you break that.
This looks like second-guessing the kernel memory management folks at a place where one can only lose, while at the same time breaking correct OOM reporting by the kernel via cgroups and such.
Also: what precisely is this even supposed to do? Replace the algorithm for detecting *when* to go on a kill rampage? Or actually replace the algorithm selecting *what* to kill during a kill rampage?
If it's the former (which the name of the project suggests: _early_oom), then at the most basic the tool should let the kernel do the killing, i.e. "echo f > /proc/sysrq-trigger". That way the reporting via cgroups isn't fucked, systemd can still do its thing, and the kernel can kill per cgroup rather than per process...
Anyway, this all sounds very very fishy to me. Not thought through to the end, and I am pretty sure this is something the kernel memory management folks should give a blessing to. Second-guessing the kernel like that is just a bad idea if you ask me.
I mean, yes, the OOM killer might not be that great currently, but this sounds like something to fix in kernel land, and if that doesn't work out for some reason because kernel devs can't agree, then do it as fallback in userspace, but with sound input from the kernel folks, and the blessing of at least some of the kernel folks.
Lennart
-- Lennart Poettering, Berlin
On Mon, Jan 6, 2020 at 5:08 AM Lennart Poettering mzerqung@0pointer.de wrote:
I mean, yes, the OOM killer might not be that great currently, but this sounds like something to fix in kernel land, and if that doesn't work out for some reason because kernel devs can't agree, then do it as fallback in userspace, but with sound input from the kernel folks, and the blessing of at least some of the kernel folks.
I agree that the implementation may need some work, but one thing should be clear: it is not a winning strategy to wait for the kernel folks to fix this. They have for all practical purposes given up on this problem, and are not going to solve the issue for us.
On Mon, Jan 6, 2020 at 3:08 AM Lennart Poettering mzerqung@0pointer.de wrote:
Looking at the sources very superficially I see a couple of problems:
- Waking up all the time in 100ms intervals? We generally try to avoid waking the CPU up all the time if nothing happens. Saving power and things.
I agree. What do you think is a reasonable interval? Given that earlyoom won't SIGTERM until both 10% memory free and 10% swap free, and that will take at least some seconds, what about an interval of 3 seconds?
But more importantly: are we sure this actually operates the way we should? i.e. PSI is really what should be watched. It is not interesting who uses how much memory and triggering kills on that. What matters is to detect when the system becomes slow due to that, i.e. *latencies* introduced due to memory pressure and that's what PSI is about, and hence what should be used.
Earlyoom is a short term stop gap while a more sophisticated solution is still maturing. That being low-memory-monitor, which does leverage PSI.
But even if we'd ignore that: in order to fight latencies, one should watch latencies. OOM killing per process is just not appropriate on a systemd system: all our system services (and a good chunk of our user services too) are sorted neatly into cgroups, and we really should kill them as a whole, not just individual processes inside them. systemd manages that today, and makes exceptions configurable via OOMPolicy=, and with your earlyoom stuff you break that.
OOMPolicy= depends on the kernel oom-killer, which is extremely reluctant to trigger at all. Consistently in my testing, the vast majority of the time, kernel oom-killer takes > 30 minutes to trigger. And it may not even kill the worst offender, but rather something like sshd. A couple of times, I've seen it kill systemd-journald. That's not a small problem.
earlyoom first sends SIGTERM. It's not different from the user saying, enough of this, let's just gracefully quit the offending process. Only if the problem continues to get worse is SIGKILL sent.
This looks like second-guessing the kernel memory management folks at a place where one can only lose, while at the same time breaking correct OOM reporting by the kernel via cgroups and such.
It is intended to be a substitute for the user hitting the power button. It's not intended as a substitute for the OS, as a whole, improving its user advocacy to do the right thing in the first place, which currently it doesn't.
For now, kernel developers have made it clear they do not care about user space responsiveness. At all. Their concern with the kernel oom-killer is strictly with keeping the kernel functioning. And the congestion that results from heavy simultaneous page-in and page-out also appears not to be a concern for kernel developers; it's a well-known problem, and they haven't made any breakthrough in this area.
So it's really going to need to be user space managed, leveraging PSI and cgroupv2. And that's the next step.
Also: what precisely is this even supposed to do? Replace the algorithm for detecting *when* to go on a kill rampage? Or actually replace the algorithm selecting *what* to kill during a kill rampage?
a. It's never a kill rampage.
b. When: it first uses SIGTERM at 10% remaining for both memory and swap, and SIGKILL at 5%. In hundreds of tests I've never seen earlyoom use SIGKILL; so far everything responds fairly immediately to SIGTERM. But I'm also testing with well-behaved programs, nothing malicious. And that's intentional. This problem is actually far worse if it were malicious.
c. What: same as the kernel oom-killer. It uses oom_score.
It isn't replacing anything. It's acting as a user advocate by approximating what a reasonable user would do, SIGTERM. The user can't do this themselves because during heavy swap system responsivity is already lost, before we're even close to OOM.
You're right, someone should absolutely solve the responsivity problem. Kernel folks have clearly ceded this. Can it be done with cgroupv2 and PSI alone? Unclear.
If it's the former (which the name of the project suggests: _early_oom), then at the most basic the tool should let the kernel do the killing, i.e. "echo f > /proc/sysrq-trigger". That way the reporting via cgroups isn't fucked, systemd can still do its thing, and the kernel can kill per cgroup rather than per process...
That would be a kill rampage. sysrq+f issues SIGKILL and definitely results in data loss, always. earlyoom uses SIGTERM as a first step, which is a much more conservative first attempt.
Anyway, this all sounds very very fishy to me. Not thought through to the end, and I am pretty sure this is something the kernel memory management folks should give a blessing to. Second-guessing the kernel like that is just a bad idea if you ask me.
There's no first or second guessing. The kernel oom-killer is strictly responsible for maintaining enough resources for the kernel. Not system responsivity. The idea of user space oom management is to take user space priorities into account, which kernel folks have rather intentionally stayed out of answering.
I mean, yes, the OOM killer might not be that great currently, but this sounds like something to fix in kernel land, and if that doesn't work out for some reason because kernel devs can't agree, then do it as fallback in userspace, but with sound input from the kernel folks, and the blessing of at least some of the kernel folks.
The kernel oom-killer works exactly as intended and designed to work. It does not give either one or two shits about user space. It cares only about proper kernel function. And to that end it's working 100% effectively, near as I can tell.
The mistake most people are making is the idea the kernel oom-killer is intended as a user space, let alone end user, advocate. That is what earlyoom and other user space oom managers are trying to do.
I do rather like your idea from some months ago, about moving to systems that have a much smaller swap during normal use, and only creating+activating a large swap at hibernation time. And after resuming from hibernation, deactivating the large swap. That way during normal operation, only "incidental" swap is allowed, and heavy swap very quickly has nowhere to go. And tied to that, a way to restrict the resources unprivileged processes get, rather than being allowed to overcommit - something low-memory-monitor attempts to achieve.
On Mo, 06.01.20 08:51, Chris Murphy (lists@colorremedies.com) wrote:
On Mon, Jan 6, 2020 at 3:08 AM Lennart Poettering mzerqung@0pointer.de wrote:
Looking at the sources very superficially I see a couple of problems:
- Waking up all the time in 100ms intervals? We generally try to avoid waking the CPU up all the time if nothing happens. Saving power and things.
I agree. What do you think is a reasonable interval? Given that earlyoom won't SIGTERM until both 10% memory free and 10% swap free, and that will take at least some seconds, what about an interval of 3 seconds?
None. Use PSI. It wakes you up only when pressure stalls reach the thresholds you declare. Which basically means you never steal CPU on an idle system; you never cause a wakeup whatsoever.
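The PSI mechanism Lennart describes works by writing a trigger specification into /proc/pressure/memory and then polling the file descriptor; the kernel only wakes the process when the stall threshold is crossed. A rough sketch, with illustrative threshold values (requires a Linux kernel with PSI enabled, 4.20+; the parsing helper just decodes the standard PSI line format):

```python
import select

def parse_psi(line):
    """Parse one PSI line, e.g. 'some avg10=0.12 avg60=0.05 avg300=0.01 total=420'."""
    kind, *fields = line.split()
    return kind, {k: float(v) for k, v in (f.split("=") for f in fields)}

def wait_for_memory_pressure(path="/proc/pressure/memory"):
    """Block until the kernel reports a memory stall - no periodic wakeups."""
    # Trigger: notify when tasks are stalled on memory ('some') for a total
    # of 150 ms within any 1 s window. Both values are illustrative.
    with open(path, "r+") as f:
        f.write("some 150000 1000000")
        poller = select.poll()
        poller.register(f, select.POLLPRI)
        poller.poll()  # sleeps until the threshold is crossed
```

Contrast this with a 100 ms polling loop: the trigger-based approach costs nothing while the system is idle.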
But more importantly: are we sure this actually operates the way we should? i.e. PSI is really what should be watched. It is not interesting who uses how much memory and triggering kills on that. What matters is to detect when the system becomes slow due to that, i.e. *latencies* introduced due to memory pressure and that's what PSI is about, and hence what should be used.
Earlyoom is a short term stop gap while a more sophisticated solution is still maturing. That being low-memory-monitor, which does leverage PSI.
Yes, l-m-m is great. If we can deploy l-m-m today already, why isn't it good enough for earlyoom?
But even if we'd ignore that: in order to fight latencies, one should watch latencies. OOM killing per process is just not appropriate on a systemd system: all our system services (and a good chunk of our user services too) are sorted neatly into cgroups, and we really should kill them as a whole, not just individual processes inside them. systemd manages that today, and makes exceptions configurable via OOMPolicy=, and with your earlyoom stuff you break that.
OOMPolicy= depends on the kernel oom-killer, which is extremely reluctant to trigger at all. Consistently in my testing, the vast majority of the time, kernel oom-killer takes > 30 minutes to trigger. And it may not even kill the worst offender, but rather something like sshd. A couple of times, I've seen it kill systemd-journald. That's not a small problem.
Well, that sounds as if OOMScoreAdjust= of these services should be tweaked. In journald we use OOMScoreAdjust=-250 and in udevd OOMScoreAdjust=-1000.
If journald is still killed too likely, we can certainly bump it to -900 or so, please file a bug.
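For reference, a bump like the one suggested above would be a drop-in such as the following (path and value illustrative; OOMScoreAdjust= maps to the kernel's /proc/&lt;pid&gt;/oom_score_adj, where -1000 disables OOM killing entirely):

```ini
# /etc/systemd/system/systemd-journald.service.d/oom.conf (hypothetical)
[Service]
# Make journald a much less likely OOM victim than ordinary processes.
OOMScoreAdjust=-900
```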
earlyoom first sends SIGTERM. It's not different from the user saying, enough of this, let's just gracefully quit the offending process. Only if the problem continues to get worse is SIGKILL sent.
This sounds as if you want low-memory-monitor, but for all services, right?
Sounds like something that is relatively easily implementable in systemd though, in a much better way, i.e. hooked to PSI...
For now, kernel developers have made it clear they do not care about user space responsiveness. At all. Their concern with kernel
References to this? I mean, the kernel developers are not a single person, they tend to have different opinions...
Also: what precisely is this even supposed to do? Replace the algorithm for detecting *when* to go on a kill rampage? Or actually replace the algorithm selecting *what* to kill during a kill rampage?
a. It's never a kill rampage.
it calls kill(), which I call a "kill rampage"...
It isn't replacing anything. It's acting as a user advocate by approximating what a reasonable user would do, SIGTERM. The user can't do this themselves because during heavy swap system responsivity is already lost, before we're even close to OOM.
You're right, someone should absolutely solve the responsiveness problem. Kernel folks have clearly ceded this. Can it be done with cgroupv2 and PSI alone? Unclear.
Sounds like someone needs to do their homework, if this is "unclear"?
I mean, you basically admit here that this isn't really figured out to the end. Maybe let's give this a bit more time and figure things out a bit more, instead of rushing earlyoom in?
Adopting something now, at a point we already clearly know that PSI is how this should be done sounds very wrong to me.
That would be a killing rampage. sysrq+f issues SIGKILL and definitely results in data loss, always. Earlyoom uses SIGTERM as a first step, which is a much more conservative first attempt.
But it sends SIGKILL next? Why? Why not sysrq+f triggered from userspace for that?
I must say the idea that there are effectively multiple process babysitters now, which both want to decide when to terminate services sounds very wrong to me...
I mean, wouldn't this all be solved much nicer, much more future proof, if someone would just do what l-m-m does as part of systemd service management, i.e. let's say an option StopOnMemoryPressure= that watches PSI and terminates services *cleanly* when needed, i.e. goes through ExecStop= and such?
And you know what, PSI is precisely defined to be used for purposes like this, we already have experience with it (see l-m-m) and a patch adding this to systemd isn't really that hard either...
I do rather like your idea from some months ago, about moving to systems that have a much smaller swap during normal use, and only creating+activating a large swap at hibernation time. And after resuming from hibernation, deactivating the large swap. That way during normal operation, only "incidental" swap is allowed, and heavy swap very quickly has nowhere to go. And tied to that, a way to restrict the resources unprivileged processes get, rather than being allowed to overcommit - something low-memory-monitor attempts to achieve.
Memory paging doesn't just mean swapping, i.e. writing stuff to and reading stuff from a swap partition of some kind. Paging also means that program code is memory mapped from binary files and can be loaded into memory and unloaded any time since it can be re-read whenever it is needed. Thus, you should be able to get under memory pressure even without swap simply because the program code of the programs you run needs to be paged in/out all the time from your rootfs, rather than from a swap partition...
Anyway, still not convinced having this is a good idea. There's a lot of homework to be done first...
Lennart
-- Lennart Poettering, Berlin
On Mo, 06.01.20 17:47, Lennart Poettering (mzerqung@0pointer.de) wrote:
On Mo, 06.01.20 08:51, Chris Murphy (lists@colorremedies.com) wrote:
On Mon, Jan 6, 2020 at 3:08 AM Lennart Poettering mzerqung@0pointer.de wrote:
Looking at the sources very superficially I see a couple of problems:
- Waking up all the time in 100ms intervals? We generally try to avoid waking the CPU up all the time if nothing happens. Saving power and things.
I agree. What do you think is a reasonable interval? Given that earlyoom won't SIGTERM until both 10% memory free and 10% swap free, and that will take at least some seconds, what about an interval of 3 seconds?
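The 10%/10% trigger condition mentioned here can be sketched by parsing /proc/meminfo. This is a simplified illustration of earlyoom's documented condition, not its actual code; treating a swapless system as "swap depleted" is an assumption:

```python
def low_memory(meminfo_text, mem_pct=10, swap_pct=10):
    """Sketch of earlyoom's trigger: act only when BOTH available RAM
    and free swap fall below the configured percentage thresholds."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest:
            fields[key] = int(rest.split()[0])  # values are in kB
    mem_low = fields["MemAvailable"] * 100 < fields["MemTotal"] * mem_pct
    # Assumption: a system with no swap counts as "swap depleted".
    swap_total = fields.get("SwapTotal", 0)
    swap_low = swap_total == 0 or fields["SwapFree"] * 100 < swap_total * swap_pct
    return mem_low and swap_low
```

In real use this would be fed the contents of /proc/meminfo on each poll interval; because both thresholds must be crossed, nothing is signaled while swap still has room.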
None. Use PSI. It wakes you up only when pressure stalls reach the thresholds you declare. Which basically means you never steal the CPUs on an idle system, you never cause a wakeup whatsoever.
But more importantly: are we sure this actually operates the way we should? i.e. PSI is really what should be watched. It is not interesting who uses how much memory and triggering kills on that. What matters is to detect when the system becomes slow due to that, i.e. *latencies* introduced due to memory pressure and that's what PSI is about, and hence what should be used.
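For reference, a minimal sketch of the PSI interface being referred to. It assumes Linux >= 4.20 for /proc/pressure and >= 5.2 for trigger support, and writing a trigger typically requires root; the threshold values are made up:

```python
import select

# /proc/pressure/memory lines look like:
#   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
#   full avg10=0.00 avg60=0.00 avg300=0.00 total=0
def parse_psi(text):
    """Parse PSI output into {"some": {...}, "full": {...}}."""
    out = {}
    for line in text.splitlines():
        kind, *pairs = line.split()
        out[kind] = {k: float(v) for k, v in (p.split("=") for p in pairs)}
    return out

def wait_for_pressure(path="/proc/pressure/memory",
                      stall_us=150_000, window_us=1_000_000):
    """Block until memory stalls exceed stall_us within window_us.
    The kernel wakes us via POLLPRI, so there is no periodic polling
    at all on an idle system."""
    with open(path, "r+b", buffering=0) as f:
        f.write(b"some %d %d" % (stall_us, window_us))
        poller = select.poll()
        poller.register(f, select.POLLPRI)
        poller.poll()  # sleeps until the kernel fires the trigger
```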
Earlyoom is a short term stop gap while a more sophisticated solution is still maturing. That being low-memory-monitor, which does leverage PSI.
Yes, l-m-m is great. If we can deploy l-m-m today already, why isn't it good enough for earlyoom?
Oops, sorry. I mean GMemoryMonitor. I assumed l-m-m and GMemoryMonitor was the same thing, but they aren't. I am not sure about l-m-m, haven't looked at it in detail.
GMemoryMonitor = great l-m-m = no idea
Lennart
-- Lennart Poettering, Berlin
On Mon, Jan 6, 2020 at 5:47 pm, Lennart Poettering mzerqung@0pointer.de wrote:
Yes, l-m-m is great. If we can deploy l-m-m today already, why isn't it good enough for earlyoom?
GMemoryMonitor is the GLib API that's implemented using low-memory-monitor's D-Bus API.
In practice, using it for OOM killing is not that great, though. We've rejected this approach because the OOM killing is causing serious problems and there are no plans to fix it: https://gitlab.freedesktop.org/hadess/low-memory-monitor/issues/8. Therefore most likely we'll use l-m-m only for advisory memory pressure notifications.
On Mon, Jan 6, 2020 at 5:47 pm, Lennart Poettering mzerqung@0pointer.de wrote:
Sounds like someone needs to do their homework, if this is "unclear"?
I mean, you basically admit here that this isn't really figured out to the end. Maybe let's give this a bit more time and figure things out a bit more, instead of rushing earlyoom in?
Adopting something now, at a point we already clearly know that PSI is how this should be done sounds very wrong to me.
I think it would absolutely be reasonable to defer from F32 -> F33 if we have concrete plans to use that delay to implement an OOM solution. E.g. if you or someone else wanted to throw together a systemd-level solution:
Sounds like something that is relatively easily implementable in systemd though, in a much better way, i.e. hooked to PSI...
I mean, wouldn't this all be solved much nicer, much more future proof, if someone would just do what l-m-m does as part of systemd service management, i.e. let's say an option StopOnMemoryPressure= that watches PSI and terminates services *cleanly* when needed, i.e. goes through ExecStop= and such?
And you know what, PSI is precisely defined to be used for purposes like this, we already have experience with it (see l-m-m) and a patch adding this to systemd isn#t really that hard either...
So again, the problem with PSI so far is that we haven't gotten it to work well. If systemd can make it work well, that would be super lovely. Sounds like that would also avoid continuous wakeups, which would be very nice.
I don't think anybody would object to a systemd-level solution. If it's part of systemd, there would no longer be concerns about architecture or code quality, and it'd feel much less hackish. We would want to test it to ensure responsiveness is comparable to what earlyoom would offer, of course.
Michael
On Mo, 06.01.20 11:22, Michael Catanzaro (mcatanzaro@gnome.org) wrote:
So I talked to Tejun Heo about this (kernel cgroups maintainer, working for facebook with the people who did the PSI stuff, kernel mm guy). Here's the gist:
- earlyoom might be OK as a short-term stopgap if people really want to hurry something, as long as it watches only swap depletion (which it pretty much does already). But it should then also determine what to kill taking the swap use into account and little else (which it apparently does not). This doesn't make any sense to have though if there is no swap.
- Don't bother with the OOM score the kernel calculates for processes, it doesn't take the swap use into account. That said, do take the configurable OOM score *adjustment* into account, so that processes which set that are respected, i.e. journald, udevd, and such (or in other words, ignore /proc/$PID/oom_score, but respect /proc/$PID/oom_score_adj).
- going down to 100ms poll intervals is a bad idea, 1s is sufficient, maybe higher.
- facebook is working on making oomd something that just works for everyone, they are in the final rounds of canonicalizing the configuration so that it can just work for all workloads without tuning. The last bits for this to be deployable are currently being done on the kernel side ("iocost"), when that's in, they'll submit oomd (or simplified parts of it) to systemd, so that it's just there and works. It's their expressive intention to make this something that also works for desktop stuff and requires no further tuning. they also will do the systemd work necessary. time frame: half a year, maybe one year, but no guarantees.
- oomd currently polls some parameters in time intervals too, still. They are working on getting rid of that too, so that everything is event based via PSI. Given their own focus on servers it's not a primary goal, but still a goal.
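The oom_score_adj advice above (ignore the kernel's computed oom_score, respect the adjustment, rank by swap use) can be sketched like this. A hypothetical illustration only, not oomd or earlyoom code; the input format and the -100 protection cutoff are assumptions:

```python
def eligible_victims(procs, protect_below=-100):
    """Rank kill candidates by swap use, skipping protected services.

    `procs` maps pid -> (oom_score_adj, swap_kb); the format is a
    hypothetical input, as is the protect_below cutoff. Services that
    lower their oom_score_adj (journald at -250, udevd at -1000) are
    never picked; remaining candidates are ordered by swap use,
    heaviest first."""
    candidates = [(swap, pid) for pid, (adj, swap) in procs.items()
                  if adj > protect_below]
    return [pid for swap, pid in sorted(candidates, reverse=True)]
```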
Or in other words: oomd is the way to go in the long run, developed alongside the kernel features backing it. You can use it already if you like, but there are still too many knobs for generic deployment. earlyoom might be a valid temporary stopgap if you want to hurry this.
(And now I hope I paraphrased everything he said more or less correctly...)
if you want to know more about fb's oomd: https://cfp.all-systems-go.io/ASG2019/talk/DQX3DH/
(but before this will enter systemd it's gonna be dumbed down, i.e, less configuration, more "just works")
Lennart
-- Lennart Poettering, Berlin
On Mon, Jan 6, 2020 at 7:10 PM Lennart Poettering mzerqung@0pointer.de wrote:
- going down to 100ms poll intervals is a bad idea, 1s is sufficient, maybe higher.
According to the project readme, the query interval is 100ms only if the lack of free RAM starts to get severe. Otherwise the interval is claimed to be longer. I haven't checked the code, though.
On Mon, Jan 6, 2020 at 11:09 AM Lennart Poettering mzerqung@0pointer.de wrote:
On Mo, 06.01.20 11:22, Michael Catanzaro (mcatanzaro@gnome.org) wrote:
So I talked to Tejun Heo about this (kernel cgroups maintainer, working for facebook with the people who did the PSI stuff, kernel mm guy). Here's the gist:
earlyoom might be OK as short time stopgap if people really want to hurry something, as long as it watches only swap depletion (which it pretty much does already). But it should then also determine what to kill taking the swap use into account and little else (which it apparently does not). This doesn't make any sense to have though if there is no swap.
Don't bother with the OOM score the kernel calculates for processes, it doesn't take the swap use into account. That said, do take the configurable OOM score *adjustment* into account, so that processes which set that are respected, i.e. journald, udevd, and such (or in other words, ignore /proc/$PID/oom_score, but respect /proc/$PID/oom_score_adj).
going down to 100ms poll intervals is a bad idea, 1s is sufficient, maybe higher.
facebook is working on making oomd something that just works for everyone, they are in the final rounds of canonicalizing the configuration so that it can just work for all workloads without tuning. The last bits for this to be deployable are currently being done on the kernel side ("iocost"), when that's in, they'll submit oomd (or simplified parts of it) to systemd, so that it's just there and works. It's their expressive intention to make this something that also works for desktop stuff and requires no further tuning. they also will do the systemd work necessary. time frame: half a year, maybe one year, but no guarantees.
oomd currently polls some parameters in time intervals too, still. They are working on getting rid of that too, so that everything is event based via PSI. Given their own focus on servers it's not a primary goal, but still a goal.
Or in other words: oomd is the way to go in the long run, developed alongside the kernel features backing it. You can use it already if you like, but there are still too many knobs for generic deployment. earlyoom might be a valid temporary stopgap if you want to hurry this.
(And now I hope I paraphrased everything he said more or less correctly...)
Thanks for all of that, and it's consistent with the research and discussion the working group have done in the past 6 months on this subject. What I can't estimate is whether oomd or lmm will be better long term for Fedora Workstation, or if there's an advantage of them co-existing.
And yes the idea is to go a little faster. Earlyoom is easy to take out. And I have no problem with it coming out in fc33 if oomd or (more likely) lmm are ready by then.
On Mon, Jan 6, 2020 at 11:53 am, Chris Murphy lists@colorremedies.com wrote:
And yes the idea is to go a little faster. Earlyoom is easy to take out. And I have no problem with it coming out in fc33 if oomd or (more likely) lmm are ready by then.
Brainstorming: if a systemd-level solution were to be ready in the F33 or F34 timeframe, I'd be OK with just waiting for that. We've had 31 Fedora releases without earlyoom and one or two more isn't the end of the world. Seems easier than installing earlyoom on everybody's computers and then calling "backsies!" a year later.
Of course we would need to monitor progress at the systemd level to make sure this solution is advancing as desired, and fall back to plans for earlyoom if things get off track.
Michael
On Mon, Jan 6, 2020 at 7:09 pm, Lennart Poettering mzerqung@0pointer.de wrote:
- facebook is working on making oomd something that just works for everyone, they are in the final rounds of canonicalizing the configuration so that it can just work for all workloads without tuning. The last bits for this to be deployable are currently being done on the kernel side ("iocost"), when that's in, they'll submit oomd (or simplified parts of it) to systemd, so that it's just there and works. It's their expressive intention to make this something that also works for desktop stuff and requires no further tuning. they also will do the systemd work necessary. time frame: half a year, maybe one year, but no guarantees.
Asking around, I understand oomd only operates at the cgroup level, i.e. it kills an entire cgroup at once, not individual processes. So I understand this would also depend on GNOME-level work to ensure individual applications get launched in their own systemd scopes, yes?
Michael
On Mon, Jan 06, 2020 at 02:53:13PM -0600, Michael Catanzaro wrote:
On Mon, Jan 6, 2020 at 7:09 pm, Lennart Poettering mzerqung@0pointer.de wrote:
- facebook is working on making oomd something that just works for
everyone, they are in the final rounds of canonicalizing the configuration so that it can just work for all workloads without tuning. The last bits for this to be deployable are currently being done on the kernel side ("iocost"), when that's in, they'll submit oomd (or simplified parts of it) to systemd, so that it's just there and works. It's their expressive intention to make this something that also works for desktop stuff and requires no further tuning. they also will do the systemd work necessary. time frame: half a year, maybe one year, but no guarantees.
Asking around, I understand oomd only operates at the cgroup level, i.e. it kills an entire cgroup at once, not individual processes. So I understand this would also depend on GNOME-level work to ensure individual applications get launched in their own systemd scopes, yes?
I wanted to ask about this too... but didn't know where ;) As of today, gnome-shell in F31 seems to start almost everything as separate systemd user scopes:
- various services started automatically like /usr/libexec/gsd-power, /usr/libexec/gsd-sound, etc.
- flatpaks (this seems to be new, I had them running under gnome-shell-wayland.service last week!)
Stuff started from the run dialog (alt-f2) and from the overview still seems to land in gnome-shell-wayland.service, but maybe this is fixed in gnome-shell 3.35?
Another issue is that things that are started through the gnome terminal also land in gnome-terminal-server.service. They need to get their own scopes to make resource allocation robust.
It seems we're quite close! Do we just need to wait for another gnome release and then we'll have everything nicely segregated?
Zbyszek
On Tue, 2020-01-07 at 09:47 +0000, Zbigniew Jędrzejewski-Szmek wrote:
On Mon, Jan 06, 2020 at 02:53:13PM -0600, Michael Catanzaro wrote:
On Mon, Jan 6, 2020 at 7:09 pm, Lennart Poettering mzerqung@0pointer.de wrote:
- facebook is working on making oomd something that just works for
everyone, they are in the final rounds of canonicalizing the configuration so that it can just work for all workloads without tuning. The last bits for this to be deployable are currently being done on the kernel side ("iocost"), when that's in, they'll submit oomd (or simplified parts of it) to systemd, so that it's just there and works. It's their expressive intention to make this something that also works for desktop stuff and requires no further tuning. they also will do the systemd work necessary. time frame: half a year, maybe one year, but no guarantees.
Asking around, I understand oomd only operates at the cgroup level, i.e. it kills an entire cgroup at once, not individual processes. So I understand this would also depend on GNOME-level work to ensure individual applications get launched in their own systemd scopes, yes?
I wanted to ask about this too... but didn't know where ;) As of today, gnome-shell in F31 seems to start almost everything as separate systemd user scopes:
various services started automatically like /usr/libexec/gsd-power, /usr/libexec/gsd-sound, etc.
flatpaks (this seems to be new, I had them running under gnome-shell-wayland.service last week!)
Hmm, pretty sure flatpaks have always created their own scopes.
Stuff started from the run dialog (alt-f2) and from the overview still seems to land in gnome-shell-wayland.service, but maybe this is fixed in gnome-shell 3.35?
This should have changed with the gnome-shell 3.34.2 update in Fedora 31. It may be that it has not reached rawhide yet though.
Another issue is that things that are started through the gnome terminal also land in gnome-terminal-server.service. They need to get their own scopes to make resource allocation robust.
Do you think we should just place each VT into its own scope?
That seems like a reasonable start in principle, though graphical applications launched from the terminal may still not be moved into their own scope then.
It seems we're quite close! Do we just need to wait for another gnome release and then we'll have everything nicely segregated?
Likely not perfect, but hopefully close enough for many purposes :)
Benjamin
On Tue, Jan 07, 2020 at 11:07:49AM +0100, Benjamin Berg wrote:
On Tue, 2020-01-07 at 09:47 +0000, Zbigniew Jędrzejewski-Szmek wrote:
On Mon, Jan 06, 2020 at 02:53:13PM -0600, Michael Catanzaro wrote:
On Mon, Jan 6, 2020 at 7:09 pm, Lennart Poettering mzerqung@0pointer.de wrote:
- facebook is working on making oomd something that just works for
everyone, they are in the final rounds of canonicalizing the configuration so that it can just work for all workloads without tuning. The last bits for this to be deployable are currently being done on the kernel side ("iocost"), when that's in, they'll submit oomd (or simplified parts of it) to systemd, so that it's just there and works. It's their expressive intention to make this something that also works for desktop stuff and requires no further tuning. they also will do the systemd work necessary. time frame: half a year, maybe one year, but no guarantees.
Asking around, I understand oomd only operates at the cgroup level, i.e. it kills an entire cgroup at once, not individual processes. So I understand this would also depend on GNOME-level work to ensure individual applications get launched in their own systemd scopes, yes?
I wanted to ask about this too... but didn't know where ;) As of today, gnome-shell in F31 seems to start almost everything as separate systemd user scopes:
various services started automatically like /usr/libexec/gsd-power, /usr/libexec/gsd-sound, etc.
flatpaks (this seems to be new, I had them running under gnome-shell-wayland.service last week!)
Hmm, pretty sure flatpaks have always created their own scopes.
I'm quoting from my mail from this same thread:
│ ├─gnome-shell-wayland.service
│ │ ├─1501571 /usr/bin/gnome-shell
│ │ ├─1501606 /usr/bin/Xwayland :0 -rootless -noreset -accessx -core -auth /run/user/1000/.mutter-Xwaylandauth.SCXID0 -listen 4 -listen 5 -displayfd 6
│ │ ├─1501713 ibus-daemon --panel disable -r --xim
│ │ ├─1501718 /usr/libexec/ibus-dconf
│ │ ├─1501719 /usr/libexec/ibus-extension-gtk3
│ │ ├─1501724 /usr/libexec/ibus-x11 --kill-daemon
│ │ ├─1501980 /usr/libexec/ibus-engine-simple
│ │ ├─1503586 /usr/lib64/firefox/firefox
│ │ ├─1503691 /usr/lib64/firefox/firefox -contentproc -childID 2 -isForBrowser ...
│ │ ├─1503701 /usr/lib64/firefox/firefox -contentproc -childID 3 -isForBrowser ...
│ │ ├─1503747 /usr/lib64/firefox/firefox -contentproc -childID 4 -isForBrowser ...
│ │ ├─1520219 bwrap --args 35 telegram-desktop --
│ │ ├─1520229 bwrap --args 35 xdg-dbus-proxy --args=37
│ │ ├─1520230 xdg-dbus-proxy --args=37
│ │ ├─1520232 bwrap --args 35 telegram-desktop --
│ │ ├─1520233 /app/bin/Telegram --
│ │ ├─1540753 pavucontrol
...
So maybe a bug? I'll keep watching if it happens again.
Stuff started from the run dialog (alt-f2) and from the overview still seems to land in gnome-shell-wayland.service, but maybe this is fixed in gnome-shell 3.35?
This should have changed with the gnome-shell 3.34.2 update in Fedora 31. It may be that it has not reached rawhide yet though.
I'm still on gnome-shell-3.34.1-4.fc31.x86_64. I'll try the latest version.
Another issue is that things that are started through the gnome terminal also land in gnome-terminal-server.service. They need to get their own scopes to make resource allocation robust.
Do you think we should just place each VT into its own scope?
Yes. Everything starting at the shell (or whatever command is configured as the "payload") should get its own scope, with a separate set of resources from gnome-terminal-server.service.
That seems like a reasonable start in principle, though graphical applications launched from the terminal may still not be moved into their own scope then.
I think it is OK. After all, starting graphical applications from the terminal is a special case. If desired, the user may run 'systemd-run --user foo' if they want to segregate it. (Actually, we might teach some apps to put themselves into a scope when started from a command line. This makes sense for stuff like firefox, but also screen/tmux and others. But I consider this a completely separate issue.)
Zbyszek
It seems we're quite close! Do we just need to wait for another gnome release and then we'll have everything nicely segregated?
Likely not perfect, but hopefully close enough for many purposes :)
Benjamin
On Tue, 2020-01-07 at 10:21 +0000, Zbigniew Jędrzejewski-Szmek wrote:
On Tue, Jan 07, 2020 at 11:07:49AM +0100, Benjamin Berg wrote:
On Tue, 2020-01-07 at 09:47 +0000, Zbigniew Jędrzejewski-Szmek wrote:
I wanted to ask about this too... but didn't know where ;) As of today, gnome-shell in F31 seems to start almost everything as separate systemd user scopes:
- various services started automatically like /usr/libexec/gsd-
power, /usr/libexec/gsd-sound, etc.
- flatpaks (this seems to be new, I had them running under gnome-shell-wayland.service last week!)
Hmm, pretty sure flatpaks have always created their own scopes.
I'm quoting from my mail from this same thread:
│ ├─gnome-shell-wayland.service
│ │ ├─1501571 /usr/bin/gnome-shell
│ │ ├─1501606 /usr/bin/Xwayland :0 -rootless -noreset -accessx -core -auth /run/user/1000/.mutter-Xwaylandauth.SCXID0 -listen 4 -listen 5 -displayfd 6
│ │ ├─1501713 ibus-daemon --panel disable -r --xim
│ │ ├─1501718 /usr/libexec/ibus-dconf
│ │ ├─1501719 /usr/libexec/ibus-extension-gtk3
│ │ ├─1501724 /usr/libexec/ibus-x11 --kill-daemon
│ │ ├─1501980 /usr/libexec/ibus-engine-simple
│ │ ├─1503586 /usr/lib64/firefox/firefox
│ │ ├─1503691 /usr/lib64/firefox/firefox -contentproc -childID 2 -isForBrowser ...
│ │ ├─1503701 /usr/lib64/firefox/firefox -contentproc -childID 3 -isForBrowser ...
│ │ ├─1503747 /usr/lib64/firefox/firefox -contentproc -childID 4 -isForBrowser ...
│ │ ├─1520219 bwrap --args 35 telegram-desktop --
│ │ ├─1520229 bwrap --args 35 xdg-dbus-proxy --args=37
│ │ ├─1520230 xdg-dbus-proxy --args=37
│ │ ├─1520232 bwrap --args 35 telegram-desktop --
│ │ ├─1520233 /app/bin/Telegram --
│ │ ├─1540753 pavucontrol
...
(Oh, what is the command to get this output?)
This is not what I am seeing here. My gnome-shell cgroup only contains the gnome-shell, Xwayland and ibus. And I have separate .scope units for Telegram (flatpak-org.telegram.desktop-2162569.scope), evolution (gnome-launched-org.gnome.Epiphany.desktop-2162690.scope), …
And I am pretty sure that flatpak/bwrap has always taken care of scoping flatpaks correctly.
Benjamin
So maybe a bug? I'll keep watching if it happens again.
Stuff started from the run dialog (alt-f2) and from the overview still seems to land in gnome-shell-wayland.service, but maybe this is fixed in gnome-shell 3.35?
This should have changed with the gnome-shell 3.34.2 update in Fedora 31. It may be that it has not reached rawhide yet though.
I'm still on gnome-shell-3.34.1-4.fc31.x86_64. I'll try the latest version.
Another issue is that things that are started through the gnome terminal also land in gnome-terminal-server.service. They need to get their own scopes to make resource allocation robust.
Do you think we should just place each VT into its own scope?
Yes. Everything starting at the shell (or whatever command is configured as the "payload") should get its own scope, with a separate set of resources from gnome-terminal-server.service.
That seems like a reasonable start in principle, though graphical applications launched from the terminal may still not be moved into their own scope then.
I think it is OK. After all, starting graphical applications from the terminal is a special case. If desired, the user may run 'systemd-run --user foo' if they want to segregate it. (Actually, we might teach some apps to put themselves into a scope when started from a command line. This makes sense for stuff like firefox, but also screen/tmux and others. But I consider this a completely separate issue.)
Zbyszek
It seems we're quite close! Do we just need to wait for another gnome release and then we'll have everything nicely segregated?
Likely not perfect, but hopefully close enough for many purposes :)
Benjamin
On Tue, 2020-01-07 at 11:44 +0100, Benjamin Berg wrote:
On Tue, 2020-01-07 at 10:21 +0000, Zbigniew Jędrzejewski-Szmek wrote:
I'm quoting from my mail from this same thread:
│ ├─gnome-shell-wayland.service
│ │ ├─1501571 /usr/bin/gnome-shell
│ │ ├─1501606 /usr/bin/Xwayland :0 -rootless -noreset -accessx -core -auth /run/user/1000/.mutter-Xwaylandauth.SCXID0 -listen 4 -listen 5 -displayfd 6
│ │ ├─1501713 ibus-daemon --panel disable -r --xim
│ │ ├─1501718 /usr/libexec/ibus-dconf
│ │ ├─1501719 /usr/libexec/ibus-extension-gtk3
│ │ ├─1501724 /usr/libexec/ibus-x11 --kill-daemon
│ │ ├─1501980 /usr/libexec/ibus-engine-simple
│ │ ├─1503586 /usr/lib64/firefox/firefox
│ │ ├─1503691 /usr/lib64/firefox/firefox -contentproc -childID 2 -isForBrowser ...
│ │ ├─1503701 /usr/lib64/firefox/firefox -contentproc -childID 3 -isForBrowser ...
│ │ ├─1503747 /usr/lib64/firefox/firefox -contentproc -childID 4 -isForBrowser ...
│ │ ├─1520219 bwrap --args 35 telegram-desktop --
│ │ ├─1520229 bwrap --args 35 xdg-dbus-proxy --args=37
│ │ ├─1520230 xdg-dbus-proxy --args=37
│ │ ├─1520232 bwrap --args 35 telegram-desktop --
│ │ ├─1520233 /app/bin/Telegram --
│ │ ├─1540753 pavucontrol
...
(Oh, what is the command to get this output?)
Aha, systemd-cgls, and this should be a normal F31:
│ │ ├─gnome-shell-wayland.service
│ │ │ ├─2160536 /usr/bin/gnome-shell
│ │ │ ├─2160575 /usr/bin/Xwayland :0 -rootless -noreset -accessx -core -auth /run/user/1000/.mutter-Xwaylandauth.4R9ED0 -listen 4 -listen 5 -displayfd 6
│ │ │ ├─2160744 ibus-daemon --panel disable -r --xim
│ │ │ ├─2160754 /usr/libexec/ibus-dconf
│ │ │ ├─2160755 /usr/libexec/ibus-extension-gtk3
│ │ │ ├─2160759 /usr/libexec/ibus-x11 --kill-daemon
│ │ │ └─2160998 /usr/libexec/ibus-engine-simple
Benjamin
On 1/7/20 11:07 AM, Benjamin Berg wrote:
On Tue, 2020-01-07 at 09:47 +0000, Zbigniew Jędrzejewski-Szmek wrote:
On Mon, Jan 06, 2020 at 02:53:13PM -0600, Michael Catanzaro wrote:
On Mon, Jan 6, 2020 at 7:09 pm, Lennart Poettering mzerqung@0pointer.de wrote:
- facebook is working on making oomd something that just works for
everyone, they are in the final rounds of canonicalizing the configuration so that it can just work for all workloads without tuning. The last bits for this to be deployable are currently being done on the kernel side ("iocost"), when that's in, they'll submit oomd (or simplified parts of it) to systemd, so that it's just there and works. It's their expressive intention to make this something that also works for desktop stuff and requires no further tuning. they also will do the systemd work necessary. time frame: half a year, maybe one year, but no guarantees.
Asking around, I understand oomd only operates at the cgroup level, i.e. it kills an entire cgroup at once, not individual processes. So I understand this would also depend on GNOME-level work to ensure individual applications get launched in their own systemd scopes, yes?
I wanted to ask about this too... but didn't know where ;) As of today, gnome-shell in F31 seems to start almost everything as separate systemd user scopes:
various services started automatically like /usr/libexec/gsd-power, /usr/libexec/gsd-sound, etc.
flatpaks (this seems to be new, I had them running under gnome-shell-wayland.service last week!)
Hmm, pretty sure flatpaks have always created their own scopes.
Stuff started from the run dialog (alt-f2) and from the overview still seems to land in gnome-shell-wayland.service, but maybe this is fixed in gnome-shell 3.35?
This should have changed with the gnome-shell 3.34.2 update in Fedora 31. It may be that it has not reached rawhide yet though.
Just had a look at awesome; all applications seem to be in the same cgroup, according to systemd-cgtop. Thus if the whole cgroup were killed, that means rather than stopping firefox if it uses too much memory, my whole session would be terminated.
Another issue is that things that are started through the gnome terminal also land in gnome-terminal-server.service. They need to get their own scopes to make resource allocation robust.
Do you think we should just place each VT into its own scope?
That seems like a reasonable start in principle, though graphical applications launched from the terminal may still not be moved into their own scope then.
It seems we're quite close! Do we just need to wait for another gnome release and then we'll have everything nicely segregated?
Likely not perfect, but hopefully close enough for many purposes :)
Benjamin
On Mo, 06.01.20 14:53, Michael Catanzaro (mcatanzaro@gnome.org) wrote:
On Mon, Jan 6, 2020 at 7:09 pm, Lennart Poettering mzerqung@0pointer.de wrote:
- facebook is working on making oomd something that just works for everyone, they are in the final rounds of canonicalizing the configuration so that it can just work for all workloads without tuning. The last bits for this to be deployable are currently being done on the kernel side ("iocost"), when that's in, they'll submit oomd (or simplified parts of it) to systemd, so that it's just there and works. It's their expressive intention to make this something that also works for desktop stuff and requires no further tuning. they also will do the systemd work necessary. time frame: half a year, maybe one year, but no guarantees.
Asking around, I understand oomd only operates at the cgroup level, i.e. it kills an entire cgroup at once, not individual processes. So I understand this would also depend on GNOME-level work to ensure individual applications get launched in their own systemd scopes, yes?
That would be a good idea, yes. But there'd be a knob for that in the unit files.
I mean, OOMPolicy= currently can be set to "stop", "kill" or "continue", where "stop" means "when a process of service X is OOM killed, attempt to shutdown all of X in a friendly way"; "kill" means "when a process of service X is OOM killed, forcibly kill all other processes of X too"; "continue" means "if a process of service X is OOM killed, do nothing else".
The expectation here is that most services will want "stop", but services that are more "application servers" than an individual service (think: apache with its cgi scripts, or crond with its cronjobs) would set OOMPolicy=continue, since if one of their jobs misbehaves they probably should continue running.
But yeah, the focus where things are going are clearly towards making a cgroup the unit that is managed as a whole.
Lennart
-- Lennart Poettering, Berlin
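The OOMPolicy= values described above could be exercised with a drop-in like the following sketch (the unit name is hypothetical; OOMPolicy= itself is the real systemd knob):

```ini
# /etc/systemd/system/my-job-runner.service.d/oom.conf
# Hypothetical "application server"-style unit, per Lennart's example.
[Service]
# Keep the service itself running even if one of its jobs is OOM-killed.
OOMPolicy=continue
```

After editing, `systemctl daemon-reload` would be needed for the drop-in to take effect.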
On Tue, 2020-01-07 at 11:28 +0100, Lennart Poettering wrote:
On Mo, 06.01.20 14:53, Michael Catanzaro (mcatanzaro@gnome.org) wrote:
On Mon, Jan 6, 2020 at 7:09 pm, Lennart Poettering mzerqung@0pointer.de wrote:
- facebook is working on making oomd something that just works for everyone. They are in the final rounds of canonicalizing the configuration so that it can just work for all workloads without tuning. The last bits for this to be deployable are currently being done on the kernel side ("iocost"); when that's in, they'll submit oomd (or simplified parts of it) to systemd, so that it's just there and works. It's their express intention to make this something that also works for desktop stuff and requires no further tuning. They also will do the systemd work necessary. Time frame: half a year, maybe one year, but no guarantees.
Asking around, I understand oomd only operates at the cgroup level, i.e. it kills an entire cgroup at once, not individual processes. So I understand this would also depend on GNOME-level work to ensure individual applications get launched in their own systemd scopes, yes?
That would be a good idea, yes. But there'd be a knob for that in the unit files.
I mean, OOMPolicy= currently can be set to "stop", "kill" or "continue", where "stop" means "when a process of service X is OOM killed, attempt to shutdown all of X in a friendly way"; "kill" means "when a process of service X is OOM killed, forcibly kill all other processes of X too"; "continue" means "if a process of service X is OOM killed, do nothing else".
Yep, changing the OOMPolicy was considered at first. But creating new scopes for spawned children is simple enough and it also solves some other issues (e.g. not killing children when gnome-shell is restarted).
The expectation here is that most services will want "stop", but services that are more "application servers" than an individual service (think: apache with its cgi scripts, or crond with its cronjobs) would set OOMPolicy=continue, since if one of their jobs misbehaves they probably should continue running.
But yeah, the focus where things are going are clearly towards making a cgroup the unit that is managed as a whole.
Benjamin
Hi,
[resend this older message for the list]
On Mon, 2020-01-06 at 14:53 -0600, Michael Catanzaro wrote:
On Mon, Jan 6, 2020 at 7:09 pm, Lennart Poettering mzerqung@0pointer.de wrote:
- facebook is working on making oomd something that just works for everyone. They are in the final rounds of canonicalizing the configuration so that it can just work for all workloads without tuning. The last bits for this to be deployable are currently being done on the kernel side ("iocost"); when that's in, they'll submit oomd (or simplified parts of it) to systemd, so that it's just there and works. It's their express intention to make this something that also works for desktop stuff and requires no further tuning. They also will do the systemd work necessary. Time frame: half a year, maybe one year, but no guarantees.
Asking around, I understand oomd only operates at the cgroup level, i.e. it kills an entire cgroup at once, not individual processes. So I understand this would also depend on GNOME-level work to ensure individual applications get launched in their own systemd scopes, yes?
Even if that is the case, on F31 (with GNOME 3.34.2) we do place most user processes into separate scopes[1]. This is not perfect, because it currently only affects processes launched by gnome-shell, gnome-settings-daemon and gnome-session. So everything spawned by e.g. nautilus (easily fixable) or the terminal may still end up in its parent's scope.
But, I would say the cgroup separation is pretty much good enough already. So even if it is a requirement, I would not worry about it beyond making sure that some applications like nautilus get fixed.
Benjamin
[1] They are named gnome-launched-X-Y.scope and get bound to the lifetime of the session using a drop-in. Personally I also added a drop-in to limit memory consumption for Evolution that way. It tends to just disappear sometimes now. Which is kind of neat but it would be nice to also get a notification.
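A sketch of what such a per-scope drop-in might look like (the path, the truncated-name drop-in directory, and the 2G figure are all illustrative assumptions, not the actual configuration described above):

```ini
# ~/.config/systemd/user/gnome-launched-evolution-.scope.d/memory.conf
# Hypothetical drop-in limiting an application's scope; the real scope
# name follows the gnome-launched-X-Y.scope pattern mentioned above.
[Scope]
# Hard cap: the kernel OOM-kills within this cgroup once it is exceeded,
# which matches the "it tends to just disappear" behavior described.
MemoryMax=2G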
On Mon, Jan 6, 2020 at 7:09 pm, Lennart Poettering mzerqung@0pointer.de wrote:
- oomd currently polls some parameters in time intervals too, still. They are working on getting rid of that too, so that everything is event based via PSI. Given their own focus on servers it's not a primary goal, but still a goal.
Alexey seems really pessimistic about PSI. It looks like he expects any solution based on PSI will fail:
https://pagure.io/fedora-workstation/issue/98#comment-619086
So that seems like the most important problem right now. Looks like Benjamin has already solved the problem of isolating apps into separate systemd scopes. Alexey's concern about browser tabs is similarly solvable. But if PSI in general is too difficult to configure, this plan isn't going to work.
On Di, 07.01.20 09:27, Michael Catanzaro (mcatanzaro@gnome.org) wrote:
On Mon, Jan 6, 2020 at 7:09 pm, Lennart Poettering mzerqung@0pointer.de wrote:
- oomd currently polls some parameters in time intervals too, still. They are working on getting rid of that too, so that everything is event based via PSI. Given their own focus on servers it's not a primary goal, but still a goal.
Alexey seems really pessimistic about PSI. It looks like he expects any solution based on PSI will fail:
https://pagure.io/fedora-workstation/issue/98#comment-619086
So that seems like the most important problem right now. Looks like Benjamin has already solved the problem of isolating apps into separate systemd scopes. Alexey's concern about browser tabs is similarly solvable. But if PSI in general is too difficult to configure, this plan isn't going to work.
Well, I personally certainly trust Tejun to deliver if he says he'll deliver. He has a pretty good track record, and it's his explicit goal to make this stuff work.
Lennart
-- Lennart Poettering, Berlin
On Mon, Jan 6, 2020 at 11:09 AM Lennart Poettering mzerqung@0pointer.de wrote:
- facebook is working on making oomd something that just works for everyone. They are in the final rounds of canonicalizing the configuration so that it can just work for all workloads without tuning. The last bits for this to be deployable are currently being done on the kernel side ("iocost"); when that's in, they'll submit oomd (or simplified parts of it) to systemd, so that it's just there and works. It's their express intention to make this something that also works for desktop stuff and requires no further tuning. They also will do the systemd work necessary. Time frame: half a year, maybe one year, but no guarantees.
Looks like PSI-based OOM killing doesn't work without swap. Therefore oomd can't be considered a universal solution. Quite a lot of developers have workstations with a decent amount of RAM, ~64GiB, and do not use swap at all. Bare-metal servers are likewise mixed, depending on workload, and in the cloud it's rare for swap to exist.
https://github.com/facebookincubator/oomd/issues/80
We think earlyoom can be adjusted to work well for both the swap and no swap use cases.
On Wed, 2020-01-08 at 12:24 -0700, Chris Murphy wrote:
On Mon, Jan 6, 2020 at 11:09 AM Lennart Poettering mzerqung@0pointer.de wrote:
- facebook is working on making oomd something that just works for everyone. They are in the final rounds of canonicalizing the configuration so that it can just work for all workloads without tuning. The last bits for this to be deployable are currently being done on the kernel side ("iocost"); when that's in, they'll submit oomd (or simplified parts of it) to systemd, so that it's just there and works. It's their express intention to make this something that also works for desktop stuff and requires no further tuning. They also will do the systemd work necessary. Time frame: half a year, maybe one year, but no guarantees.
Looks like PSI-based OOM killing doesn't work without swap. Therefore oomd can't be considered a universal solution. Quite a lot of developers have workstations with a decent amount of RAM, ~64GiB, and do not use swap at all. Bare-metal servers are likewise mixed, depending on workload, and in the cloud it's rare for swap to exist.
https://github.com/facebookincubator/oomd/issues/80
We think earlyoom can be adjusted to work well for both the swap and no swap use cases.
But so can oomd; after all, they are willing to implement a plugin that uses the MemAvailable heuristic. It just won't be available in the short term.
In principle, I think what we are trying to achieve here is to keep the system mostly responsive from a user perspective. This seems to imply keeping pages in main memory that belong to "important" processes.
Should oomd not manage to do this well enough out of the box, then I see two main methods we have to improve things:
* Aggressively kill when we think important pages might get evicted
  - earlyoom does this based on MemAvailable
  - oomd plugin could do the same if deemed the right thing
* Actively protect important processes[1]:
  - set MemoryMin, MemoryLow on important units
  - limit "normal" processes more, e.g. MemoryHigh for applications
  - in the long run: adjust the OOMScore/MemoryHigh dynamically based on whether the user is interacting with an application at the time
earlyoom does the first and has the big advantage that it can be shipped in F32. However, it is not clear to me that this aggressive heuristic is actually better overall. And even if it is, we would likely still move it into oomd in the long run.
Finally, for F32 we might already be able to improve things quite a lot simply by setting a few configuration options in GNOME systemd units.
Benjamin
[1] I do not know how well this works, so it may be nice if people experimented with it[2]. For GNOME you can easily add a systemd drop-in for various services. e.g. to protect the shell (in a wayland session) simply do:
$ systemctl edit --user gnome-shell-wayland.service

[Service]
MemoryMin=250M
MemoryLow=500M
Which I suspect should already help a lot in many scenarios.
[2] Unfortunately, I guess that such measurements may be skewed a lot on systems that use swap, due to unrelated lags; see e.g. Jan Grulich's mail from earlier today titled "Lagging system with latest kernels".
On Thu, Jan 9, 2020 at 5:58 AM Benjamin Berg bberg@redhat.com wrote:
On Wed, 2020-01-08 at 12:24 -0700, Chris Murphy wrote:
On Mon, Jan 6, 2020 at 11:09 AM Lennart Poettering mzerqung@0pointer.de wrote:
- facebook is working on making oomd something that just works for everyone. They are in the final rounds of canonicalizing the configuration so that it can just work for all workloads without tuning. The last bits for this to be deployable are currently being done on the kernel side ("iocost"); when that's in, they'll submit oomd (or simplified parts of it) to systemd, so that it's just there and works. It's their express intention to make this something that also works for desktop stuff and requires no further tuning. They also will do the systemd work necessary. Time frame: half a year, maybe one year, but no guarantees.
Looks like PSI-based OOM killing doesn't work without swap. Therefore oomd can't be considered a universal solution. Quite a lot of developers have workstations with a decent amount of RAM, ~64GiB, and do not use swap at all. Bare-metal servers are likewise mixed, depending on workload, and in the cloud it's rare for swap to exist.
https://github.com/facebookincubator/oomd/issues/80
We think earlyoom can be adjusted to work well for both the swap and no swap use cases.
But so can oomd; after all, they are willing to implement a plugin that uses the MemAvailable heuristic. It just won't be available in the short term.
In principle, I think what we are trying to achieve here is to keep the system mostly responsive from a user perspective. This seems to imply keeping pages in main memory that belong to "important" processes.
Right, not merely clobbering a process the user ostensibly wants to run and complete. They just don't want it to take over.
Should oomd not manage to do this well enough out of the box, then I see two main methods we have to improve things:
- Aggressively kill when we think important pages might get evicted
- earlyoom does this based on MemAvailable
- oomd plugin could do the same if deemed the right thing
- Actively protect important processes[1]:
- set MemoryMin, MemoryLow on important units
- limit "normal" processes more e.g. MemoryHigh for applications
- in the long run: adjust the OOMScore/MemoryHigh dynamically based on whether the user is interacting with an application at the time
earlyoom does the first and has the big advantage that it can be shipped in F32. However, it is not clear to me that this aggressive heuristic is actually better overall. And even if it is, we would likely still move it into oomd in the long run.
I agree, although the decisions made in this release cycle can really only be made based on what we know now. Earlyoom has a chance of making this a better experience in the case where something really should be OOM-killed, just sooner than the kernel's oom-killer would have done it. It doesn't solve the unresponsiveness problem that happens once RAM is full but before swap reaches 10%. In any case, it's not going on a process kill spree, and it's not going to magically free up a system every time it's under swap duress.
I've got cases where a system is under significant duress with only 50% swap use - earlyoom does nothing for that.
Finally, for F32 we might already be able to improve things quite a lot simply by setting a few configuration options in GNOME systemd units.
Maybe. What are the risks? Is it fair to characterize this as more of an optimization of existing functionality than a feature? That's a technical question. Of course, if this improves responsiveness of the system while under swap thrashing, it's definitely a marketable feature!
On-going related discussions on linux-mm@ list
user space unresponsive, followup: lsf/mm congestion https://marc.info/?t=157842912200003&r=1&w=2
The gist here is that the kernel is working as expected. The process is asking for resources that don't exist, and the kernel can't really assume either the workload or the user's intent or wish. Maybe they want the system to finish the task even if it's unusable?
OOM killer not nearly aggressive enough? https://marc.info/?t=157842987000001&r=1&w=2
Not clear where it's stuck, reclaim or waiting on a lock - more info needed.
--- Chris Murphy
On Mi, 08.01.20 12:24, Chris Murphy (lists@colorremedies.com) wrote:
On Mon, Jan 6, 2020 at 11:09 AM Lennart Poettering mzerqung@0pointer.de wrote:
- facebook is working on making oomd something that just works for everyone. They are in the final rounds of canonicalizing the configuration so that it can just work for all workloads without tuning. The last bits for this to be deployable are currently being done on the kernel side ("iocost"); when that's in, they'll submit oomd (or simplified parts of it) to systemd, so that it's just there and works. It's their express intention to make this something that also works for desktop stuff and requires no further tuning. They also will do the systemd work necessary. Time frame: half a year, maybe one year, but no guarantees.
Looks like PSI-based OOM killing doesn't work without swap. Therefore oomd can't be considered a universal solution. Quite a lot of developers have workstations with a decent amount of RAM, ~64GiB, and do not use swap at all. Bare-metal servers are likewise mixed, depending on workload, and in the cloud it's rare for swap to exist.
https://github.com/facebookincubator/oomd/issues/80
We think earlyoom can be adjusted to work well for both the swap and no swap use cases.
Isn't earlyoom also watching the swap metrics only?
Lennart
-- Lennart Poettering, Berlin
On Fri, Jan 10, 2020 at 2:05 AM Lennart Poettering mzerqung@0pointer.de wrote:
On Mi, 08.01.20 12:24, Chris Murphy (lists@colorremedies.com) wrote:
On Mon, Jan 6, 2020 at 11:09 AM Lennart Poettering mzerqung@0pointer.de wrote:
- facebook is working on making oomd something that just works for everyone. They are in the final rounds of canonicalizing the configuration so that it can just work for all workloads without tuning. The last bits for this to be deployable are currently being done on the kernel side ("iocost"); when that's in, they'll submit oomd (or simplified parts of it) to systemd, so that it's just there and works. It's their express intention to make this something that also works for desktop stuff and requires no further tuning. They also will do the systemd work necessary. Time frame: half a year, maybe one year, but no guarantees.
Looks like PSI-based OOM killing doesn't work without swap. Therefore oomd can't be considered a universal solution. Quite a lot of developers have workstations with a decent amount of RAM, ~64GiB, and do not use swap at all. Bare-metal servers are likewise mixed, depending on workload, and in the cloud it's rare for swap to exist.
https://github.com/facebookincubator/oomd/issues/80
We think earlyoom can be adjusted to work well for both the swap and no swap use cases.
Isn't earlyoom also watching the swap metrics only?
No, memory free and swap free, as a percentage. Super simplistic. If there is no swap, then the percent only applies to MemAvailable.
https://pagure.io/fedora-workstation/issue/119#comment-619749
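As a rough sketch of that check (this is not earlyoom's actual code; the 10% defaults match its documented thresholds, and the parser here is deliberately simplified):

```python
def low_memory(meminfo_text, mem_min_percent=10, swap_min_percent=10):
    """Approximate earlyoom's trigger condition: act when both
    MemAvailable and SwapFree fall below their percentage thresholds.
    With no swap configured, only the MemAvailable check applies."""
    kb = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            kb[key] = int(rest.split()[0])  # values are in kB
    mem_pct = 100 * kb["MemAvailable"] / kb["MemTotal"]
    if kb.get("SwapTotal", 0) == 0:
        return mem_pct <= mem_min_percent
    swap_pct = 100 * kb["SwapFree"] / kb["SwapTotal"]
    return mem_pct <= mem_min_percent and swap_pct <= swap_min_percent

# Synthetic /proc/meminfo excerpt: 5% of RAM available, no swap.
sample = "MemTotal: 16000000 kB\nMemAvailable: 800000 kB\nSwapTotal: 0 kB\nSwapFree: 0 kB"
print(low_memory(sample))  # prints True
```

On a real system the input would come from reading /proc/meminfo in a loop.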
On Wed, Jan 8, 2020 at 8:25 PM Chris Murphy lists@colorremedies.com wrote:
On Mon, Jan 6, 2020 at 11:09 AM Lennart Poettering mzerqung@0pointer.de wrote:
- facebook is working on making oomd something that just works for everyone. They are in the final rounds of canonicalizing the configuration so that it can just work for all workloads without tuning. The last bits for this to be deployable are currently being done on the kernel side ("iocost"); when that's in, they'll submit oomd (or simplified parts of it) to systemd, so that it's just there and works. It's their express intention to make this something that also works for desktop stuff and requires no further tuning. They also will do the systemd work necessary. Time frame: half a year, maybe one year, but no guarantees.
Looks like PSI-based OOM killing doesn't work without swap. Therefore oomd can't be considered a universal solution. Quite a lot of developers have workstations with a decent amount of RAM, ~64GiB, and do not use swap at all. Bare-metal servers are likewise mixed, depending on workload, and in the cloud it's rare for swap to exist.
https://github.com/facebookincubator/oomd/issues/80
We think earlyoom can be adjusted to work well for both the swap and no swap use cases.
How? On a system with 64GB of RAM and no swap, all it currently does is reduce the amount of usable memory significantly.
On Wednesday, January 8, 2020 12:24:23 PM MST Chris Murphy wrote:
Looks like PSI-based OOM killing doesn't work without swap. Therefore oomd can't be considered a universal solution. Quite a lot of developers have workstations with a decent amount of RAM, ~64GiB, and do not use swap at all. Bare-metal servers are likewise mixed, depending on workload, and in the cloud it's rare for swap to exist.
While it's true that systems without swap exist, they're not common, except in the case of virtual machines, where it depends heavily on the vendor.
On Wed, Jan 15, 2020 at 6:40 PM John M. Harris Jr johnmh@splentity.com wrote:
On Wednesday, January 8, 2020 12:24:23 PM MST Chris Murphy wrote:
Looks like PSI-based OOM killing doesn't work without swap. Therefore oomd can't be considered a universal solution. Quite a lot of developers have workstations with a decent amount of RAM, ~64GiB, and do not use swap at all. Bare-metal servers are likewise mixed, depending on workload, and in the cloud it's rare for swap to exist.
While it's true that systems without swap exist, they're not common, except in the case of virtual machines, where it depends heavily on the vendor.
Nearly all of the servers in $DAYJOB's fleet (thousands of servers) have no swap. In my experience, it's an increasingly common thing.
For now, kernel developers have made it clear they do not care about user space responsiveness. At all. Their concern with kernel oom-killer is strictly with keeping the kernel functioning.
This is false. The stated purpose of the OOM killer is not only to keep the kernel alive. Nor does the fact that the kernel has not solved userspace responsiveness yet imply that kernel folks do not care. Rather, it means that they will not solve it on their own, because the kernel does not have all the information it needs. Kernel folks do care, or we wouldn’t have PSI or cgroups. A userspace solution is needed, but it does not need to replace the OOM killer; cgroups are also a userspace solution, and if earlyoom breaks them, it can make things worse than the status quo.
Can it be done with cgroupv2 and PSI alone? Unclear.
Of course it can. Just run 100 instances of every stress-ng memory worker in a podman container with a cgroup memory limit: the system will not hang. Do the same without the memory limit: the system will hang within seconds and never recover. This demonstrates that cgroups work and do the things they were intended to do.
Try it. With a memory limit,
podman run --rm -it --memory=1G fedora bash -c 'dnf install -y stress-ng && stress-ng --malloc 100 --memcpy 100 --mmap 100 --vm 100'
will use CPU but keep your system responsive. Without the memory limit (this will hang your system),
podman run --rm -it fedora bash -c 'dnf install -y stress-ng && stress-ng --malloc 100 --memcpy 100 --mmap 100 --vm 100'
the system hangs and doesn’t recover after 15 minutes. Same thing with `tail /dev/zero`:
podman run --rm -it --memory=1G fedora tail /dev/zero
activates the OOM killer after three seconds, with
kernel: Memory cgroup out of memory: Killed process 8814 (tail) total-vm:3141408kB, anon-rss:1042028kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:6336512kB oom_score_adj:0
systemd[943]: libpod-e061e1cb57dde204632531a556d37efbd51c9ab67346a8bc4d5e26c7301c165b.scope: A process of this unit has been killed by the OOM killer.
kernel: oom_reaper: reaped process 8814 (tail), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
logged in the system journal. You were saying the OOM killer activates too late and rarely kills the right process? Well, here it activates early enough and knows exactly what to stop. It is worth trying with ninja and WebKit too.
On Tue, Jan 7, 2020 at 5:27 am, Mark Otaris mark@net-c.com wrote:
Try it. With a memory limit,
podman run --rm -it --memory=1G fedora bash -c 'dnf install -y stress-ng && stress-ng --malloc 100 --memcpy 100 --mmap 100 --vm 100'
will use CPU but keep your system responsive. Without the memory limit (this will hang your system),
podman run --rm -it fedora bash -c 'dnf install -y stress-ng && stress-ng --malloc 100 --memcpy 100 --mmap 100 --vm 100'
the system hangs and doesn’t recover after 15 minutes.
I don't think we can use this, though; or at least, I don't see how. systemd allows limiting the memory accessible to a scope, but it doesn't allow carving out memory for one particular scope that is not to be accessible to other scopes. So I don't see a way to use these memory limits to ensure sufficient memory remains available to critical session processes. (Am I missing something?)
On Tue, Jan 07, 2020 at 09:19:47AM -0600, Michael Catanzaro wrote:
On Tue, Jan 7, 2020 at 5:27 am, Mark Otaris mark@net-c.com wrote:
Try it. With a memory limit,
podman run --rm -it --memory=1G fedora bash -c 'dnf install -y stress-ng && stress-ng --malloc 100 --memcpy 100 --mmap 100 --vm 100'
will use CPU but keep your system responsive. Without the memory limit (this will hang your system),
podman run --rm -it fedora bash -c 'dnf install -y stress-ng && stress-ng --malloc 100 --memcpy 100 --mmap 100 --vm 100'
the system hangs and doesn’t recover after 15 minutes.
I don't think we can use this, though; or at least, I don't see how. systemd allows limiting the memory accessible to a scope, but it doesn't allow carving out memory for one particular scope that is not to be accessible to other scopes. So I don't see a way to use these memory limits to ensure sufficient memory remains available to critical session processes. (Am I missing something?)
systemd is just a proxy for the kernel here. The kernel allows memory.min to be set, which is defined as [1]
Hard memory protection. If the memory usage of a cgroup is within its effective min boundary, the cgroup’s memory won’t be reclaimed under any conditions.
There is also memory.low which is weaker:
Best-effort memory protection. If the memory usage of a cgroup is within its effective low boundary, the cgroup’s memory won’t be reclaimed unless there is no reclaimable memory available in unprotected cgroups.
I think that a combination of those two settings could be sufficient to give us appropriate memory protection for a graphical session. I envision the limits being set at machine boot using some simple formula based on the total RAM available and the desktop environment in use.
[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-int...
Zbyszek
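A toy version of such a boot-time formula might look like this (the percentages and floor values are invented placeholders, purely to illustrate the shape of the idea, not a proposed policy):

```python
def session_memory_protection(total_ram_mb):
    # Invented example policy: reserve a slice of RAM for the graphical
    # session's cgroup, with floors so small machines still get something.
    memory_min = max(250, total_ram_mb * 2 // 100)  # hard protection (memory.min)
    memory_low = max(500, total_ram_mb * 5 // 100)  # best-effort (memory.low)
    return {"MemoryMin": f"{memory_min}M", "MemoryLow": f"{memory_low}M"}

# e.g. for a 16 GiB machine
print(session_memory_protection(16384))
```

The resulting values could then be written into a drop-in like the gnome-shell-wayland.service example earlier in the thread.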
On Mon, Jan 6, 2020 at 10:28 PM Mark Otaris mark@net-c.com wrote:
For now, kernel developers have made it clear they do not care about user space responsiveness. At all. Their concern with kernel oom-killer is strictly with keeping the kernel functioning.
This is false. The stated purpose of the OOM killer is not only to keep the kernel alive.
https://lore.kernel.org/linux-fsdevel/20200104090955.GF23195@dread.disaster....
Nor does the fact the kernel has not solved userspace responsiveness yet imply that kernel folks do not care. Rather, it means that they will not solve it on their own because the kernel does not have all the information it needs. Kernel folks do care, or we wouldn’t have PSI or cgroups.
OK.
Can it be done with cgroupv2 and PSI alone? Unclear.
Of course it can. Just run 100 instances of every stress-ng memory worker in a podman container with a cgroup memory limit. The system will not hang.
a. Not everything is running or will run in a container; b. To what degree cgroups and PSI, making no other changes, solves/avoids the problem under discussion is workload dependent. That's stated in the last part of the email response I reference above.
Of course, Fedora Workstation is a general-purpose operating system. It's untenable to have workload-specific operating systems. By what mechanism is the workload categorized? And by what mechanism is the system dynamically (re)configured?
Try it. With a memory limit,
podman run --rm -it --memory=1G fedora bash -c 'dnf install -y stress-ng && stress-ng --malloc 100 --memcpy 100 --mmap 100 --vm 100'
When I ask the question "can it be done", I'm asking to have my cake and eat it too. I'm not asking for an example of running something that doesn't implode the system; I know about that. I want to compile something, and have the system figure out the resources it can give to that task, without killing it, and without impacting the responsiveness of my computer.
On Tue, Jan 7, 2020 at 8:55 AM Chris Murphy lists@colorremedies.com wrote:
On Mon, Jan 6, 2020 at 10:28 PM Mark Otaris mark@net-c.com wrote:
For now, kernel developers have made it clear they do not care about user space responsiveness. At all. Their concern with kernel oom-killer is strictly with keeping the kernel functioning.
This is false. The stated purpose of the OOM killer is not only to keep the kernel alive.
https://lore.kernel.org/linux-fsdevel/20200104090955.GF23195@dread.disaster....
Sorry, that's a long email and set of threads. As it relates to the above, the phrase to search for is: "This is indeed the case."
I intended to demonstrate that cgroups can be used to cause the kernel OOM killer to react appropriately and fast enough, implying that replacing the OOM killer is not necessary and that replacing it by a userspace OOM killer that does not account for cgroups can be undesirable. The exact same controls set with my example commands, and others, can be set with scopes as well, so this should be applicable.
https://lore.kernel.org/linux-fsdevel/20200104090955.GF23195@dread.disaster....
Okay, interesting. But that’s a statement from just one person, and it has to be interpreted in the context of what it is confirming; that is, that the OOM killer is “mainly concerned about kernel survival in low memory situations”, which is weaker than your claim that “their concern with kernel oom-killer is strictly with keeping the kernel functioning”. I don’t know if the OOM killer’s main purpose is to keep the kernel alive (Michal Hocko appears to think so, maybe others disagree), but it is in any case not an abuse of the OOM killer to also use it to keep userspace responsive, and there is no reason to think that kernel folks are not interested in helping achieve this goal. The only advantage I see to earlyoom so far is that it sends SIGTERM before taking further steps that will kill processes.
On Tue, Jan 7, 2020 at 1:48 PM Mark Otaris mark@net-c.com wrote:
I intended to demonstrate that cgroups can be used to cause the kernel OOM killer to react appropriately and fast enough, implying that replacing the OOM killer is not necessary and that replacing it by a userspace OOM killer that does not account for cgroups can be undesirable. The exact same controls set with my example commands, and others, can be set with scopes as well, so this should be applicable.
https://lore.kernel.org/linux-fsdevel/20200104090955.GF23195@dread.disaster....
Okay, interesting. But that’s a statement from just one person, and it has to be interpreted in the context of what it is confirming; that is, that the OOM killer is “mainly concerned about kernel survival in low memory situations”, which is weaker than your claim that “their concern with kernel oom-killer is strictly with keeping the kernel functioning”. I don’t know if the OOM killer’s main purpose is to keep the kernel alive (Michal Hocko appears to think so, maybe others disagree), but it is in any case not an abuse of the OOM killer to also use it to keep userspace responsive,
The oom killer doesn't keep user space responsive per se; in your example that's done by cgroups restricting resources. And that's neat, and necessary to keep making forward progress on. But we don't have that for unprivileged processes right now, unless the user knows the secret decoder ring command to use every time they run something in Terminal; and then they have to have some idea what resources the task needs to succeed, rather than just get clobbered anyway.
That's maybe the elephant in the room with earlyoom (or one of them), yes we've recovered sooner, the user can hopefully save their data and reboot. But did their task succeed? No. It got clobbered.
and there is no reason to think that kernel folks are not interested in helping achieve this goal.
I did mean with a kernel only solution. I've been tracking this issue for 6-7 months including the congestion and kswapd discussions on-going, so I know they do care broadly about providing some mechanisms by which user space can better behave. But all of that requires varying degrees of opt-in, and quite a lot of it involves considerable work to even understand it, let alone implement it.
The only advantage I see to earlyoom so far is that it sends SIGTERM before taking further steps that will kill processes.
Yes, and it happens sooner. Probably not soon enough for many users. There may be some risk of overpromising and underdelivering: making it the default when, for the vast majority of cases, it doesn't matter, because users are long since conditioned to just force power off within a minute or less of the GUI stuttering or freezing up on them. It is very workload- and system-specific.
On Mon, Jan 6, 2020 at 11:07 am, Lennart Poettering mzerqung@0pointer.de wrote:
Hmm, are we sure this is something we want to have in the default install? Is the code really good enough for that?
Looking at the sources very superficially I see a couple of problems:
- Waking up all the time in 100ms intervals? We generally try to avoid waking the CPU up all the time if nothing happens. Saving power and things.
Is there a way to check memory usage without periodic wakeups?
In WebKit we wake up every 5s to check memory usage if we saw low memory usage on the last wakeup, or every 1s if it was high, with a scale in between. It would be good to experiment with the timings and see how large the polling interval can get before it becomes too large to prevent system lockups. (The WebKit timings are designed for cache clearing, not for maintaining system responsiveness, so I wouldn't trust those.)
New code using system() in the year 2020? Really?
Fixed-size buffers and implicit, undetected truncation of strings in various places (for example, when formatting the shell string to pass to system()).
Thanks. The code review is much appreciated. If we're going to be running a superuser daemon, then we need to be confident that it doesn't do these dangerous things. And these choices do raise quality concerns about what might be lurking in the rest of the code as well.
But more importantly: are we sure this actually operates the way we should? i.e. PSI is really what should be watched. It is not interesting who uses how much memory and triggering kills on that. What matters is to detect when the system becomes slow due to that, i.e. *latencies* introduced due to memory pressure and that's what PSI is about, and hence what should be used.
My understanding is that experiments with PSI have indicated that it's hard to make it work well in practice. Alexey (hakavlad) has investigated this topic extensively, and his conclusion was:
"PSI-based process killing should not be used by default, because this topic is still poorly understood and we don’t know what thresholds are desirable for most users: it’s hard to find good default values."
https://pagure.io/fedora-workstation/issue/98#comment-615425
Details at: https://github.com/rfjakob/earlyoom/issues/100
So: already considered, but rejected for now.
But even if we'd ignore that (to fight latencies, one should watch latencies): OOM killing per process is just not appropriate on a systemd system: all our system services (and a good chunk of our user services too) are sorted neatly into cgroups, and we really should kill them as a whole and not just individual processes inside them. systemd manages that today, and makes exceptions configurable via OOMPolicy=, and with your earlyoom stuff you break that.
This looks like second-guessing the kernel memory management folks at a place where one can only lose, and at the same time breaking correct OOM reporting by the kernel via cgroups and stuff.
I think it's very clear at this point that this is extremely unlikely to be fixed at the kernel level. If that changes, great, but in the meantime we need a userspace solution to prevent Fedora from locking up. The Workstation WG doesn't have much (any?) kernel development experience, and we're aware that historical discussions on fixing this issue at the kernel level have concluded negatively, so we're limiting our interest to userspace solutions.
I think everybody would be happy to hold off on userspace solutions if a kernel solution is in the works. I'd love to see kernel devs acknowledge the issue, using the same test cases that we're using (either 'ninja build' WebKit or simply 'tail /dev/zero'), and propose a real concrete solution. But I'm not going to hold my breath for that. My understanding is that previous discussions have concluded that the kernel OOM is designed to ensure enough memory remains available to the kernel, and that userspace is responsible for determining how to keep userspace responsive.
Also: what precisely is this even supposed to do? Replace the algorithm for detecting *when* to go on a kill rampage? Or actually replace the algorithm selecting *what* to kill during a kill rampage?
earlyoom is restricted to the former, although in the future we might be interested in doing the latter as well, either by enhancing earlyoom or switching to another tool, e.g. to prevent sshd or journald from being killed.
If it's the former (which the name of the project suggests, _early_oom), then at the most basic the tool should let the kernel do the killing, i.e. "echo f > /proc/sysrq-trigger". That way the reporting via cgroups isn't fucked, and systemd can still do its thing, and the kernel can kill per cgroup rather than per process...
Problem is that letting the kernel do the work can cause data loss. earlyoom needs to handle process termination itself so that it can send SIGTERM first, instead of jumping straight to SIGKILL and corrupting who knows what.
Michael
On Mo, 06.01.20 10:09, Michael Catanzaro (mcatanzaro@gnome.org) wrote:
Is there a way to check memory usage without periodic wakeups?
PSI. It measures latency though. Which is the right thing to measure here... You can configure thresholds there and it wakes you up when those are hit. Thus userspace doesn't have to poll at all...
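A minimal sketch of both modes, assuming a kernel built with CONFIG_PSI (4.20 or later); the trigger threshold values in the comment are examples, not recommendations:

```shell
# Read current memory pressure from PSI. On the "some" line, avg10 is the
# share of time over the last 10 seconds that at least one task was stalled
# waiting for memory.
if [ -r /proc/pressure/memory ]; then
    psi_line=$(head -n 1 /proc/pressure/memory)
else
    psi_line="PSI not available on this kernel"
fi
echo "$psi_line"
# Instead of polling, a privileged monitor can register a trigger and block
# in poll(2) until it fires, e.g. "wake me when tasks have been stalled for
# 150ms total within any 1s window":
#   echo "some 150000 1000000" > /proc/pressure/memory
```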
In WebKit we wake up every 5s to check memory usage if we saw low memory usage on the last wakeup, or every 1s if it was high, with a scale in between. It would be good to experiment with the timings and see how large the polling interval can get before it becomes too large to prevent system lockups. (The WebKit timings are designed for cache clearing, not for maintaining system responsiveness, so I wouldn't trust those.)
Watch things with PSI.
New code using system() in the year 2020? Really?
Fixed-size buffers and implicit, undetected truncation of strings in various places (for example, when formatting the shell string to pass to system()).
Thanks. The code review is much appreciated. If we're going to be running a superuser daemon, then we need to be confident that it doesn't do these dangerous things. And these choices do raise quality concerns about what might be lurking in the rest of the code as well.
BTW, this should not be a root daemon anyway. It only needs one cap: CAP_SYS_KILL. Hence, drop privs to some user of its own, and keep that one cap. Use AmbientCapabilities= in the unit file.
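As a sketch of that suggestion (the directive names are real systemd options, while the exact capability set is an assumption; note that the signal-sending capability is spelled CAP_KILL in capabilities(7), not CAP_SYS_KILL):

```ini
# Hypothetical drop-in: /etc/systemd/system/earlyoom.service.d/sandbox.conf
[Service]
# Run under a private, dynamically allocated user instead of root...
DynamicUser=yes
# ...keeping only the capability needed to signal other users' processes.
CapabilityBoundingSet=CAP_KILL
AmbientCapabilities=CAP_KILL
```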
My understanding is that experiments with PSI have indicated that it's hard to make it work well in practice. Alexey (hakavlad) has investigated this topic extensively, and his conclusion was:
"PSI-based process killing should not be used by default, because this topic is still poorly understood and we don’t know what thresholds are desirable for most users: it’s hard to find good default values."
If things are poorly understood, then understand them better... Don't just adopt some stuff that isn't much better understood either...
But even if we'd ignore that (to fight latencies, one should watch latencies): OOM killing per process is just not appropriate on a systemd system: all our system services (and a good chunk of our user services too) are sorted neatly into cgroups, and we really should kill them as a whole and not just individual processes inside them. systemd manages that today, and makes exceptions configurable via OOMPolicy=, and with your earlyoom stuff you break that.
This looks like second-guessing the kernel memory management folks at a place where one can only lose, and at the same time breaking correct OOM reporting by the kernel via cgroups and stuff.
I think it's very clear at this point that this is extremely unlikely to be fixed at the kernel level. If that changes, great, but in the meantime we need a userspace solution to prevent Fedora from locking up. The Workstation WG doesn't have much (any?) kernel development experience, and we're aware that historical discussions on fixing this issue at the kernel level have concluded negatively, so we're limiting our interest to userspace solutions.
Well, it's not that the kernel folks wouldn't provide you with some tools to improve the situation, see PSI...
Also: what precisely is this even supposed to do? Replace the algorithm for detecting *when* to go on a kill rampage? Or actually replace the algorithm selecting *what* to kill during a kill rampage?
earlyoom is restricted to the former, although in the future we might be interested in doing the latter as well, either by enhancing earlyoom or switching to another tool, e.g. to prevent sshd or journald from being killed.
These services should set OOMScoreAdjust= to something sensible. journald and udevd do that; maybe sshd should too... (it's a bit harder for sshd, since it needs to undo the setting for its sessions again, as OOM scores are propagated down the process tree)
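As an illustration, a drop-in like the following would protect sshd; OOMScoreAdjust= is a real systemd directive, but the value is an arbitrary example and, as noted, sessions spawned by sshd would inherit it unless they reset it:

```ini
# Hypothetical drop-in: /etc/systemd/system/sshd.service.d/oomscore.conf
[Service]
# Negative values make the OOM killer less likely to pick this process.
OOMScoreAdjust=-900
```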
If it's the former (which the name of the project suggests, _early_oom), then at the most basic the tool should let the kernel do the killing, i.e. "echo f > /proc/sysrq-trigger". That way the reporting via cgroups isn't fucked, and systemd can still do its thing, and the kernel can kill per cgroup rather than per process...
Problem is that letting the kernel do the work can cause data loss. earlyoom needs to handle process termination itself so that it can send SIGTERM first, instead of jumping straight to SIGKILL and corrupting who knows what.
Well, then tell systemd to do it for you... Use the D-Bus call GetUnitByPID() and then issue StopUnit().
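Those two calls can also be exercised from a shell via busctl. The bus name, object path, and method signatures below are systemd's documented D-Bus API; the PID and the commented-out unit name are made-up examples:

```shell
# Resolve the unit that owns a PID, then ask systemd to stop that unit as a
# whole (systemd sends SIGTERM first and escalates per the unit's settings).
pid=1234   # illustrative PID
if command -v busctl >/dev/null 2>&1; then
    busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 \
        org.freedesktop.systemd1.Manager GetUnitByPID u "$pid" \
        || echo "no unit found for PID $pid (or no system bus)"
    # With the returned unit name:
    #   busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 \
    #       org.freedesktop.systemd1.Manager StopUnit ss "<unit>" "replace"
else
    echo "busctl not available on this system"
fi
```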
Lennart
-- Lennart Poettering, Berlin
On Fri, Jan 3, 2020 at 8:20 PM Ben Cotton bcotton@redhat.com wrote:
https://fedoraproject.org/wiki/Changes/EnableEarlyoom
== Summary == Install earlyoom package, and enable it by default. This will cause the kernel oomkiller to trigger sooner, but will not affect which process it chooses to kill off. The idea is to recover from out of memory situations sooner, rather than the typical complete system hang in which the user has no other choice but to force power off.
I've read the whole thread (phew!) and I support the proposal. The user experience is improved and I don't see any substantial disadvantages (power management etc. can hopefully be fine-tuned). Of course the code should be well inspected by someone knowledgeable, if it's going to run with high privileges. And if there are serious candidates with a better approach (e.g. something from systemd), it might make sense to delay this and wait a while. OTOH, if verifying the code and setting it up is not that much work, those candidates can *replace* earlyoom in the future, and no delay is necessary. Overall +1 from me.
Yesterday I updated my Rawhide and wondered why `dnf autoremove` would want to remove earlyoom, only to discover that the soft dependency in earlyoom was dropped [1]; hence nothing requires earlyoom and DNF is free to remove the package (and it is possibly no longer installed on upgraded systems).
Therefore I wonder what the status of EarlyOOM is. Should I let the package go? If not, then the situation should be fixed somehow, probably either by reverting the revert or by adding the dependency to fedora-release, as was proposed elsewhere.
Vít
[1] https://src.fedoraproject.org/rpms/earlyoom/c/a6d0f45a3524830642a4120704e8d2...
Dne 03. 01. 20 v 20:18 Ben Cotton napsal(a):
https://fedoraproject.org/wiki/Changes/EnableEarlyoom
== Summary == Install earlyoom package, and enable it by default. This will cause the kernel oomkiller to trigger sooner, but will not affect which process it chooses to kill off. The idea is to recover from out of memory situations sooner, rather than the typical complete system hang in which the user has no other choice but to force power off.
== Owner ==
- Name: [[User:chrismurphy| Chris Murphy]]
- Email: bugzilla@colorremedies.com
== Detailed Description == The Workstation working group has discussed "better interactivity in low-memory situations" for some months. In certain use cases, typically compiling, if all RAM and swap are completely consumed, system responsiveness becomes so abysmal that a reasonable user can consider the system "lost" and resorts to forcing a power off. This is objectively a very bad UX. The broad discussion of this problem, and some ideas for near-term and long-term solutions, is located here:
Recent long discussions on "Better interactivity in low-memory situations"<br> https://pagure.io/fedora-workstation/issue/98<br> https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/...<br>
Fedora editions and spins have the in-kernel OOM (out-of-memory) manager enabled. The manager's concern is keeping the kernel itself functioning; it has no concern for user space function or interactivity. This proposed change attempts to improve the user experience, in the short term, by triggering the in-kernel process-killing mechanism sooner. Instead of the system becoming completely unresponsive for tens of minutes, hours, or days, the expectation is that an offending process (determined by oom_score, same as now) will be killed off within seconds or a few minutes. This is an incremental improvement in user experience, but admittedly still suboptimal. There is additional ongoing work to improve the user experience further.
Workstation working group discussion specific to enabling earlyoom by default https://pagure.io/fedora-workstation/issue/119
Other in-progress solutions:<br> https://gitlab.freedesktop.org/hadess/low-memory-monitor<br>
Background information on this complicated problem:<br> https://www.kernel.org/doc/gorman/html/understand/understand016.html<br> https://lwn.net/Articles/317814/<br>
== Benefit to Fedora ==
There are two major benefits to Fedora:
- improved user experience by more quickly regaining control over one's system, rather than having to force power off in low-memory situations with aggressive swapping. Once a system becomes unresponsive, it's completely reasonable for the user to assume the system is lost, and forcing power off carries a high potential for data loss.
- reducing forced poweroff as the main workaround will increase data collection, improving understanding of low-memory situations and how to handle them better
== Scope ==
- Proposal owners:
a. Modify {{code|https://pagure.io/fedora-comps/blob/master/f/comps-f32.xml.in}} to include the earlyoom package for Workstation.<br> b. Modify {{code|https://src.fedoraproject.org/rpms/fedora-release/blob/master/f/80-workstati... to include:
<pre>
# enable earlyoom by default on workstation
enable earlyoom.service
</pre>
- Other developers:
Restricted to Workstation edition, unless other editions/spins want to opt-in.
- Release engineering: [https://pagure.io/releng/issues #9141] (an impact check with Release Engineering is needed)
- Policies and guidelines: N/A
- Trademark approval: N/A
== Upgrade/compatibility impact == earlyoom.service will be enabled on upgrade. An upgraded system should exhibit the same behaviors as a clean installed system.
== How To Test ==
- Fedora 30/31 users can test today, any edition or spin:<br>
{{code|sudo dnf install earlyoom}}<br> {{code|sudo systemctl enable --now earlyoom}}
And then attempt to cause an out-of-memory situation. Examples:<br> {{code|tail /dev/zero}}<br> {{code|https://lkml.org/lkml/2019/8/4/15}}
- Fedora Workstation 32 (and Rawhide) users will see this service is
already enabled. It can be toggled with {{code|sudo systemctl start/stop earlyoom}} where start means earlyoom is running, and stop means earlyoom is not running.
== User Experience == The most egregious instances this change is trying to mitigate:
a. RAM is completely used
b. Swap is completely used
c. System becomes unresponsive to the user as swap thrashing has ensued
--> with earlyoom disabled, the user often gives up and forces power off (in my own testing this condition lasts >30 minutes with no kernel-triggered oom killer and no recovery)
--> with earlyoom enabled, the system likely still becomes unresponsive, but the oom killer is triggered in much less time (seconds or a few minutes in my testing, once less than 10% of RAM and 10% of swap remain)
earlyoom starts sending SIGTERM once both memory and swap are below their respective PERCENT setting, default 10%. It sends SIGKILL once both are below their respective KILL_PERCENT setting, default 5%.
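The rule as described can be sketched against /proc/meminfo. This is a simplification for illustration (the real daemon polls continuously and picks a victim by oom_score; the thresholds are the stated defaults):

```shell
# Compute the same percentages earlyoom compares against PERCENT (10) and
# KILL_PERCENT (5): available memory and free swap, relative to their totals.
mem_total=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
mem_avail=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
swap_total=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
swap_free=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)
mem_pct=$(( 100 * mem_avail / mem_total ))
# With no swap configured, treat swap as exhausted so memory alone decides.
if [ "$swap_total" -gt 0 ]; then
    swap_pct=$(( 100 * swap_free / swap_total ))
else
    swap_pct=0
fi
if [ "$mem_pct" -le 5 ] && [ "$swap_pct" -le 5 ]; then
    action=SIGKILL
elif [ "$mem_pct" -le 10 ] && [ "$swap_pct" -le 10 ]; then
    action=SIGTERM
else
    action=none
fi
echo "memory ${mem_pct}% available, swap ${swap_pct}% free: would send ${action}"
```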
The package includes the configuration file /etc/default/earlyoom, which sets the option {{code|-r 60}}, causing a memory report to be written to the journal every minute.
== Dependencies == The earlyoom package has no dependencies.
== Contingency Plan ==
- Contingency mechanism: Owner will revert all changes
- Contingency deadline: Final freeze
- Blocks release? No
- Blocks product? No
== Documentation == {{code|man earlyoom}}<br><br> https://www.kernel.org/doc/gorman/html/understand/understand016.html
== Release Notes == The earlyoom service is enabled by default, which will cause the kernel oom-killer to trigger sooner. To revert to the previous behavior:<br> {{code|sudo systemctl disable earlyoom.service}}
And to customize see {{code|man earlyoom}}.
On Tue, Aug 4, 2020 at 10:45 am, Vít Ondruch vondruch@redhat.com wrote:
Yesterday I updated my Rawhide and wondered why `dnf autoremove` would want to remove earlyoom, only to discover that the soft dependency in earlyoom was dropped [1]; hence nothing requires earlyoom and DNF is free to remove the package (and it is possibly no longer installed on upgraded systems).
Therefore I wonder what the status of EarlyOOM is. Should I let the package go? If not, then the situation should be fixed somehow, probably either by reverting the revert or by adding the dependency to fedora-release, as was proposed elsewhere.
We're tracking this problem in https://pagure.io/fedora-workstation/issue/138 and https://bugzilla.redhat.com/show_bug.cgi?id=1814306. It's high priority for Workstation, but it's blocked on dnf. We've been working around it in an ad-hoc way, differently for each package and in every release. In this case, I removed our original workaround in https://src.fedoraproject.org/rpms/earlyoom/pull-request/2 because we intended to replace it with a new workaround, https://src.fedoraproject.org/fork/catanzaro/rpms/fedora-release/c/a0df346ba.... However, we decided the new workaround was a little outrageous and we would just wait for a dnf fix instead. In the meantime, if you want to keep earlyoom, don't use autoremove. earlyoom will still get pulled in on upgrades to F32 due to the old workaround, but it's not currently being pulled in on upgrades to F33.
Michael
Dne 04. 08. 20 v 16:05 Vitaly Zaitsev via devel napsal(a):
On 04.08.2020 15:48, Michael Catanzaro wrote:
In the meantime, if you want to keep earlyoom, don't use autoremove.
sudo dnf mark install earlyoom
I think "don't use autoremove" is the better suggestion ATM, because I don't really want to keep earlyoom on the system in case there is systemd-oomd or whatever the successor should be.
Vít
On Tue, Aug 4, 2020 at 8:46 AM Vít Ondruch vondruch@redhat.com wrote:
Dne 04. 08. 20 v 16:05 Vitaly Zaitsev via devel napsal(a):
On 04.08.2020 15:48, Michael Catanzaro wrote:
In the meantime, if you want to keep earlyoom, don't use autoremove.
sudo dnf mark install earlyoom
I think "don't use autoremove" is the better suggestion ATM, because I don't really want to keep earlyoom on the system in case there is systemd-oomd or whatever the successor should be.
systemd-oomd is coming along, but it'll most likely be Fedora 34, possibly Fedora 35.
Hopefully there will be a way to do some kind of "rebase" where recommended things are favored on upgrades (to a new release version) without having to obsolete them, and without applying this to every update within a given release.
On 04.08.2020 16:45, Vít Ondruch wrote:
I think "don't use autoremove" is the better suggestion ATM, because I don't really want to keep earlyoom on the system in case there is systemd-oomd or whatever the successor should be.
You can always easily swap one package to another:
sudo dnf swap earlyoom systemd-oomd --allowerasing
Dne 04. 08. 20 v 20:58 Vitaly Zaitsev via devel napsal(a):
On 04.08.2020 16:45, Vít Ondruch wrote:
I think "don't use autoremove" is the better suggestion ATM, because I don't really want to keep earlyoom on the system in case there is systemd-oomd or whatever the successor should be.
You can always easily swap one package to another:
sudo dnf swap earlyoom systemd-oomd --allowerasing
I know I can swap packages and what not, but primarily I want to keep my system in "default" state, mostly following the changes Fedora contributors are proposing. So if the proposal is to have earlyoom installed by default, then it is surprising it might not be installed. This situation should be fixed generally without me changing anything. And that is the reason I bumped this thread.
Vít
On Tue, Aug 4, 2020 at 7:49 AM Michael Catanzaro mcatanzaro@gnome.org wrote:
In the meantime, it will get pulled in on upgrades to F32 due to the old workaround, but it's not currently being pulled in on upgrades to F33.
Should we go back to the old workaround for F33? Madness for one more release? And then drop the madness once there's a dnf solution?
By the way... just in case it matters, in https://src.fedoraproject.org/fork/catanzaro/rpms/fedora-release/c/a0df346ba... line 562 should be `zram-generator-defaults`; `zram` will be obsoleted by `zram-generator-defaults` in F33.
On Tue, Aug 4, 2020 at 10:31 am, Chris Murphy lists@colorremedies.com wrote:
Should we go back to the old workaround for F33? Madness for one more release? And then drop the madness once there's a dnf solution?
We could, but we have installed so many other things that it's becoming quite hard to keep track of them all, and if we're going to have a workaround for any one package I would recommend we use the same workaround for them all. And that's the merge request I have above. And for that to work, we would need to require that anyone touching comps also add a corresponding Recommends: in fedora-release. That would be unfortunate.
I'd rather have a proper dnf fix in place for F33.
Dne 04. 08. 20 v 21:38 Michael Catanzaro napsal(a):
On Tue, Aug 4, 2020 at 10:31 am, Chris Murphy lists@colorremedies.com wrote:
Should we go back to the old workaround for F33? Madness for one more release? And then drop the madness once there's a dnf solution?
We could, but we have installed so many other things that it's becoming quite hard to keep track of them all, and if we're going to have a workaround for any one package I would recommend we use the same workaround for them all. And that's the merge request I have above. And for that to work, we would need to require that anyone touching comps also add a corresponding Recommends: in fedora-release. That would be unfortunate.
Wouldn't it be better to replace this part of comps with soft dependencies? I don't quite understand why we haven't dropped comps (at least for the use case of installing a basic OS) now that we have soft dependencies in RPM.
Admittedly, the soft dependencies would be installed repeatedly compared to comps, but now you are asking DNF to actually install the content of comps repeatedly. So there won't be a difference in the end.
Vít
I'd rather have a proper dnf fix in place for F33.
devel mailing list -- devel@lists.fedoraproject.org To unsubscribe send an email to devel-leave@lists.fedoraproject.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
On Tuesday, August 4, 2020 1:45:52 AM MST Vít Ondruch wrote:
Yesterday, I have updated my Rawhide and wondered why `dnf autoremove` would want to remove earlyoom just to discover that soft dependency in earlyoom was dropped [1] and hence nothing requires earlyoom and DNF is free to remove this package (and it is possibly not installed anymore on upgraded systems).
Therefore I wonder what is the status of EarlyOOM. Should I let the package go? If not, then the situation should be fixed somehow, probably either by reverting the revert or adding the dependency into fedora-release as was proposed elsewhere.
Generally, if you let the package go, your system won't suffer from your processes getting killed needlessly. This is likely a benefit, so I don't know if this is really a bug.