Hi server@ and cloud@ folks,
There is a system-wide change to enable earlyoom by default on Fedora Workstation. It came up in today's Workstation working group meeting that I should give you folks a heads up about opting into this change.
Proposal: https://fedoraproject.org/wiki/Changes/EnableEarlyoom
Devel@ discussion: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/...
The main issue on a workstation, heavy swap leading to an unresponsive system, is perhaps not as immediately frustrating on a server. But the consequences, an indefinite hang or the kernel oom-killer triggering (which is a SIGKILL), are perhaps worse.
On the plus side, earlyoom is easy to understand, and its first attempt is a SIGTERM rather than SIGKILL. It uses oom_score, same as kernel oom-killer, to determine the victim.
The SIGTERM is issued to the process with the highest oom_score only if both memory and swap reach 10% free. And SIGKILL is issued to the process with the highest oom_score once memory and swap reach 5% free. Those percentages can be tweaked, but the KILL percentage is always 1/2 of the TERM percentage, so it's a bit rudimentary.
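For reference, those thresholds map to earlyoom's -m/-s flags, which the Fedora package reads from /etc/default/earlyoom (flag meanings per the earlyoom man page; the exact values here are illustrative):

```ini
# /etc/default/earlyoom -- sketch, illustrative values
# -m 10 : act when available memory drops below 10% of total
# -s 10 : ...and free swap drops below 10%
#         (SIGTERM at these values; SIGKILL at half of them)
# -r 60 : log a memory report every 60 seconds
EARLYOOM_ARGS="-r 60 -m 10 -s 10"
```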
One small concern I have is, what if there's no swap? That's probably uncommon for servers, but I'm not sure about cloud. But in this case, SIGTERM happens at 10% of RAM, which leaves a lot of memory on the table, and for a server with significant resources it's probably too high. What about 4%? Maybe still too high? One option I'm thinking of is a systemd conditional that would not run earlyoom on systems without a swap device, which would leave these systems no worse off than they are right now. [i.e. they eventually recover (?), indefinitely hang (likely), or oom-killer finally kills something (less likely).]
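As far as I know systemd has no built-in "has swap" condition, but a drop-in using ExecCondition (available since systemd 243) could approximate that conditional; a sketch, with the drop-in filename hypothetical:

```ini
# /etc/systemd/system/earlyoom.service.d/only-with-swap.conf -- sketch
# /proc/swaps always has a header line; any second line means an
# active swap device. A non-zero exit from ExecCondition skips the
# service without marking it failed.
[Service]
ExecCondition=/bin/sh -c 'tail -n +2 /proc/swaps | grep -q .'
```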
I've been testing earlyoom, nohang, and the kernel oom-killer for > 6 months now, and I think it would be completely sane for Server and Cloud products to enable earlyoom by default for fc32, while evaluating other solutions that can be more server oriented (e.g. nohang, oomd, possibly others) for fc33/fc34. What is clear: this isn't going to be solved by the kernel folks; the kernel oom-killer only cares about keeping the kernel alive, not about user space at all.
In the cases where this becomes a problem, either the kernel hangs indefinitely or does SIGKILL for your database or whatever is eating up resources. Whereas at least earlyoom's first attempt is a SIGTERM so it has a chance of gracefully quitting.
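To make the mechanism concrete, here's a rough Python sketch of the selection step both killers share: pick the process with the highest oom_score, then try SIGTERM before escalating to SIGKILL. This is a simplification; real earlyoom also honors oom_score_adj, prefer/avoid patterns, and polls continuously.

```python
import os
import signal
import time

def pick_victim():
    """Return (pid, oom_score) for the process with the highest
    oom_score, or None if /proc isn't available (e.g. not Linux)."""
    best = None
    if not os.path.isdir("/proc"):
        return None
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/oom_score" % entry) as f:
                score = int(f.read())
        except (OSError, ValueError):
            continue  # process exited, or file unreadable
        if best is None or score > best[1]:
            best = (int(entry), score)
    return best

def reap(grace=10):
    """SIGTERM first; escalate to SIGKILL if the victim survives."""
    victim = pick_victim()
    if victim is None:
        return
    pid, _ = victim
    os.kill(pid, signal.SIGTERM)      # give it a chance to quit cleanly
    time.sleep(grace)
    try:
        os.kill(pid, signal.SIGKILL)  # the hammer, if it's still alive
    except ProcessLookupError:
        pass                          # it exited gracefully
```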
There are some concerns, those are in the devel@ thread, and I expect they'll be adequately addressed or the feature will not pass the FESCo vote. But as a short term solution while evaluating more sophisticated solutions, I think this is a good call so I thought I'd just mention it, in case you folks want to be included in the change.
On 1/6/20 1:18 PM, Chris Murphy wrote:
> Hi server@ and cloud@ folks,
> There is a system-wide change to enable earlyoom by default on Fedora Workstation. It came up in today's Workstation working group meeting that I should give you folks a heads up about opting into this change.
Thanks for the heads up!
> Proposal https://fedoraproject.org/wiki/Changes/EnableEarlyoom Devel@ discussion https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/...
> The main issue on a workstation, heavy swap leading to an unresponsive system, is perhaps not as immediately frustrating on a server. But the consequences of indefinite hang or the kernel oom-killer triggering, which is a SIGKILL, are perhaps worse.
> On the plus side, earlyoom is easy to understand, and its first attempt is a SIGTERM rather than SIGKILL. It uses oom_score, same as kernel oom-killer, to determine the victim.
> The SIGTERM is issued to the process with the highest oom_score only if both memory and swap reach 10% free. And SIGKILL is issued to the process with the highest oom_score once memory and swap reach 5% free. Those percentages can be tweaked, but the KILL percentage is always 1/2 of the TERM percentage, so it's a bit rudimentary.
Yeah. Adding more ways to relate SIGTERM to SIGKILL (other than the fixed 1/2) would be nice.
> One small concern I have is, what if there's no swap? That's probably uncommon for servers, but I'm not sure about cloud. But in this case,
For cloud at least it's very common to not have swap. I'd argue for servers you don't want them swapping either but resources aren't quite as elastic as in the cloud so you might not be able to burst resources like you can in the cloud.
> SIGTERM happens at 10% of RAM, which leaves a lot of memory on the table, and for a server with significant resources it's probably too high. What about 4%? Maybe still too high? One option I'm thinking of is a systemd conditional that would not run earlyoom on systems without a swap device, which would leave these systems no worse off than they are right now. [i.e. they eventually recover (?), indefinitely hang (likely), or oom-killer finally kills something (less likely).]
Seems like on these systems it would be nice to make earlyoom SIGTERM just before SIGKILL, i.e. try the nice way first and then bring in the hammer. In this case a 1% difference in threshold would be useful: SIGTERM at 5%, SIGKILL at 4%, or something like that.
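If earlyoom grew a separately tunable kill threshold, that could look something like this. Hypothetical: the earlyoom in question pins KILL at half of TERM, and while newer releases reportedly accept an optional second value to -m/-s, treat the syntax here as an assumption, not documentation.

```ini
# /etc/default/earlyoom -- hypothetical no-swap tuning
# first value: SIGTERM threshold, second value: SIGKILL threshold
EARLYOOM_ARGS="-m 5,4"
```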
> I've been testing earlyoom, nohang, and the kernel oom-killer for > 6 months now, and I think it would be completely sane for Server and Cloud products to enable earlyoom by default for fc32, while evaluating other solutions that can be more server oriented (e.g. nohang, oomd, possibly others) for fc33/fc34. What is clear: this isn't going to be solved by kernel folks, the kernel oom-killer only cares about keeping the kernel alive, it doesn't care about user space at all.
> In the cases where this becomes a problem, either the kernel hangs indefinitely or does SIGKILL for your database or whatever is eating up resources. Whereas at least earlyoom's first attempt is a SIGTERM so it has a chance of gracefully quitting.
> There are some concerns, those are in the devel@ thread, and I expect they'll be adequately addressed or the feature will not pass the FESCo vote. But as a short term solution while evaluating more sophisticated solutions, I think this is a good call so I thought I'd just mention it, in case you folks want to be included in the change.
Thanks!
On Mon, Jan 6, 2020 at 7:56 PM Dusty Mabe dusty@dustymabe.com wrote:
> For cloud at least it's very common to not have swap. I'd argue for servers you don't want them swapping either but resources aren't quite as elastic as in the cloud so you might not be able to burst resources like you can in the cloud.
There's also discussion about making oomd a universal solution for this; but I came across this issue asserting PSI (kernel pressure stall information) does not work well without swap. https://github.com/facebookincubator/oomd/issues/80
Ignoring whether+what+when a workaround may be found for that, what do you think about always having swap-on-ZRAM enabled in these same environments? The idea there is a configurable size /dev/zram block device (basically a compressible RAM disk) on which swap is created. Based on discussions with anaconda, IoT, Workstation, and systemd folks - I think there's a potential to converge on systemd-zram generator (rust) to do this. https://github.com/systemd/zram-generator
Workstation wg is mulling over the idea of dropping separate swap partitions entirely, and using a swap-on-ZRAM device instead; possibly with a dynamically created swapfile for certain use cases like hibernation. So I'm curious if this might have broader appeal, and get systemd-zram generator production ready.
On 1/8/20 5:21 PM, Chris Murphy wrote:
> On Mon, Jan 6, 2020 at 7:56 PM Dusty Mabe dusty@dustymabe.com wrote:
>> For cloud at least it's very common to not have swap. I'd argue for servers you don't want them swapping either but resources aren't quite as elastic as in the cloud so you might not be able to burst resources like you can in the cloud.
> There's also discussion about making oomd a universal solution for this; but I came across this issue asserting PSI (kernel pressure stall information) does not work well without swap. https://github.com/facebookincubator/oomd/issues/80
> Ignoring whether+what+when a workaround may be found for that, what do you think about always having swap-on-ZRAM enabled in these same environments? The idea there is a configurable size /dev/zram block device (basically a compressible RAM disk) on which swap is created. Based on discussions with anaconda, IoT, Workstation, and systemd folks - I think there's a potential to converge on systemd-zram generator (rust) to do this. https://github.com/systemd/zram-generator
> Workstation wg is mulling over the idea of dropping separate swap partitions entirely, and using a swap-on-ZRAM device instead; possibly with a dynamically created swapfile for certain use cases like hibernation. So I'm curious if this might have broader appeal, and get systemd-zram generator production ready.
Seems like an interesting concept. Since it doesn't require any disk setup, it's easy to turn off or configure, I assume.
+1
Dusty
On Mon, Jan 13, 2020 at 10:51 AM Dusty Mabe dusty@dustymabe.com wrote:
> On 1/8/20 5:21 PM, Chris Murphy wrote:
>> On Mon, Jan 6, 2020 at 7:56 PM Dusty Mabe dusty@dustymabe.com wrote:
>>> For cloud at least it's very common to not have swap. I'd argue for servers you don't want them swapping either but resources aren't quite as elastic as in the cloud so you might not be able to burst resources like you can in the cloud.
>> There's also discussion about making oomd a universal solution for this; but I came across this issue asserting PSI (kernel pressure stall information) does not work well without swap. https://github.com/facebookincubator/oomd/issues/80
>> Ignoring whether+what+when a workaround may be found for that, what do you think about always having swap-on-ZRAM enabled in these same environments? The idea there is a configurable size /dev/zram block device (basically a compressible RAM disk) on which swap is created. Based on discussions with anaconda, IoT, Workstation, and systemd folks - I think there's a potential to converge on systemd-zram generator (rust) to do this. https://github.com/systemd/zram-generator
>> Workstation wg is mulling over the idea of dropping separate swap partitions entirely, and using a swap-on-ZRAM device instead; possibly with a dynamically created swapfile for certain use cases like hibernation. So I'm curious if this might have broader appeal, and get systemd-zram generator production ready.
> Seems like an interesting concept. Since it doesn't require any disk setup it's easy to turn it off or configure it I assume.
> +1
Yes. My suggestion is to install this generator distribution wide. The on/off switch is the existence of a configuration file. If there's no config, the generator is a no-op. And it won't run in containers regardless.
Next, the discussion is whether the distribution default is with config, or without config. Either way it's overridable.
I think a reasonable universal default would be something like a zram:RAM ratio of 1:2 or 1:1. And cap it to somewhere around 2-4G.
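Expressed as a zram-generator config, that default might look like the sketch below (key name as in the current zram-generator README; the generator is still young, so treat the exact key and expression syntax as subject to change):

```ini
# /etc/systemd/zram-generator.conf -- sketch of the proposed default
# zram-size is evaluated against total RAM in MiB:
# half of RAM, capped at 4 GiB.
[zram0]
zram-size = min(ram / 2, 4096)
```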
The rationale:
- Fedora IoT folks have used swap on zram by default out of the box (via the zram package, not this zram-generator) for a long time, maybe since the beginning.
- Upstream zram kernel devs say it's reasonable to go up to 2:1 because compression ratios are about 2:1, but it's pointless to go above that. Therefore 1:1 is quite conservative; 0.5 (i.e. 1:2) is even more conservative but still useful.
- 1:1 is consistent with existing defaults (Anaconda, anyway)
- The cap means systems with a lot of RAM will only use it incidentally. Swap thrashing that is IO bound with traditional swap becomes CPU bound on a zram device (because of all the compression/decompression hits), so keeping it small limits how much of that can happen.
- Consider upgrade behavior, where an existing traditional swap partition is still in use: create the swap-on-zram device with a higher priority, so it's used first.
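That last point is just swap priorities. In fstab terms the effect would look like the following (device names hypothetical; higher pri wins; the generator actually creates systemd swap units rather than fstab entries, so this is only an illustration of the ordering):

```
# sketch: zram swap preferred, old partition kept as overflow
/dev/zram0  none  swap  defaults,pri=100  0 0
/dev/sda2   none  swap  defaults,pri=10   0 0
```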