On 11. Aug 2019, at 23:05, Chris Murphy lists@colorremedies.com wrote:
I think the point at which the mouse pointer has frozen, the user has no practical means of controlling or interacting with the system, it's a failure.
In the short term, is it reasonable and possible, to get the oom killer to trigger sooner and thereby avoid the system becoming unresponsive in the first place? The oom score for most all processes is 0, and niced processes have their oom score increased. I'm not seeing levers to control how aggressive it is, only a way of hinting at which processes can be more readily subject to being killed. In fact, a requirement of oom killer is that swap is completely consumed, which if swap is on anything other than a fast SSD, swapping creates its own performance problems way before oom can be a rescuer. I think I just argued against my own question.
Yes you just did :-)
From what I understand from this LKML thread [1], fast swap on NVMe is only part of the issue (or adds to it). The kernel tries really hard not to OOM-kill anything and to keep the system going. And this overcommitment is where the system eventually becomes unresponsive, to the extent that the machine needs to be hard rebooted.
The LKML thread also mentions that user-space OOM handling could help.
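To make the user-space idea a bit more concrete, here is a rough sketch of the kind of check such a handler might run. It assumes a Linux /proc/meminfo that exposes MemTotal and MemAvailable; the function names and the threshold are made up for illustration and are not taken from the LKML thread or any existing tool.

```shell
# Hypothetical helper: print MemAvailable as an integer percentage of MemTotal.
# Takes an optional path to a meminfo-style file (default: /proc/meminfo).
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }' \
        "${1:-/proc/meminfo}"
}

# Hypothetical helper: succeed if available memory is below $1 percent.
# A real handler would then pick a victim, for example the process with
# the highest value in /proc/<pid>/oom_score, and signal it.
low_memory() {
    pct=$(mem_available_pct "${2:-/proc/meminfo}")
    [ -n "$pct" ] && [ "$pct" -lt "$1" ]
}
```

Something like `low_memory 10 && echo "memory is getting tight"` in a loop with a short sleep would be the simplest form. The point is only that the decision can be made well before swap is exhausted, which the in-kernel OOM killer does not do.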
But what about cgroups? Isn’t there a systemd utility that helps me wrap processes in resource-constrained groups? Something along the lines of:
$ systemd-run -p MemoryLimit=1G firefox
(Not tested.) I imagine that a well-behaved program would handle a failed malloc by exiting cleanly?
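A slightly fuller sketch of that idea, still untested on a real desktop session. It assumes a cgroup v2 system where the memory controller is delegated to the user session, and uses MemoryMax=, which is the cgroup v2 counterpart of MemoryLimit=. The run_capped wrapper and its DRY_RUN switch are hypothetical conveniences, not part of systemd.

```shell
# Hypothetical wrapper: run a command in a transient, memory-capped scope.
# Usage: run_capped LIMIT COMMAND [ARGS...]
# With DRY_RUN=1 the systemd-run command line is printed instead of executed.
run_capped() {
    limit=$1
    shift
    # Rebuild the argument list as the full systemd-run invocation.
    set -- systemd-run --user --scope -p "MemoryMax=$limit" "$@"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, `DRY_RUN=1 run_capped 1G firefox` prints the command that would be run. One caveat, as far as I know: when a cgroup hits its limit, the kernel usually OOM-kills a process inside that group rather than making malloc return NULL, so the program may not get the chance to shut down gracefully. The rest of the system should stay responsive, though.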
BTW, this happens not only on Linux. I’m used to dealing with quite big files in my day job, and if you accidentally write some… em… very unsophisticated code that attempts to read the entire file into memory at once, you can experience the same behavior on a recent macOS, too. You’re left with no option other than force rebooting your machine.
[1] https://lkml.org/lkml/2019/8/4/15
BK