On 11. Aug 2019, at 23:05, Chris Murphy <lists(a)colorremedies.com> wrote:
> I think the point at which the mouse pointer has frozen, the user has
> no practical means of controlling or interacting with the system, it's
> a failure.
> In the short term, is it reasonable and possible, to get the oom
> killer to trigger sooner and thereby avoid the system becoming
> unresponsive in the first place? The oom score for most all processes
> is 0, and niced processes have their oom score increased. I'm not
> seeing levers to control how aggressive it is, only a way of hinting
> at which processes can be more readily subject to being killed. In
> fact, a requirement of oom killer is that swap is completely consumed,
> which if swap is on anything other than a fast SSD, swapping creates
> its own performance problems way before oom can be a rescuer. I think
> I just argued against my own question.
Yes you just did :-)
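For anyone who wants to check those scores on their own machine, here is a minimal Python sketch. It only assumes the standard Linux /proc files oom_score, oom_score_adj and comm; the function names and the proc_root parameter are my own, made up for illustration.

```python
from pathlib import Path


def read_oom_values(pid, proc_root="/proc"):
    """Return (oom_score, oom_score_adj) for one PID as integers."""
    base = Path(proc_root) / str(pid)
    score = int((base / "oom_score").read_text().strip())
    adj = int((base / "oom_score_adj").read_text().strip())
    return score, adj


def top_oom_candidates(proc_root="/proc", limit=10):
    """List (score, adj, pid, name) tuples, highest oom_score first."""
    rows = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            score, adj = read_oom_values(entry.name, proc_root)
            name = (entry / "comm").read_text().strip()
        except (OSError, ValueError):
            continue  # process exited or is unreadable; skip it
        rows.append((score, adj, int(entry.name), name))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    # Only meaningful on Linux, where /proc exposes these files.
    if Path("/proc/self/oom_score").exists():
        for score, adj, pid, name in top_oom_candidates():
            print(f"{score:6d} {adj:6d} {pid:7d} {name}")
```

The oom_score_adj column is the "hinting" lever mentioned above: it shifts which process gets picked, not when the OOM killer fires.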
From what I understand from this LKML thread [1], fast swap on NVMe is only part of the
issue (or adds to it). The kernel tries really hard not to OOM-kill anything
and to keep the system going. This overcommitment is where the system eventually becomes
unresponsive, to the extent that the machine needs a hard reboot.
The LKML thread also mentions that user-space OOM handling could help.
But what about cgroups? Isn’t there a systemd utility that helps me wrap processes in
resource-constrained groups? Something along the lines of:
$ systemd-run -p MemoryLimit=1G firefox
(Not tested.) I imagine that a well-behaved program will handle a failed malloc by
exiting cleanly?
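To make the idea a bit more concrete, here is an equally untested sketch in Python that only builds the systemd-run command line and leaves the actual launch commented out. My assumptions: --user --scope so the program runs inside the desktop session rather than as a system service, and MemoryMax= as the property name on cgroup v2 (MemoryLimit= is the legacy cgroup v1 name). The helper name build_limited_command is made up. As far as I know, a process that exceeds such a limit usually gets killed by the cgroup's own OOM handling rather than seeing malloc fail.

```python
import subprocess


def build_limited_command(program_argv, memory_max="1G", legacy=False):
    """Build a systemd-run argv that wraps a program in a memory-limited scope.

    legacy=True selects the cgroup v1 property name MemoryLimit=,
    otherwise the cgroup v2 name MemoryMax= is used.
    """
    prop = "MemoryLimit" if legacy else "MemoryMax"
    return [
        "systemd-run",
        "--user",   # assumption: use the per-user systemd instance
        "--scope",  # run in the foreground as a transient scope unit
        "-p",
        f"{prop}={memory_max}",
        *program_argv,
    ]


if __name__ == "__main__":
    cmd = build_limited_command(["firefox"])
    print(" ".join(cmd))
    # Uncomment to actually launch (requires a running systemd user session):
    # subprocess.run(cmd, check=True)
```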
BTW, this happens not only on Linux. I’m used to dealing with quite big files in my day
job, and if you accidentally write some… em… very unsophisticated code that attempts to
read the entire file into memory at once, you can experience the same behavior on a recent
macOS, too. You’re left with no option other than force-rebooting your machine.
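For contrast, the less unsophisticated version reads the file in fixed-size chunks, so memory use stays bounded no matter how big the file is. A minimal Python sketch; the function names and the 1 MiB default chunk size are arbitrary choices of mine:

```python
def iter_chunks(path, chunk_size=1024 * 1024):
    """Yield the file's contents as byte chunks of at most chunk_size bytes."""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break  # end of file reached
            yield chunk


def count_newlines(path, chunk_size=1024 * 1024):
    """Count newline bytes while holding only one chunk in memory at a time."""
    return sum(chunk.count(b"\n") for chunk in iter_chunks(path, chunk_size))
```

Calling count_newlines("huge.log") (file name hypothetical) never holds more than one chunk in memory, whereas open(path).read() pulls in the whole file at once.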
[1] https://lkml.org/lkml/2019/8/4/15
BK