On Fr, 03.01.20 14:18, Ben Cotton (bcotton(a)redhat.com) wrote:
> == Summary ==
> Install earlyoom package, and enable it by default. This will cause
> the kernel oomkiller to trigger sooner, but will not affect which
> process it chooses to kill off. The idea is to recover from out of
> memory situations sooner, rather than the typical complete system hang
> in which the user has no other choice but to force power off.
Hmm, are we sure this is something we want to have in the default
install? Is the code really good enough for that?
Looking at the sources very superficially I see a couple of problems:
1. Waking up all the time in 100ms intervals? We generally try to
avoid waking the CPU up all the time if nothing happens. Saving
power and things.
2. New code using system() in the year 2020? Really?
3. Fixed size buffers and implicit, undetected, truncation of strings
at various places (for example, when formatting the shell string to
pass to system()).
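
To illustrate points 2 and 3: the usual alternative to formatting a shell string and handing it to system() is to pass an argument vector straight to the program, so nothing is re-parsed by a shell and nothing has to fit a fixed-size buffer. A minimal sketch in Python (not earlyoom's code, which is C; `run_helper` is a hypothetical name, and the Python interpreter merely stands in for whatever helper would be spawned):

```python
import subprocess
import sys


def run_helper(program, *args, timeout=5.0):
    """Run a helper program without a shell.

    Each element of args becomes exactly one argv entry of the child,
    so quotes, semicolons or spaces in a process name are plain data.
    """
    argv = [program, *args]
    result = subprocess.run(
        argv,
        shell=False,          # no /bin/sh in between, unlike system()
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    # A process name that would break a naively formatted shell string.
    hostile_name = 'evil"; echo pwned; "'
    code, out = run_helper(
        sys.executable, "-c",
        "import sys; sys.stdout.write(sys.argv[1])",
        hostile_name,
    )
    print(code, repr(out))
```

The child receives the name byte for byte, regardless of its length or of any shell metacharacters in it.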
But more importantly: are we sure this actually operates the way it
should? I.e. PSI is really what should be watched. Who uses how much
memory is not the interesting part, and neither is triggering kills
based on that. What matters is detecting when the system becomes slow
because of it, i.e. the *latencies* introduced by memory pressure.
That is what PSI is about, and hence what should be used.
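
For reference, a rough sketch of what watching PSI could look like. The kernel exposes memory pressure in /proc/pressure/memory as a "some" and a "full" line, each carrying avg10/avg60/avg300 percentages and a cumulative total in microseconds. The function names and the 10% threshold below are assumptions for illustration, not anything earlyoom or systemd does:

```python
def parse_psi(text):
    """Parse the contents of a PSI file such as /proc/pressure/memory.

    Returns e.g. {"some": {"avg10": 0.0, ..., "total": 0}, "full": {...}}.
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        values = {}
        for pair in pairs:
            key, _, raw = pair.partition("=")
            # "total" is an integer count of microseconds, the rest are percentages.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


def memory_is_stalling(psi, threshold_percent=10.0):
    """True if all non-idle tasks were stalled on memory for at least
    threshold_percent of the last 10 seconds (threshold is an assumption)."""
    return psi["full"]["avg10"] >= threshold_percent


def read_memory_pressure(path="/proc/pressure/memory"):
    """Read and parse the live PSI file; needs a kernel with PSI enabled."""
    with open(path) as f:
        return parse_psi(f.read())


if __name__ == "__main__":
    sample = (
        "some avg10=12.50 avg60=3.00 avg300=1.00 total=123456\n"
        "full avg10=10.25 avg60=1.00 avg300=0.50 total=654321\n"
    )
    print(memory_is_stalling(parse_psi(sample)))
```

The kernel also lets a process register a PSI trigger and sleep in poll() until the threshold is crossed, which would avoid the periodic wakeups from point 1. That part is left out of the sketch.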
But even if we'd ignore that in order to fight latencies one should
watch latencies: OOM killing per process is just not appropriate on a
systemd system: all our system services (and a good chunk of our user
services too) are sorted neatly into cgroups, and we really should
kill them as a whole and not just individual processes inside
them. systemd manages that today, and makes exceptions configurable
via OOMPolicy=, and with your earlyoom stuff you break that.
This looks like second guessing the kernel memory management folks at
a place where one can only lose, and at the same time breaking correct OOM
reporting by the kernel via cgroups and stuff.
Also: what precisely is this even supposed to do? Replace the
algorithm for detecting *when* to go on a kill rampage? Or actually
replace the algorithm selecting *what* to kill during a kill rampage?
If it's the former (which the name of the project suggests,
_early_oom), then at the very least the tool should let the kernel do
the killing, i.e. "echo f > /proc/sysrq-trigger". That way the
reporting via cgroups isn't fucked, and systemd can still do its
thing, and the kernel can kill per cgroup rather than per process...
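
As a sketch of that suggestion: the userspace side would only decide *when*, and then ask the kernel to run its own OOM killer by writing "f" to /proc/sysrq-trigger. The function name is hypothetical, and the path is a parameter only so the demo can do a dry run against a scratch file; the real write needs root:

```python
import os
import tempfile


def trigger_kernel_oom_killer(path="/proc/sysrq-trigger"):
    """Ask the kernel to invoke its OOM killer (SysRq 'f').

    The kernel still selects the victim, so its own reporting stays intact.
    """
    with open(path, "w") as f:
        f.write("f")


if __name__ == "__main__":
    # Dry run: write to a scratch file instead of the real trigger.
    fd, scratch = tempfile.mkstemp()
    os.close(fd)
    try:
        trigger_kernel_oom_killer(scratch)
        with open(scratch) as f:
            print(repr(f.read()))
    finally:
        os.unlink(scratch)
```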
Anyway, this all sounds very very fishy to me. Not thought through,
and I am pretty sure this is something the kernel memory management
folks should give a blessing to. Second guessing the kernel like that
is just a bad idea if you ask me.
I mean, yes, the OOM killer might not be that great currently, but
this sounds like something to fix in kernel land, and if that doesn't
work out for some reason because kernel devs can't agree, then do it
as a fallback in userspace, but with sound input from the kernel folks,
and the blessing of at least some of the kernel folks.
Lennart Poettering, Berlin