On Mon, 2020-07-20 at 10:55 +0200, Kevin Kofler wrote:
> That said, I do not see how the EarlyOOM heuristic, which allows, depending
> on the exact settings, something like 80-90% of swap to be used IN ADDITION
> to 90+% RAM (and will only start doing anything if BOTH RAM and swap are
> full) can prevent thrashing in any reliable way. My thrashing scenarios have
> had much less swap than that used. (I have twice as much swap than RAM, so
> when the EarlyOOM heuristics trigger, my programs are already trying to use
> almost 3 times as much RAM as is actually available!)

Yeah, I think that EarlyOOM will mostly help in the scenario where you
end up not having enough file caches to run your processes. I do
believe that this is a common scenario where the machine continues to
thrash for long periods[1].
In most cases where plenty of swap is available, the symptoms should be
a lot less severe, i.e. I would expect the machine to become really
sluggish, but things *should* recover much more quickly. Though something
like a fork bomb could still easily cause that kind of prolonged thrashing.
Anyway, I think that to really resolve this, we need to enable the CPU and
IO cgroup controllers. CPU is easy (you just need to set a few properties
in systemd), but there are some roadblocks to getting the IO controller
working.
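
To see where a given machine stands, the enabled controllers can be read
straight from the cgroup filesystem. The following is only a minimal sketch:
it assumes a unified (cgroup v2) hierarchy mounted at /sys/fs/cgroup, and the
function names are made up for illustration.

```python
from pathlib import Path

# Assumption: cgroup v2 (unified hierarchy) mounted at the usual location.
CGROUP_ROOT = Path("/sys/fs/cgroup")


def parse_controllers(text):
    """Split the space-separated controller list used by cgroup v2 files."""
    return set(text.split())


def missing_controllers(enabled_text, wanted=("cpu", "io", "memory")):
    """Return the wanted controllers that are absent from the enabled list."""
    enabled = parse_controllers(enabled_text)
    return sorted(c for c in wanted if c not in enabled)


def check(cgroup=CGROUP_ROOT):
    """Report wanted controllers not enabled for the children of `cgroup`."""
    path = cgroup / "cgroup.subtree_control"
    try:
        return missing_controllers(path.read_text())
    except OSError:
        # Not a cgroup v2 system, or the file is not readable.
        return None


if __name__ == "__main__":
    print("not enabled for children:", check())
```

Passing a deeper path to check(), e.g. CGROUP_ROOT / "system.slice", shows
how far down the tree each controller has actually been enabled.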
Also, we can probably fix systemd-logind hanging simply by assigning a
MemoryLow= allocation of e.g. 60M to system.slice and also setting
either DisableControllers=memory (which creates one large pool of memory
shared by all system services) or DefaultMemoryLow=X (which delegates the
memory protection further down to each service separately).
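
As a rough, untested sketch, a drop-in for system.slice along those lines
could look like the following. The file name is hypothetical, 60M is just
the example value from above, and X is left as a placeholder because no
concrete value has been proposed.

```ini
# /etc/systemd/system/system.slice.d/50-memory-low.conf (hypothetical name)
[Slice]
# Protect roughly 60M for system services as a whole.
MemoryLow=60M

# Option A: keep the memory controller disabled below system.slice, so the
# protected memory forms one shared pool for all system services.
DisableControllers=memory

# Option B (instead of A): pass the protection down to each service.
# X is a placeholder, not a recommended value.
#DefaultMemoryLow=X
```

A "systemctl daemon-reload" should be needed for systemd to pick up the
drop-in.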
Benjamin
[1] To be honest, I am wondering if some process might have actually
filled up all the RAM even in your case. It can be hard to tell later
on what exactly happened.