On Mon, Aug 12, 2019 at 12:30 AM Benjamin Kircher benjamin.kircher@gmail.com wrote:
On 11. Aug 2019, at 23:05, Chris Murphy lists@colorremedies.com wrote:
I think that once the mouse pointer has frozen, the user has no practical means of controlling or interacting with the system; it's a failure.
In the short term, is it reasonable and possible to get the oom killer to trigger sooner, and thereby avoid the system becoming unresponsive in the first place? The oom score for almost all processes is 0, and niced processes have their oom score increased. I'm not seeing levers to control how aggressive it is, only a way of hinting at which processes are more readily subject to being killed. In fact, the oom killer effectively requires that swap be completely consumed first, and if swap is on anything other than a fast SSD, swapping creates its own performance problems long before the oom killer can come to the rescue. I think I just argued against my own question.
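(For reference, the one per-process knob I know of is oom_score_adj under /proc; something like this, untested, inspects a process's score and marks it as a preferred victim:

$ cat /proc/$$/oom_score              # current badness score for this shell
$ echo 500 > /proc/$$/oom_score_adj   # raise it; 1000 means "kill me first"

It's still only a hint about which process dies, not when the oom killer decides to run.)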
Yes, you just did :-)
From what I understand from this LKML thread [1], fast swap on NVMe is only part of the issue (or adds to the issue). The kernel tries really, really hard not to OOM kill anything and to keep the system going. And this overcommitment is where it eventually gets unresponsive to the extent that the machine needs to be hard rebooted.
The LKML thread also mentions that user-space OOM handling could help.
But what about cgroups? Isn't there a systemd utility that helps me wrap processes in resource-constrained groups? Something along the lines of:
$ systemd-run -p MemoryLimit=1G firefox
(Not tested.) I imagine that a well-behaved program will handle a failed malloc by ending itself?
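On a cgroup-v2 setup I believe the equivalent property is MemoryMax=, and swap usage can be capped as well, e.g. (also not tested):

$ systemd-run --user --scope -p MemoryMax=1G -p MemorySwapMax=0 firefox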
BTW, this happens not only on Linux. I'm used to dealing with quite big files during my day job, and if you accidentally write some… em… very unsophisticated code that attempts to read an entire file into memory at once, you can experience the same behavior on a recent macOS, too. You're left with no option other than force rebooting your machine.
If I just run the example program with, say, the systemd MemoryLimit set to the MemAvailable value from /proc/meminfo, the program is still going to try to bust out of that and fail, and the failure reason is non-obvious. Yes, this is definitely an improvement in that the system isn't taken down.
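As a rough sketch of what I mean (untested; ninja here just stands in for the example build, and MemAvailable is read once at launch):

$ avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
$ systemd-run --user --scope -p MemoryMax=${avail_kb}K ninja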
How to do this automatically? Could there be a mechanism for the system and the requesting application to negotiate resources?
One reality is that the system isn't a good estimator of its own responsiveness from the user's point of view. Anytime swap is under significant pressure (what's the definition of significant?), the system is effectively lost, *if* this is a desktop system (which includes laptops). In the example case, once swap is being heavily used, whether on the SSD or on ZRAM, the mouse pointer is frozen anywhere from 50% to 90% of the time. It's not a usable system, well before swap is full. How does the system learn that a light swap rate is OK but a heavy swap rate will lead to an angry user? And even heavy swapping might be OK on NVMe, or on a server.
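The only signals I know of to even watch for this are the raw swap in/out rates, and pressure stall information on kernels that have it enabled, e.g.:

$ vmstat 1                    # watch the si/so columns for sustained swap activity
$ cat /proc/pressure/memory   # PSI, if enabled in the kernel (4.20+)

Neither of those translates directly into "the user is staring at a frozen mouse pointer."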
Right now the only lever to avoid swap is to not create a swap partition at installation time, or to create a smaller one instead of a 1:1 ratio with RAM, or to use a 1/4-RAM-sized swap on ZRAM. A consequence of each of these alternatives is that hibernation can't be used. Fedora already explicitly does not support hibernation, but strictly that only means we don't block a release on hibernation-related bugs. Fedora does still create a swap partition that meets the minimum size for hibernation, and also inserts the required 'resume' kernel parameter to locate the hibernation image at the next boot. So we kinda sorta do support it.
Another reality is that the example program also doesn't have a good way of estimating the resources it needs. It has some levers that just aren't being used by default, including the -l option, which reads "do not start new jobs if the load average is greater than N". But that's different from "tell me the box sizes you can use," with the system then supplying a matching box and the program working within it.
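To be clear, using the lever we do have looks something like this (assuming a ninja- or make-style build):

$ ninja -l $(nproc)    # don't start new jobs if load average exceeds the CPU count

But load average is a CPU-centric proxy; it says nothing about how much memory the next job will want, which is exactly the box-size problem.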