This subject matches a Fedora Workstation Working Group issue of the same name [1]. This post is intended as an independent summary of the findings so far, and a call for additional testing and discussion, in particular from subject matter experts.
Problem and thesis statement: Certain workloads, such as building WebKitGTK from source, result in heavy swap usage, eventually leading to the system becoming totally unresponsive. Look into switching from disk-based swap to swap on a ZRAM device.
Summary of findings (restated, but basically the same as found at [2]): Test system: MacBook Pro, Intel Core i7-2820QM (4 cores / 8 threads), 8GiB RAM, Samsung SSD 840 EVO, Fedora Rawhide Workstation. Test case: build WebKitGTK from source.
$ cmake -DPORT=GTK -DCMAKE_BUILD_TYPE=RelWithDebInfo -GNinja
$ ninja
Case 1: 8GiB swap on SSD plain partition (not encrypted, not on LVM)
Case 2: 8GiB swap on /dev/zram0
In each case, that swap is exclusive; there are no other swap devices. Within ~30 minutes in the first case, and ~10 minutes in the second case, the GUI is completely unresponsive: the mouse pointer has frozen and doesn't recover after more than 30 minutes of waiting. Over remote ssh, the first case is semi-responsive: updates that should arrive every 5 seconds are instead received every 2-5 minutes, and even after another 30 minutes it wasn't possible to compel recovery by cancelling the build process. Over remote ssh, the second case is totally unresponsive: no updates for 30 minutes.
The system was manually forced power off at that point, in both cases. The oom killer never triggered.
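For reference, a swap-on-ZRAM device like the one in Case 2 can be set up along these lines (a sketch using util-linux's zramctl; not necessarily the exact commands used for these tests):

$ sudo modprobe zram
$ dev=$(sudo zramctl --find --size 8G)   # claims the first free /dev/zramN and sets its size
$ sudo mkswap "$dev"
$ sudo swapon "$dev"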
NOTE: ninja by default sets N concurrent jobs to nrcpus + 2, which is 10 on this system. If I reboot with nr_cpus=4, ninja sets N jobs to 6.
Case 3: 2GiB swap on /dev/zram0
In one test this resulted in a system hang (no pointer movement) within 5 minutes of executing ninja; within another 6 minutes the oom killer was invoked on a cc1plus process, which is fatal to the build. The remaining build-related processes quit on their own, and the system eventually recovers.
But in two subsequent tests in this same configuration, the oom killer was never invoked, and the system meandered between being responsive for ~1 minute and totally frozen for 5-6 minutes, in a cycle lasting beyond 1 hour.
Screenshot taken during one of the moments the remote ssh session updated: https://drive.google.com/open?id=1IDboR1fzP4onu_tzyZxsx7M5cT_RJ7Iz
The state had not changed 45 minutes after the above screenshot, so I forced power off on that system. But the point here is that this slightly different configuration has some non-determinism to it, even though in the end it's bad UX either way. The default, unprivileged build command is effectively taking down the system all the same.
Case 4: 8GiB swap on SSD plain partition, `ninja -j 4`
This is the same setup as Case 1, except I manually set N jobs to 4. The build succeeds, and except for a few mouse pointer stutters, the system remains responsive, even with Firefox open with multiple tabs and a YouTube video playing. Exactly the experience we'd like to see, albeit not all CPU resources are used for the build. But clearly the limiting factor is that this particular package requires more than ~14GiB to build successfully, and the system + shell + Firefox just doesn't have that.
Starter questions:

- To what degree, and why, is this problem instigated by the build application (ninja in this example) or its supporting configuration files, including cmake? Or the kernel? Or the system configuration?
- Is it a straightforward problem, or is this actually somewhat nuanced, with multiple components in suboptimal configuration coming together as the cause?
- Is it expected that an unprivileged user can run a command whose defaults eventually lead to a totally unrecoverable system? From a security risk standpoint, the blame can't be entirely on the user or the application configuration, but how should application containment be enforced?
- Other than containerizing the build programs, is there a practical way right now of enforcing CPU and memory limits on unprivileged applications? Other alternatives?
- At the very least, it seems like getting to an oom kill sooner would result in a better experience: fail the process before the GUI becomes unresponsive and hangs for 30+ minutes (possibly many hours).
[1] https://pagure.io/fedora-workstation/issue/98
[2] https://pagure.io/fedora-workstation/issue/98#comment-588713
Thanks,
Hi,
Chris Murphy lists@colorremedies.com writes:
Certain workloads, such as building WebKitGTK from source, result in heavy swap usage, eventually leading to the system becoming totally unresponsive. Look into switching from disk-based swap to swap on a ZRAM device.
It sounds like the same issue that has been in the news recently:
- https://www.phoronix.com/scan.php?page=news_item&px=Linux-Does-Bad-Low-R...
- https://news.ycombinator.com/item?id=20620545
Older sources with more information:
- https://lwn.net/Articles/759658/
- https://superuser.com/questions/1115983/prevent-system-freeze-unresponsivene...
(I learned about this bug the hard way; my machine experienced this bug in the middle of a public presentation a few years ago.)
Regards, Omair
--
PGP Key: B157A9F0 (http://pgp.mit.edu/)
Fingerprint = 9DB5 2F0B FD3E C239 E108 E7BD DF99 7AF8 B157 A9F0
Just in case anyone wants to try to reproduce this particular example:
1. Grab the latest stable release from https://webkitgtk.org/releases/ and untar it
2. Run this included script, which is dnf aware, to install dependencies:
   ./Tools/gtk/install-dependencies
3. Additional packages I had to install to get it to build:
   sudo dnf install ruby-devel openjpeg2-devel woff2-devel
On Fri, 09 Aug 2019 23:50:43 +0200, Chris Murphy wrote:
$ cmake -DPORT=GTK -DCMAKE_BUILD_TYPE=RelWithDebInfo -GNinja
RelWithDebInfo is an -O2 -g build. That is not suitable for debugging; for debugging you should use -DCMAKE_BUILD_TYPE=Debug (that is, -g only). RelWithDebInfo is useful for final rpm packages, but those are built in Koji.
A Debug build will have smaller debug info, so the problem may go away.
If it does not go away, then tune the parallelism. A low -j makes the build needlessly slow during the compilation phase, while a high -j (up to about #cpus + 2 or so) will make the final linking phase, with its debug info, run out of memory. This is why LLVM has a separate "-j" for the linking phase, but that is implemented only in LLVM's CMakeLists.txt files: https://llvm.org/docs/CMake.html (LLVM_PARALLEL_LINK_JOBS). So you leave the default -j high but set LLVM_PARALLEL_LINK_JOBS to 1 or 2.
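For an LLVM build, that would look something like the following (an untested sketch; the ../llvm source path is a placeholder):

$ cmake -GNinja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_PARALLEL_LINK_JOBS=1 ../llvm   # compile with full -j, link serially
$ ninja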
Other options for faster build times are also LLVM specific:
-DLLVM_USE_LINKER=gold (maybe also lld now?) - as ld.gold or ld.lld are faster than ld.bfd
-DLLVM_USE_SPLIT_DWARF=ON - the linking phase no longer deals with the huge debug info
Which should be applicable for other projects by something like (untested!):
-DCMAKE_C_FLAGS="-gsplit-dwarf"
-DCMAKE_CXX_FLAGS="-gsplit-dwarf"
-DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=gold -Wl,--gdb-index"
-DCMAKE_SHARED_LINKER_FLAGS="-fuse-ld=gold -Wl,--gdb-index"
(That gdb-index is useful if you are really going to debug it using GDB, as I expect you will when you want RelWithDebInfo and not Release; but then I would recommend Debug in that case anyway, as debugging optimized code is very difficult.)
is there a practical way right now of enforcing CPU and memory limits on unprivileged applications?
$ help ulimit
  -m  the maximum resident set size
  -u  the maximum number of user processes
  -v  the size of virtual memory
One can also run it with 'nice -n19', 'ionice -c3' and/or "cgclassify -g '*':hammock" (config attached).
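Combined, that could look like the following sketch (the 4 GiB cap is an arbitrary example value; ulimit -v takes KiB, and the subshell keeps the limits from sticking to the login shell):

$ (ulimit -v 4194304; nice -n19 ionice -c3 ninja)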
But after all, I recommend just more memory; it is cheap nowadays, and I find 64GB just about the right size.
Jan
On Fri, Aug 09, 2019 at 03:50:43PM -0600, Chris Murphy wrote: [..]
Problem and thesis statement: Certain workloads, such as building WebKitGTK from source, result in heavy swap usage, eventually leading to the system becoming totally unresponsive. Look into switching from disk-based swap to swap on a ZRAM device.
Summary of findings (restated, but basically the same as found at [2]): Test system: MacBook Pro, Intel Core i7-2820QM (4 cores / 8 threads), 8GiB RAM, Samsung SSD 840 EVO, Fedora Rawhide Workstation. Test case: build WebKitGTK from source.
[..]
To avoid such issues I disable swap on my machines. I really don't see the point of having a swap partition if you have 16 or 32 GiB RAM. Even with 8 GiB I disable swap.
With - say - 8 GiB, the build of a large project might fail (e.g. llvm, e.g. during linking), but it then fails fast and I can just restart it with `ninja -j2` or something like that.
Another source of IO related unresponsiveness is buffer bloat - I thus apply this configuration on my machines:
$ cat /etc/sysctl.d/01-disk-bufferbloat.conf
vm.dirty_background_bytes=107374182
vm.dirty_bytes=214748364
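A fragment like that can be applied without rebooting; sysctl --system reloads all the sysctl.d drop-ins:

$ sudo sysctl --system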
Best regards Georg
On Sat, Aug 10, 2019 at 3:07 AM Jan Kratochvil jan.kratochvil@redhat.com wrote:
On Fri, 09 Aug 2019 23:50:43 +0200, Chris Murphy wrote:
$ cmake -DPORT=GTK -DCMAKE_BUILD_TYPE=RelWithDebInfo -GNinja
RelWithDebInfo is an -O2 -g build. That is not suitable for debugging; for debugging you should use -DCMAKE_BUILD_TYPE=Debug (that is, -g only). RelWithDebInfo is useful for final rpm packages, but those are built in Koji.
I don't follow. You're saying RelWithDebInfo is never suitable for a local build?
I'm not convinced that matters, because what the user-developer is trying to accomplish post-build isn't relevant to getting a successful build. And also, this is just one example of how apparently easy it is to take down a system with an unprivileged task, per the various discussions I've had with members of the Workstation WG.
Anyway, the build fails for a different reason when I use Debug instead of RelWithDebInfo so I can't test it.
In file included from Source/JavaScriptCore/config.h:32,
                 from Source/JavaScriptCore/llint/LLIntSettingsExtractor.cpp:26:
Source/JavaScriptCore/runtime/JSExportMacros.h:32:10: fatal error: wtf/ExportMacros.h: No such file or directory
   32 | #include <wtf/ExportMacros.h>
      |          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
[1131/2911] Building CXX object Sourc...er/preprocessor/DiagnosticsBase.cpp.o
ninja: build stopped: subcommand failed.
A Debug build will have smaller debug info, so the problem may go away.
If it does not go away, then tune the parallelism. A low -j makes the build needlessly slow during the compilation phase, while a high -j (up to about #cpus + 2 or so) will make the final linking phase, with its debug info, run out of memory. This is why LLVM has a separate "-j" for the linking phase, but that is implemented only in LLVM's CMakeLists.txt files: https://llvm.org/docs/CMake.html (LLVM_PARALLEL_LINK_JOBS). So you leave the default -j high but set LLVM_PARALLEL_LINK_JOBS to 1 or 2.
Other options for faster build times are also LLVM specific:
-DLLVM_USE_LINKER=gold (maybe also lld now?) - as ld.gold or ld.lld are faster than ld.bfd
-DLLVM_USE_SPLIT_DWARF=ON - the linking phase no longer deals with the huge debug info
Which should be applicable for other projects by something like (untested!):
-DCMAKE_C_FLAGS="-gsplit-dwarf"
-DCMAKE_CXX_FLAGS="-gsplit-dwarf"
-DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=gold -Wl,--gdb-index"
-DCMAKE_SHARED_LINKER_FLAGS="-fuse-ld=gold -Wl,--gdb-index"
(That gdb-index is useful if you are really going to debug it using GDB, as I expect you will when you want RelWithDebInfo and not Release; but then I would recommend Debug in that case anyway, as debugging optimized code is very difficult.)
is there a practical way right now of enforcing CPU and memory limits on unprivileged applications?
$ help ulimit
  -m  the maximum resident set size
  -u  the maximum number of user processes
  -v  the size of virtual memory
One can also run it with 'nice -n19', 'ionice -c3' and/or "cgclassify -g '*':hammock" (config attached).
Thanks. I'll have to defer to others about how to incorporate this so the default build is more intelligently taking actual resources into account. My strong bias is that the user-developer can't be burdened with knowing esoteric things. The defaults should just work.
Let's take another argument. If the user manually specifies 'ninja -j 64' on this same system, is that sabotage? I'd say it is. And therefore why isn't it sabotage that the ninja default computes N jobs as nrcpus + 2? And also doesn't take available memory into account when deciding what resources to demand? I can build linux all day long on this system with its defaults and never run into a concurrent usability problem.
There does seem to be a dual responsibility, somehow, between the operating system and the application, to make sure sane requests are made and honored.
But after all, I recommend just more memory; it is cheap nowadays, and I find 64GB just about the right size.
That's an optimization. It can't be used as an excuse for an unprivileged task taking down a system.
On Sun, Aug 11, 2019 at 10:50 AM, Chris Murphy lists@colorremedies.com wrote:
Let's take another argument. If the user manually specifies 'ninja -j 64' on this same system, is that sabotage? I'd say it is. And therefore why isn't it sabotage that the ninja default computes N jobs as nrcpus + 2? And also doesn't take available memory into account when deciding what resources to demand? I can build linux all day long on this system with its defaults and never run into a concurrent usability problem.
There does seem to be a dual responsibility, somehow, between the operating system and the application, to make sure sane requests are made and honored.
This seems like a distraction from the real goal here, which is to ensure Fedora remains responsive under heavy memory pressure, and to ensure unprivileged processes cannot take down the system by allocating large amounts of memory. Fixing ninja and make to dynamically scale the number of parallel build processes based on memory pressure would be wonderful, but it's not going to solve the underlying issue here, which is that random user processes should never be able to hang the system.
Michael
On Sun, 11 Aug 2019 17:50:17 +0200, Chris Murphy wrote:
I don't follow. You're saying RelWithDebInfo is never suitable for a local build?
Most of the time. What is your use case for it?
isn't relevant to getting a successful build.
With a powerful enough machine everything is possible. Just be aware that RelWithDebInfo is the most resource-demanding option compared to Release and Debug, and at the same time the least useful one for local builds.
In file included from Source/JavaScriptCore/config.h:32,
                 from Source/JavaScriptCore/llint/LLIntSettingsExtractor.cpp:26:
Source/JavaScriptCore/runtime/JSExportMacros.h:32:10: fatal error: wtf/ExportMacros.h: No such file or directory
You are reinventing the wheel the Fedora packager has already done for this package. I guess you are missing some dependency. If you have a problem, stick to the proven build (unless it is temporarily FTBFS, which this package is not now). I think Fedora recommends mock for such a rebuild, but I find mock inconvenient for local development, so I use this (I have some scripts for it):

dnf download --source webkit2gtk3
mkdir webkit2gtk3-2.24.3-1.fc30.src
cd webkit2gtk3-2.24.3-1.fc30.src
rpm2cpio ../webkit2gtk3-2.24.3-1.fc30.src.rpm | cpio -id
function rpmbuildlocal { time MAKEFLAGS= rpmbuild --define "_topdir $PWD" --define "_builddir $PWD" --define "_rpmdir $PWD" --define "_sourcedir $PWD" --define "_specdir $PWD" --define "_srcrpmdir $PWD" --define "_build_name_fmt %%{NAME}-%%{VERSION}-%%{RELEASE}.%%{ARCH}.rpm" "$@"; rmdir &>/dev/null BUILDROOT; }
# Is the .src.rpm rebuild still needed? https://bugzilla.redhat.com/show_bug.cgi?id=1210276
rpmbuildlocal -bs *.spec
sudo dnf builddep webkit2gtk3-2.24.3-1.fc30.src.rpm
rm webkit2gtk3-2.24.3-1.fc30.src.rpm
rpmbuildlocal -bc webkit2gtk3.spec 2>&1 | tee log   # or -bb, or whatever you want

It has built fine for me here now.
Let's take another argument. If the user manually specifies 'ninja -j 64' on this same system, is that sabotage?
For untrusted users, Linux has given up on that; it is too big a can of worms. Use a virtual machine (KVM) with specified resources (memory size). Nowadays it should also be possible with less overhead by using Docker containers.
If you mean some local builds of your own causing a runaway, then:
(1) Turn off swap, as RAM is cheap enough today. If something really runs out of RAM, it gets killed by the kernel OOM killer.
(2) Have the swap on NVMe; in my experience that does not kill the machine.
(3) Use some reasonable ulimits in your ~/.bash_profile.
(4) When the machine is really unresponsive, log in from a different box and kill the culprits. From my own experience the machine is still able to accept a new SSH connection, albeit a bit slowly.
But yes, I agree this problem has, AFAIK, no perfect solution.
Jan
On Sun, Aug 11, 2019 at 11:21 AM Jan Kratochvil jan.kratochvil@redhat.com wrote:
On Sun, 11 Aug 2019 17:50:17 +0200, Chris Murphy wrote:
I don't follow. You're saying RelWithDebInfo is never suitable for a local build?
Most of the time. What is your use case for it?
My use case is testing the responsiveness of Fedora Workstation under CPU and memory pressure, as experienced by an ordinary user.
In file included from Source/JavaScriptCore/config.h:32,
                 from Source/JavaScriptCore/llint/LLIntSettingsExtractor.cpp:26:
Source/JavaScriptCore/runtime/JSExportMacros.h:32:10: fatal error: wtf/ExportMacros.h: No such file or directory
You are reinventing the wheel the Fedora packager has already done for this package.
That's out of scope.
I said from the outset this is an example. The central topic is that an unprivileged program is able to ask for resources that do not exist, and the operating system tries and fails to supply them, resulting not only in task failure but in the loss of the entire system. In this example the user is doing other things concurrently, and likely experiences data loss and possibly even file system corruption as a direct consequence of having to force power off on the machine, because for all practical purposes normal control has been lost.
Let's take another argument. If the user manually specifies 'ninja -j 64' on this same system, is that sabotage?
For untrusted users, Linux has given up on that; it is too big a can of worms. Use a virtual machine (KVM) with specified resources (memory size). Nowadays it should also be possible with less overhead by using Docker containers.
If you mean some local builds of your own causing a runaway, then:
(1) Turn off swap, as RAM is cheap enough today. If something really runs out of RAM, it gets killed by the kernel OOM killer.
(2) Have the swap on NVMe; in my experience that does not kill the machine.
(3) Use some reasonable ulimits in your ~/.bash_profile.
(4) When the machine is really unresponsive, log in from a different box and kill the culprits. From my own experience the machine is still able to accept a new SSH connection, albeit a bit slowly.
But yes, I agree this problem has, AFAIK, no perfect solution.
I don't think it's acceptable in 2019 that an unprivileged task takes out the entire operating system. As I mentioned in the very first post, remote ssh was not responsive for 30 minutes, at which point I gave up and forced power off. It's a bit of a trap, though, to suggest the user needs the ability and skill to ssh in remotely to kill off runaway programs; I reject that premise.
It's completely sane for an ordinary user to consider that control of the system has been lost immediately upon experiencing a frozen mouse arrow.
On Sun, Aug 11, 2019 at 10:36 AM mcatanzaro@gnome.org wrote:
On Sun, Aug 11, 2019 at 10:50 AM, Chris Murphy lists@colorremedies.com wrote:
Let's take another argument. If the user manually specifies 'ninja -j 64' on this same system, is that sabotage? I'd say it is. And therefore why isn't it sabotage that the ninja default computes N jobs as nrcpus + 2? And also doesn't take available memory into account when deciding what resources to demand? I can build linux all day long on this system with its defaults and never run into a concurrent usability problem.
There does seem to be a dual responsibility, somehow, between the operating system and the application, to make sure sane requests are made and honored.
This seems like a distraction from the real goal here, which is to ensure Fedora remains responsive under heavy memory pressure, and to ensure unprivileged processes cannot take down the system by allocating large amounts of memory. Fixing ninja and make to dynamically scale the number of parallel build processes based on memory pressure would be wonderful, but it's not going to solve the underlying issue here, which is that random user processes should never be able to hang the system.
That's fair.
On Sun, 11 Aug 2019 20:54:28 +0200, Chris Murphy wrote:
and likely experiences data loss and possibly even file system corruption as a direct consequence of having to force power off on the machine because for all practical purposes normal control has been lost.
Not really; this is what a journaling filesystem is there for.
But then there can still be application-level data corruption if an application does not handle its sudden termination properly. That should be rare, but IIRC I did see it, for example, with Firefox.
Jan
On Sun, Aug 11, 2019 at 1:02 PM Jan Kratochvil jan.kratochvil@redhat.com wrote:
On Sun, 11 Aug 2019 20:54:28 +0200, Chris Murphy wrote:
and likely experiences data loss and possibly even file system corruption as a direct consequence of having to force power off on the machine because for all practical purposes normal control has been lost.
Not really; this is what a journaling filesystem is there for.
Successful journal replay obviates the need for fsck; it has nothing to do with avoiding corruption. And in any case, anything the user is working on that isn't already saved and committed to stable media isn't going to survive the poweroff.
But then there can still be application-level data corruption if an application does not handle its sudden termination properly. That should be rare, but IIRC I did see it, for example, with Firefox.
I think that at the point the mouse pointer has frozen, the user has no practical means of controlling or interacting with the system; it's a failure.
In the short term, is it reasonable and possible to get the oom killer to trigger sooner, and thereby avoid the system becoming unresponsive in the first place? The oom score for almost all processes is 0, and niced processes have their oom score increased. I'm not seeing levers to control how aggressive it is, only a way of hinting at which processes can be more readily subject to being killed. In fact, a precondition of the oom killer is that swap is completely consumed, and if swap is on anything other than a fast SSD, swapping creates its own performance problems well before oom can be a rescuer. I think I just argued against my own question.
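For reference, the hinting interface mentioned above lives in procfs (<pid> below is a placeholder); a process can be volunteered as a preferred victim:

$ cat /proc/<pid>/oom_score                       # the effective score the oom killer ranks by
$ echo 500 | sudo tee /proc/<pid>/oom_score_adj   # bias toward being killed; range is -1000..1000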
On 11. Aug 2019, at 23:05, Chris Murphy lists@colorremedies.com wrote:
I think that at the point the mouse pointer has frozen, the user has no practical means of controlling or interacting with the system; it's a failure.
In the short term, is it reasonable and possible to get the oom killer to trigger sooner, and thereby avoid the system becoming unresponsive in the first place? The oom score for almost all processes is 0, and niced processes have their oom score increased. I'm not seeing levers to control how aggressive it is, only a way of hinting at which processes can be more readily subject to being killed. In fact, a precondition of the oom killer is that swap is completely consumed, and if swap is on anything other than a fast SSD, swapping creates its own performance problems well before oom can be a rescuer. I think I just argued against my own question.
Yes you just did :-)
From what I understand from this LKML thread [1], fast swap on NVMe is only part of the issue (or adds to it). The kernel really, really tries hard not to OOM kill anything and to keep the system going. And this overcommitment is where it eventually gets unresponsive to the extent that the machine needs to be hard rebooted.
The LKML thread also mentions that user-space OOM handling could help.
But what about cgroups? Isn’t there a systemd utility that helps me wrap processes in resource-constrained groups? Something along the lines of:
$ systemd-run -p MemoryLimit=1G firefox
(Not tested.) I imagine that a well-behaved program will handle a failed malloc by ending itself?
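A variant that should work with the per-user manager on cgroupsv2 is the following (equally untested; MemoryMax= is the cgroupsv2-era name for the same knob, and it assumes the user's cgroup subtree is delegated):

$ systemd-run --user --scope -p MemoryMax=1G firefox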
BTW, this happens not only on Linux. I’m used to dealing with quite big files during my day job, and if you accidentally write some… em… very unsophisticated code that attempts to read an entire file into memory at once, you can experience the same behavior on recent macOS, too. You’re left with nothing else than force rebooting your machine.
[1] https://lkml.org/lkml/2019/8/4/15
BK
* Chris Murphy:
Summary of findings (restated, but basically the same as found at [2]): Test system: MacBook Pro, Intel Core i7-2820QM (4 cores / 8 threads), 8GiB RAM, Samsung SSD 840 EVO, Fedora Rawhide Workstation.
Do you use the built-in Intel graphics? Can you test with something else?
Thanks, Florian
On 2019-08-12, Florian Weimer fweimer@redhat.com wrote:
Do you use the built-in Intel graphics? Can you test with something else?
Does it have any effect? It happens to me even with a discrete GPU.
As far as I know, integrated graphics do not share physical memory from the point of view of the CPU address space. The physical memory is split between GPU and CPU regions, and the CPU never sees the GPU's physical memory. The IOMMU can be asked to map the GPU's memory into the CPU's virtual address space, as can be done with any PCI card, but the physical memory is always separated. (Although it lives in the same memory chips.) Some BIOSes allow defining the UMA split (the ratio between GPU and CPU memory), but that is out of the control of the operating system and cannot be changed until reset.
What actually happens is that some CPU physical memory is used for GUI program text and some for a block device I/O cache. Both purposes are handled uniformly by Linux. When physical memory is exhausted, the memory allocator starts paging to a swap device. The evil thing is how memory pages are selected to be swapped out: the algorithm swaps out the least recently used ones, and that is often the program text, not the block cache. As a result your GUI becomes unresponsive, because all the physical memory is filled with block cache and the program text has to be reloaded from a block device. And what's worse, this happens even without swap space, because program text pages are backed by a file and thus can be dropped and loaded from the file system later. I.e., program text is always swappable.
A cure would be a fairer memory allocator that could magically discover that a user is more interested in the few megabytes of his window manager than in the gigabytes of a transferred file. The issue is that the allocator does not discriminate. A process can actually provide some hints using madvise(2) and mlock(2), but that does not apply to the program text, nor to the block cache in kernel space. And even if processes provided hints, there could always be some adversarial program abusing the others. Maybe if ulimit were augmented with a maximal block cache usage, and an I/O scheduler accounted for that, it could help.
-- Petr
* Petr Pisar:
On 2019-08-12, Florian Weimer fweimer@redhat.com wrote:
Do you use the built-in Intel graphics? Can you test with something else?
Does it have any effect? It happens to me even with a discrete GPU.
I expect that the GEM shrinker (or rather, the reason why it is needed) radically alters kernel memory management.
Thanks, Florian
On Mon, Aug 12, 2019 at 12:30 AM Benjamin Kircher benjamin.kircher@gmail.com wrote:
On 11. Aug 2019, at 23:05, Chris Murphy lists@colorremedies.com wrote:
I think that at the point the mouse pointer has frozen, the user has no practical means of controlling or interacting with the system; it's a failure.
In the short term, is it reasonable and possible to get the oom killer to trigger sooner, and thereby avoid the system becoming unresponsive in the first place? The oom score for almost all processes is 0, and niced processes have their oom score increased. I'm not seeing levers to control how aggressive it is, only a way of hinting at which processes can be more readily subject to being killed. In fact, a precondition of the oom killer is that swap is completely consumed, and if swap is on anything other than a fast SSD, swapping creates its own performance problems well before oom can be a rescuer. I think I just argued against my own question.
Yes you just did :-)
From what I understand from this LKML thread [1], fast swap on NVMe is only part of the issue (or adds to it). The kernel really, really tries hard not to OOM kill anything and to keep the system going. And this overcommitment is where it eventually gets unresponsive to the extent that the machine needs to be hard rebooted.
The LKML thread also mentions that user-space OOM handling could help.
But what about cgroups? Isn’t there a systemd utility that helps me wrap processes in resource-constrained groups? Something along the lines of:
$ systemd-run -p MemoryLimit=1G firefox
(Not tested.) I imagine that a well-behaved program will handle a failed malloc by ending itself?
BTW, this happens not only on Linux. I’m used to dealing with quite big files during my day job, and if you accidentally write some… em… very unsophisticated code that attempts to read an entire file into memory at once, you can experience the same behavior on recent macOS, too. You’re left with nothing else than force rebooting your machine.
If I just run the example program, let's say with systemd MemoryLimit set to /proc/meminfo MemAvailable, the program is still going to try to bust out of that and fail. The failure reason is also non-obvious. Yes, this is definitely an improvement, in that the system isn't taken down.
How to do this automatically? Could there be a mechanism for the system and the requesting application to negotiate resources?
One reality is, the system isn't a good estimator of system responsiveness from the user's point of view. Anytime swap is under significant pressure (what's the definition of significant?), the system is effectively lost, *if* this is a desktop system (including laptops). In the example case, once swap is being heavily used, on either the SSD or on ZRAM, the mouse pointer is frozen variably 50%-90% of the time. It's not a usable system, well before swap is full. How does the system learn that a light swap rate is OK, but a heavy swap rate will lead to an angry user? And even heavy swap might be OK on NVMe, or on a server.
Right now the only lever to avoid swap is to not create a swap partition at installation time. Or create a smaller one instead of a 1:1 ratio with RAM. Or use a 1/4-RAM-sized swap on ZRAM. A consequence of each of these alternatives is that hibernation can't be used. Fedora already explicitly does not support hibernation, but strictly that means we don't block release on hibernation-related bugs. Fedora does still create a swap that meets the minimum size for hibernation, and also inserts the required 'resume' kernel parameter to locate the hibernation image at the next boot. So we kinda sorta do support it.
Another reality is, the example program also doesn't have a good way of estimating the resources it needs. It has some levers that just aren't being used by default, including the -l option, which reads "do not start new jobs if the load average is greater than N". But that's different from "tell me the box sizes you can use", with the system then supplying a matching box and the program working within it.
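For example, an untested sketch of using that lever (whether it helps depends on how quickly the load average rises):

$ ninja -l "$(nproc)"   # don't start new jobs while the load average exceeds the CPU count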
On Mon, Aug 12, 2019 at 1:01 AM Florian Weimer fweimer@redhat.com wrote:
* Chris Murphy:
Summary of findings (restated, but basically the same as found at [2]): Test system: MacBook Pro, Intel Core i7-2820QM (4 cores / 8 threads), 8GiB RAM, Samsung SSD 840 EVO, Fedora Rawhide Workstation.
Do you use the built-in Intel graphics? Can you test with something else?
Only Intel graphics. The AMD GPU on the test system is non-functional/defective, and the other systems only have Intel graphics. I have tested this in a VM, which I think uses qxl graphics (?), and I get the same results, with a minimal sample size. It seems like the oom happens more often and sooner in the VM, but that might be because the VM is necessarily even more resource constrained than the host. But I have reproduced the total and seemingly indefinite hang. The results aren't completely deterministic, whether bare metal or VM. They're all "failures" in one form or another, but how they fail does differ run to run. And that's expected, because the degree to which I'm simultaneously browsing in Firefox, how many tabs are open, and which other programs are in use vary; the user is a cause of that non-determinism and is a relevant factor.
On Mo, 12.08.19 09:40, Chris Murphy (lists@colorremedies.com) wrote:
How to do this automatically? Could there be a mechanism for the system and the requesting application to negotiate resources?
Ideally, GNOME would run all its apps as systemd --user services. We could then set DefaultMemoryHigh= globally for the systemd --user instance to some percentage value (which is taken relative to the physical RAM size). This would then mean every user app individually could use — let's say — 75% of the physical RAM size and when it wants more it would be penalized during reclaim compared to apps using less.
If GNOME would run all apps as user services we could do various other nice things too. For example, it could dynamically assign the fg app more CPU/IO weight than the bg apps, if the system is starved of both.
Right now the only lever to avoid swap is to not create a swap partition at installation time. Or create a smaller one instead of a 1:1 ratio with RAM. Or use a 1/4-RAM-sized swap on ZRAM. A consequence of each of these alternatives is that hibernation can't be used. Fedora already explicitly does not support hibernation, but strictly that means we don't block release on hibernation-related bugs. Fedora does still create a swap that meets the minimum size for hibernation, and also inserts the required 'resume' kernel parameter to locate the hibernation image at the next boot. So we kinda sorta do support it.
We could add a mode to systemd's hibernation support to only "swapon" a swap partition immediately before hibernating, and "swapoff" it right after coming back. This has been proposed before, but no one has done the work on it so far. But quite frankly, this feels just like taping over the fact that the Linux kernel is rubbish when it comes to swapping...
Another reality is, the example program also doesn't have a good way of estimating the resources it needs. It has some levers that just aren't being used by default, including the -l option, which reads "do not start new jobs if the load average is greater than N". But that's different from "tell me the box sizes you can use", with the system then supplying a matching box and the program working within it.
As suggested above, I think DefaultMemoryHigh=75% would be an OK approach, which would allow us to adjust to the "beefiness" of a machine automatically.
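The per-unit knob already exists today, so one can experiment with something like this (a sketch assuming cgroupsv2; per the systemd docs, a percentage for MemoryHigh= is taken relative to physical RAM; ./heavy-task is a placeholder):

$ systemd-run --user --scope -p MemoryHigh=75% ./heavy-task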
Lennart
-- Lennart Poettering, Berlin
On 12. Aug 2019, at 17:40, Chris Murphy lists@colorremedies.com wrote:
If I just run the example program, let's say with systemd MemoryLimit set to /proc/meminfo MemAvailable, the program is still going to try to bust out of that and fail. The failure reason is also non-obvious. Yes, this is definitely an improvement, in that the system isn't taken down.
How to do this automatically? Could there be a mechanism for the system and the requesting application to negotiate resources?
Honestly, right now, doing this automatically is not possible.
Instead, we anticipate the workload or the nature of the work. Just as when we connect remotely to a box and start some long-running process, we anticipate trouble with the network and use a terminal multiplexer, right? Same thing with resource-intensive processes.
But in the future, I could imagine that this whole control group mechanism really pays off, in a way where we distribute system resources automatically.
Isn’t that what Silverblue is all about? Having a base system, and on top of that everything runs in a container that could potentially be resource constrained?
BK
On 12. Aug 2019, at 18:16, Lennart Poettering mzerqung@0pointer.de wrote:
On Mo, 12.08.19 09:40, Chris Murphy (lists@colorremedies.com) wrote:
How to do this automatically? Could there be a mechanism for the system and the requesting application to negotiate resources?
Ideally, GNOME would run all its apps as systemd --user services. We could then set DefaultMemoryHigh= globally for the systemd --user instance to some percentage value (which is taken relative to the physical RAM size). This would then mean every user app individually could use — let's say — 75% of the physical RAM size and when it wants more it would be penalized during reclaim compared to apps using less.
If GNOME would run all apps as user services we could do various other nice things too. For example, it could dynamically assign the fg app more CPU/IO weight than the bg apps, if the system is starved of both.
I really like the ideas. Why isn’t this done this way anyway?
I don’t have a GNOME desktop at hand right now to investigate how GNOME starts applications and so on but aren’t new processes started by the user — GNOME or not — always children of the user.slice? Is there a difference if I start a GNOME application or a normal process from my shell?
And for the beginning, wouldn’t it be enough to differentiate between user slices and system slice and set DefaultMemoryHigh= in a way to make sure there is always some headroom left for the system?
BK
(… I definitely need to play around with Silverblue to learn what they are doing.)
For what it's worth, my research group attacked basically exactly this problem some time ago. We built a modified Linux kernel that we called Redline that was utterly resilient to fork bombs, malloc bombs, and so on. No process could take down the system, much less unprivileged ones. I think some of the ideas we described back then would be worth adopting / adapting today (the code is of course hopelessly out of date: we published our paper on this at OSDI 2008).
We had a demo where we would run two identical systems, side by side, with the same workloads (a number of videos playing simultaneously), but with one running Redline, and the other running stock Linux. We would launch a fork/malloc bomb on both. The Redline system barely hiccuped. The stock Linux kernel would freeze and become totally unresponsive (or panic). It was a great demo, but also a pain, since we invariably had to restart the stock Linux box :).
Redline: first class support for interactivity in commodity operating systems
While modern workloads are increasingly interactive and resource-intensive (e.g., graphical user interfaces, browsers, and multimedia players), current operating systems have not kept up. These operating systems, which evolved from core designs that date to the 1970s and 1980s, provide good support for batch and command-line applications, but their ad hoc attempts to handle interactive workloads are poor. Their best-effort, priority-based schedulers provide no bounds on delays, and their resource managers (e.g., memory managers and disk I/O schedulers) are mostly oblivious to response time requirements. Pressure on any one of these resources can significantly degrade application responsiveness.
We present Redline, a system that brings first-class support for interactive applications to commodity operating systems. Redline works with unaltered applications and standard APIs. It uses lightweight specifications to orchestrate memory and disk I/O management so that they serve the needs of interactive applications. Unlike realtime systems that treat specifications as strict requirements and thus pessimistically limit system utilization, Redline dynamically adapts to recent load, maximizing responsiveness and system utilization. We show that Redline delivers responsiveness to interactive applications even in the face of extreme workloads including fork bombs, memory bombs and bursty, large disk I/O requests, reducing application pauses by up to two orders of magnitude.
Paper here (in case the attachment fails):
https://www.usenix.org/legacy/events/osdi08/tech/full_papers/yang/yang.pdf
And links to code here:
https://emeryberger.com/research/redline/
There has been some recent follow-on work in this direction: see this work out of Remzi and Andrea's lab at Wisconsin: http://pages.cs.wisc.edu/~remzi/Classes/739/Fall2016/Papers/splitio-sosp15.p...
-- emery
-- Professor Emery Berger College of Information and Computer Sciences University of Massachusetts Amherst www.emeryberger.org, @emeryberger
On Mon, Aug 12, 2019 at 10:07 AM Benjamin Kircher benjamin.kircher@gmail.com wrote:
On 12. Aug 2019, at 18:16, Lennart Poettering mzerqung@0pointer.de wrote:
On Mo, 12.08.19 09:40, Chris Murphy (lists@colorremedies.com) wrote:
How to do this automatically? Could there be a mechanism for the system and the requesting application to negotiate resources?
Ideally, GNOME would run all its apps as systemd --user services. We could then set DefaultMemoryHigh= globally for the systemd --user instance to some percentage value (which is taken relative to the physical RAM size). This would then mean every user app individually could use — let's say — 75% of the physical RAM size and when it wants more it would be penalized during reclaim compared to apps using less.
If GNOME would run all apps as user services we could do various other nice things too. For example, it could dynamically assign the fg app more CPU/IO weight than the bg apps, if the system is starved of both.
I really like the ideas. Why isn’t this done this way anyway?
I don’t have a GNOME desktop at hand right now to investigate how GNOME starts applications and so on but aren’t new processes started by the user — GNOME or not — always children of the user.slice? Is there a difference if I start a GNOME application or a normal process from my shell?
And for the beginning, wouldn’t it be enough to differentiate between user slices and system slice and set DefaultMemoryHigh= in a way to make sure there is always some headroom left for the system?
BK
(… I definitely need to play around with Silverblue to learn what they are doing.)
On Mo, 12.08.19 19:06, Benjamin Kircher (benjamin.kircher@gmail.com) wrote:
On 12. Aug 2019, at 18:16, Lennart Poettering mzerqung@0pointer.de wrote:
On Mo, 12.08.19 09:40, Chris Murphy (lists@colorremedies.com) wrote:
How to do this automatically? Could there be a mechanism for the system and the requesting application to negotiate resources?
Ideally, GNOME would run all its apps as systemd --user services. We could then set DefaultMemoryHigh= globally for the systemd --user instance to some percentage value (which is taken relative to the physical RAM size). This would then mean every user app individually could use — let's say — 75% of the physical RAM size and when it wants more it would be penalized during reclaim compared to apps using less.
If GNOME would run all apps as user services we could do various other nice things too. For example, it could dynamically assign the fg app more CPU/IO weight than the bg apps, if the system is starved of both.
I really like the ideas. Why isn’t this done this way anyway?
Well, let's just say certain popular container managers blocked switching to cgroupsv2, and only in cgroupsv2 is delegating cgroup subtrees to unprivileged users safe. Hence, doing this kind of resource management wasn't really doable without ugly hacks.
But it appears cgroupsv2 has a chance of becoming a reality on Fedora now, and this opens a lot of doors.
I don’t have a GNOME desktop at hand right now to investigate how GNOME starts applications and so on but aren’t new processes started by the user — GNOME or not — always children of the user.slice? Is there a difference if I start a GNOME application or a normal process from my shell?
Well, "user.slice" is a concept of the *system* service manager, but desktop apps are if anything a concept of the *per-user* service manager.
And for the beginning, wouldn’t it be enough to differentiate between user slices and system slice and set DefaultMemoryHigh= in a way to make sure there is always some headroom left for the system?
From the system service manager's PoV, all user apps together make up the user's 'user@.service' instance; it doesn't look below.
I.e., cgroups is hierarchical, and various components can manage their own subtrees. PID 1 manages the top of the tree, and the per-user service manager manages a subtree below it and arranges per-user apps below that. But from PID 1's PoV each of those per-user subtrees is opaque, and it won't do resource management beneath that boundary. It's the job of the per-user service manager to do resource management there.
Lennart
-- Lennart Poettering, Berlin
On Mon, Aug 12, 2019 at 11:07 AM Benjamin Kircher benjamin.kircher@gmail.com wrote:
(… I definitely need to play around with Silverblue to learn what they are doing.)
I'm pretty sure Silverblue will be rebased on Fedora CoreOS, which recently released a preview. I'm not sure what the time frame for that is, but maybe that work will be concurrent with work on a release version of Fedora CoreOS. The central means of installing/uninstalling and running applications on a future immutable system is flatpak. But you don't need to commit a system to Silverblue to use and test flatpak applications on Fedora 29/30 Workstation. Containerization is an option, not a requirement, of flatpaks, as is running one as a systemd --user instance.
Since layering is permitted with rpm-ostree based systems, using overlayfs, there still needs to be some way for the per-user service manager to enforce limits on unprivileged programs. The use of the word "limit" might be misleading. Perhaps the focus should instead be on defining and preserving user interface responsiveness, whether CLI or GUI, so that control isn't lost. I.e., the unprivileged program gets the leftover resources; it's not a peer with the user interface. Promoting the active user interface relative to the unprivileged task would provide a way of effectively containing unprivileged tasks, with one always able to preempt the other.
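A rough sketch of that idea with today's knobs (assuming cgroupsv2; the weights are arbitrary example values) would deprioritize the build rather than hard-limit it:

$ systemd-run --user --scope -p CPUWeight=20 -p IOWeight=20 ninja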
On Sun, Aug 11, 2019 at 2:57 AM Georg Sauthoff mail@georg.so wrote:
On Fri, Aug 09, 2019 at 03:50:43PM -0600, Chris Murphy wrote: [..]
Problem and thesis statement: Certain workloads, such as building WebKitGTK from source, result in heavy swap usage, eventually leading to the system becoming totally unresponsive. Look into switching from disk-based swap to swap on a ZRAM device.
Summary of findings (restated, but basically the same as found at [2]): Test system: MacBook Pro, Intel Core i7-2820QM (4 cores / 8 threads), 8GiB RAM, Samsung SSD 840 EVO, Fedora Rawhide Workstation. Test case: build WebKitGTK from source.
[..]
To avoid such issues I disable swap on my machines. I really don't see the point of having a swap partition if you have 16 or 32 GiB RAM. Even with 8 GiB I disable swap.
Disabling swap doesn't avoid the issues; it can in fact make them worse.
If you have apps allocating memory, they don't always OOM before the kernel tries to evict text pages; and since SSDs are fast, it then tries to pull back in those text pages before realising it's thrashing (that is what most of the latest round of articles has been about). Something like Firefox runs with no swap, starts to need more memory than the system has, parts of the Firefox executable get paged out, but then they are needed for Firefox to use the RAM, and round in circles it goes.
Having swap is still, in this day and age, better for your system than not having it.
Dave.
On Mon, Aug 12, 2019 at 6:31 PM David Airlie airlied@redhat.com wrote:
On Sun, Aug 11, 2019 at 2:57 AM Georg Sauthoff mail@georg.so wrote:
On Fri, Aug 09, 2019 at 03:50:43PM -0600, Chris Murphy wrote: [..]
Problem and thesis statement: Certain workloads, such as building WebKitGTK from source, result in heavy swap usage, eventually leading to the system becoming totally unresponsive. Look into switching from disk-based swap to swap on a ZRAM device.
Summary of findings (restated, but basically the same as found at [2]): Test system: MacBook Pro, Intel Core i7-2820QM (4 cores / 8 threads), 8GiB RAM, Samsung SSD 840 EVO, Fedora Rawhide Workstation. Test case: build WebKitGTK from source.
[..]
To avoid such issues I disable swap on my machines. I really don't see the point of having a swap partition if you have 16 or 32 GiB RAM. Even with 8 GiB I disable swap.
Disabling swap doesn't avoid the issues; it can in fact make them worse.
If you have apps allocating memory, they don't always OOM before the kernel tries to evict text pages; and since SSDs are fast, it then tries to pull back in those text pages before realising it's thrashing (that is what most of the latest round of articles has been about). Something like Firefox runs with no swap, starts to need more memory than the system has, parts of the Firefox executable get paged out, but then they are needed for Firefox to use the RAM, and round in circles it goes.
Having swap is still, in this day and age, better for your system than not having it.
I agree that it's better to have swap for incidental swap purposes, rather than having random things abruptly hit with oom kills. I say random because I see that oom_score_adj is the same for every process other than systemd-udev, auditd, sshd, and dbus. Plausibly the shell could get oom killed without warning, taking out the entire user session, all apps, and all the build processes.
I just discovered in the log from yesterday that iotop was subject to the oom killer, rather than one of the large cc1plus processes, which is what I'd previously consistently witnessed. So iotop and cc1plus must be in the same ballpark oom-score-wise, and the oom killer just so happens to pick one or the other. iotop going away relieved just enough memory that nothing else was subject to the oom killer, and yet processes were clearly resource starved nevertheless: the GUI was frozen, and other processes had already been dying due to timeouts, for example:
Aug 11 18:26:57 fmac.local systemd[1]: sssd-kcm.service: Control process exited, code=killed, status=15/TERM
Aug 11 18:26:57 fmac.local systemd[1]: sssd-kcm.service: Failed with result 'timeout'.
Aug 11 18:27:00 fmac.local systemd[1]: systemd-journald.service: State 'stop-sigterm' timed out. Killing.
Aug 11 18:27:00 fmac.local systemd[1]: systemd-journald.service: Killing process 31010 (systemd-journal) with signal SIGKILL.
Aug 11 18:27:00 fmac.local systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
This is like a train wreck where all sorts of interesting sub-failures are happening. At one point I think: well, we need better oom scores so the truly least important process is killed off. But upon big-picture scrutiny, the system is failing before the oom killer has even been triggered. Processes are dying with timeouts. The GUI, including the mouse pointer, is frozen even when swap is only half full. Practically speaking, it's a goner the moment the mouse pointer froze the very first time. I might tolerate some stuttering here and there, but minutes of frozen state? Nah, I'm not interested in finding out whether this is another 5 minutes of choke, or 5 days.
And that's the bad side of swap: when the system is more than incidentally using it, and is depending on it. And apparently nothing is on a deadline timer, if things can just start timing out on their own, including the system journal! That was a surprise to see. If it was that hung up, maybe I can't trust the journal entry times or order, and maybe important entries were lost.
On 10 Aug 2019, at 17:56, Georg Sauthoff mail@georg.so wrote:
On Fri, Aug 09, 2019 at 03:50:43PM -0600, Chris Murphy wrote: [..]
Problem and thesis statement: Certain workloads, such as building WebKitGTK from source, result in heavy swap usage, eventually leading to the system becoming totally unresponsive. Look into switching from disk-based swap to swap on a ZRAM device.
Summary of findings (restated, but basically the same as found at [2]): Test system: MacBook Pro, Intel Core i7-2820QM (4 cores / 8 threads), 8GiB RAM, Samsung SSD 840 EVO, Fedora Rawhide Workstation. Test case: build WebKitGTK from source.
[..]
To avoid such issues I disable swap on my machines. I really don't see the point of having a swap partition if you have 16 or 32 GiB RAM. Even with 8 GiB I disable swap.
https://chrisdown.name/2018/01/02/in-defence-of-swap.html is worth a read - TL;DR: the kernel (pre 4.0) used to be awful about swap, but modern kernels use it to avoid paging out executable (file-backed) pages when memory is low. If any paging is needed, lack of swap means that the kernel will page out active code before it gets as far as an OOM kill, resulting in a longer time to recover from memory contention (regardless of whether there's an OOM kill or the system recovers naturally).
Further, a sensible amount of swap (say 2 GiB or so) means that unused anonymous pages (e.g. data that's left over from initialization, or data that will only be needed when a process exits) can be swapped out and left on disk, freeing up valuable RAM for useful work.
Basically, a sane amount of swap is healthy - old advice about large amounts of swap is not.
With - say - 8 GiB, the build of a large project might fail (e.g. llvm, e.g. during linking), but it then fails fast and I can just restart it with `ninja -j2` or something like that.
Another source of IO related unresponsiveness is buffer bloat - I thus apply this configuration on my machines:
$ cat /etc/sysctl.d/01-disk-bufferbloat.conf
vm.dirty_background_bytes=107374182
vm.dirty_bytes=214748364
Best regards
Georg
--
'Time your programs before making claims about efficiency' (Bjarne Stroustrup, The C++ Programming Language, 4th ed., p. 132, 2013)
(Oops, sorry, re-post because I messed up the threading.)
I'm not a developer, nor do I pretend to understand the nuances of memory management. But I signed up for this list just to say "thanks" to all the devs and others that are finally discussing what I consider to be one of the biggest problems with Linux on the desktop.
My experience with desktop Linux distros on SSDs, when a few processes start to leak memory or I launch a new program when the system is right at its limits, is a full system hang where only the mouse occasionally moves jerkily and I can't switch to a virtual terminal. I recently learned the SysRq trick to invoke the OOM killer, but I personally think the kernel should deal with that, not the user. As unfortunate as it is for the OOM killer to have to kill something at random, I am of the opinion that the OS should *never* lock up, period. I would strongly prefer that one application get killed instead of losing all my applications and working data to a necessary hard reboot.
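For anyone unfamiliar with the trick mentioned above: assuming the sysrq interface is enabled (the kernel.sysrq sysctl), the OOM killer can be invoked manually with Alt+SysRq+F, or from a shell:

$ echo f | sudo tee /proc/sysrq-trigger   # 'f' asks the kernel to kill the biggest memory hog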
I don't know if this helps or not, but anecdotally I started seeing this issue *after* SSDs became more common, i.e. I don't think I ever experienced it with spinning rust. Maybe it's something to do with the vastly faster I/O of an SSD, which allows it to more quickly saturate the RAM before the OOM killer has time to react?
Also, I've had relatively low-memory KVM guests running on a VPS under very high load, and they never lock up. The OOM killer does occasionally kick in, but the affected daemon or systemd service restarts, and it's amazingly undramatic. It appears that this issue only occurs with Xorg (and I imagine Wayland) and "desktop" usage.
As for the problem of the randomness of the OOM killer, couldn't it be made to take into account the PID and/or how long the process has been running? Normally Xorg (and I assume Wayland stuff) gets started before the other desktop programs that tend to consume a lot of memory. So if it's a higher PID and/or has been running for less time, give it a higher score for killability.
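For what it's worth, the kernel already exposes a per-process knob along these lines: /proc/<pid>/oom_score_adj, ranging from -1000 (never kill) to +1000 (kill first). A rough sketch of protecting the display server, assuming the process is named Xorg:

# make the display server a much less attractive OOM-kill target
$ echo -900 | sudo tee /proc/$(pgrep -x Xorg)/oom_score_adj

It doesn't take PID ordering or runtime into account automatically, though; something in userspace would have to set it.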
In my experience on a system with 8GB of RAM and an SSD, the amount of swap space makes no difference. I've tried with no swap space, with 2GB, with 8GB, etc, and it still hangs under high memory usage. I've also tried tuning a lot of sysctl parameters such as vm.swappiness, vm.vfs_cache_pressure, and vm.min_free_kbytes, to no avail.
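For reference, that tuning was along these lines (values illustrative, and as said, none of it helped):

$ sudo sysctl vm.swappiness=10 vm.vfs_cache_pressure=50 vm.min_free_kbytes=65536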
Don't know if this helps, but here are some additional discussions of Linux unresponsiveness under low memory situations from a layman's perspective:
- osnews.com/story/130117/kde-usability-and-productivity-are-we-there-yet/ (in the comments)
- unix.stackexchange.com/questions/373312/oom-killer-doesnt-work-properly-leads-to-a-frozen-os
- bbs.archlinux.org/viewtopic.php?id=233843
- askubuntu.com/questions/432809/why-is-kswapd0-running-on-a-computer-with-no-swap/432827#432827
- unix.stackexchange.com/questions/24625/how-to-completely-disable-swap/24646#24646
Thanks again to everyone for looking into this!
The BFQ scheduler helps a lot with this issue. I've been using it on Fedora since the 4.19 kernel. There was also a previous discussion about making it the default for Workstation: https://lists.fedoraproject.org/archives/list/kernel@lists.fedoraproject.org...
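In case anyone wants to try it, the active scheduler can be switched per device at runtime (sda is illustrative; a udev rule or kernel parameter makes it persistent):

$ echo bfq | sudo tee /sys/block/sda/queue/scheduler
$ cat /sys/block/sda/queue/scheduler    # active scheduler is shown in brackets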
On Thu, Aug 15, 2019 at 1:51 AM Artem Tim ego.cordatus@gmail.com wrote:
The BFQ scheduler helps a lot with this issue. I've been using it on Fedora since the 4.19 kernel. There was also a previous discussion about making it the default for Workstation: https://lists.fedoraproject.org/archives/list/kernel@lists.fedoraproject.org...
It's mentioned in the workstation issue as having no effect in this case. https://pagure.io/fedora-workstation/issue/98
I just switched to it and repeated the test case, and the GUI still hangs and is unresponsive, even without substantial pressure on the SSD, and swap isn't even half used. https://drive.google.com/open?id=13_5XIBMu01HfOdzGVH-4qTgd-PLpFsaN
But I am getting something new in kernel messages:
sysrq+t at ~542s, during a GUI freeze that lasted over 1 minute, and then:
[ 718.068633] fmac.local kernel: SLUB: Unable to allocate memory on node -1, gfp=0x900(GFP_NOWAIT|__GFP_ZERO)
[ 718.068636] fmac.local kernel: cache: page->ptl, object size: 72, buffer size: 72, default order: 0, min order: 0
[ 718.068639] fmac.local kernel: node 0: slabs: 296, objs: 16576, free: 0
[ 718.068704] fmac.local kernel: chronyd: page allocation failure: order:0, mode:0x800(GFP_NOWAIT), nodemask=(null),cpuset=/,mems_allowed=0
Not sure what to make of that. Complete 'journalctl -k' is here: https://drive.google.com/open?id=1Z1jAjMrmdXAxuSELdFfd4IKdceeufmVu
On Thu, Aug 15, 2019 at 2:19 PM Chris Murphy lists@colorremedies.com wrote:
[ 718.068633] fmac.local kernel: SLUB: Unable to allocate memory on node -1, gfp=0x900(GFP_NOWAIT|__GFP_ZERO)
[ 718.068636] fmac.local kernel: cache: page->ptl, object size: 72, buffer size: 72, default order: 0, min order: 0
[ 718.068639] fmac.local kernel: node 0: slabs: 296, objs: 16576, free: 0
[ 718.068704] fmac.local kernel: chronyd: page allocation failure: order:0, mode:0x800(GFP_NOWAIT), nodemask=(null),cpuset=/,mems_allowed=0
Not sure what to make of that.
Asked on #fedora-kernel, it's a known issue with 5.3.0-rc4 and drm.
On Fri, Aug 16, 2019 at 7:48 AM Chris Murphy lists@colorremedies.com wrote:
On Thu, Aug 15, 2019 at 2:19 PM Chris Murphy lists@colorremedies.com wrote:
[ 718.068633] fmac.local kernel: SLUB: Unable to allocate memory on node -1, gfp=0x900(GFP_NOWAIT|__GFP_ZERO)
[ 718.068636] fmac.local kernel: cache: page->ptl, object size: 72, buffer size: 72, default order: 0, min order: 0
[ 718.068639] fmac.local kernel: node 0: slabs: 296, objs: 16576, free: 0
[ 718.068704] fmac.local kernel: chronyd: page allocation failure: order:0, mode:0x800(GFP_NOWAIT), nodemask=(null),cpuset=/,mems_allowed=0
Not sure what to make of that.
Asked on #fedora-kernel, it's a known issue with 5.3.0-rc4 and drm.
Nope, it's not that.
Something has leaked all your memory (not drm).
Dave.
On Mon, Aug 12, 2019 at 10:20 AM Lennart Poettering mzerqung@0pointer.de wrote:
On Mo, 12.08.19 09:40, Chris Murphy (lists@colorremedies.com) wrote:
Right now the only lever to avoid swap is to not create a swap partition at installation time. Or create a smaller one instead of a 1:1 ratio with RAM. Or use a 1/4 RAM sized swap on ZRAM. A consequence of each of these alternatives is that hibernation can't be used. Fedora already explicitly does not support hibernation, but strictly that means we don't block release on hibernation related bugs. Fedora does still create a swap partition that meets the minimum size for hibernation, and also inserts the required 'resume' kernel parameter to locate the hibernation image at the next boot. So we kinda sorta do support it.
We could add a mode to systemd's hibernation support to only "swapon" a swap partition immediately before hibernating, and "swapoff" it right after coming back. This has been proposed before, but no one so far did the work on it. But quite frankly this feels just like taping over the fact that the Linux kernel is rubbish when it comes to swapping...
I'm skeptical as well. But to further explore this:
1. Does the kernel know better than to write a hibernation image (all or part) to a /dev/zram device? e.g. a system with: 8GiB RAM, 8GiB swap on ZRAM, 8GiB swap partition. We can use swap priority to use the ZRAM device first, and conventional swap partition second (that arrangement is sketched below, after question 2). If the user, today, were to hibernate, what happens?
2. Are you suggesting it would be possible to build support for multiple swaps and have them dynamically enabled/disabled? e.g. the same system as above, but the 8GiB swap on disk is actually made across two partitions. i.e. a 2GiB partition and 6GiB partition. Normal operation would call for swapon for /dev/zram *and* the small on-disk swap. Only for hibernation would swapon happen for the larger on-disk swap partition (the 2GiB one always stays on).
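For concreteness, the priority arrangement in question 1 would look something like this (device names illustrative; the higher priority device is used first):

$ sudo swapon -p 100 /dev/zram0    # compressed RAM swap, filled first
$ sudo swapon -p 10 /dev/sda3      # disk partition, spillover
$ swapon --show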
That's... interesting. It sounds potentially complicated. I can't estimate if it could be fragile.
Let's consider something else: Hibernation is subject to kernel lockdown policy on UEFI Secure Boot enabled computers. What percentage of Fedora users these days are likely subject to this lockdown? Are we able to effectively support hibernation? On the one hand, Fedora does not block on hibernation bugs (kernel or firmware), thus not supported. But tacitly hibernation is supported because a bunch of users pushed an effort with Anaconda folks to make sure the swap device is set with "resume=" boot parameter with out of the box installations.
Another complicating issue: the Workstation working group has an issue to explore better protecting user data by encrypting /home by default. Of course, user data absolutely can and does leak into swap. Therefore I think we're obligated to consider encrypting swap too. And if swap is encrypted, how does resume from hibernation work? I guess kernel+initramfs load, and plymouth asks for passphrase which unlocks encrypted swap, and the kernel knows to resume from that device-mapper device?
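Roughly, I'd expect the pieces to look like this today with LUKS (names illustrative, UUID elided):

# /etc/crypttab: unlock the swap partition at boot, plymouth prompting for the passphrase
swap  UUID=<swap-partition-uuid>  none  luks

# kernel command line: resume from the unlocked device-mapper device
resume=/dev/mapper/swap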
I'm really skeptical of pissing off users who want hibernation to work. But I'm also very skeptical of compromising other priorities, and diverting resources, just for hibernation.
If you wait long enough between replies, I will find another log to throw on this fire, somewhere. :-D
On Mo, 19.08.19 13:58, Chris Murphy (lists@colorremedies.com) wrote:
I'm skeptical as well. But to further explore this:
- Does the kernel know better than to write a hibernation image (all
or part) to a /dev/zram device? e.g. a system with: 8GiB RAM, 8GiB swap on ZRAM, 8GiB swap partition. We can use swap priority to use the ZRAM device first, and conventional swap partition second. If the user, today, were to hibernate, what happens?
Userspace takes care of this. It tells the kernel which swap device to hibernate to, and it nowadays understands that zswap is not a candidate, and picks the largest swap with the highest prio these days:
https://github.com/systemd/systemd/blob/master/src/shared/sleep-config.c#L18...
- Are you suggesting it would be possible to build support for
multiple swaps and have them dynamically enabled/disabled? e.g. the same system as above, but the 8GiB swap on disk is actually made across two partitions. i.e. a 2GiB partition and 6GiB partition. Normal operation would call for swapon for /dev/zram *and* the small on-disk swap. Only for hibernation would swapon happen for the larger on-disk swap partition (the 2GiB one always stays on).
Yes, that's what I was suggesting.
That's... interesting. It sounds potentially complicated. I can't estimate if it could be fragile.
Yeah. It's an idea. Not sure it's a good one though.
Let's consider something else: Hibernation is subject to kernel lockdown policy on UEFI Secure Boot enabled computers. What percentage of Fedora users these days are likely subject to this lockdown? Are we able to effectively support hibernation? On the one hand, Fedora does not block on hibernation bugs (kernel or firmware), thus not supported. But tacitly hibernation is supported because a bunch of users pushed an effort with Anaconda folks to make sure the swap device is set with "resume=" boot parameter with out of the box installations.
We probably should look into supporting hibernation to encrypted swap with a key tied to the TPM. That way hibernation should be fully safe.
Another complicating issue: the Workstation working group has an issue to explore better protecting user data by encrypting /home by default. Of course, user data absolutely can and does leak into swap. Therefore I think we're obligated to consider encrypting swap too. And if swap is encrypted, how does resume from hibernation work? I guess kernel+initramfs load, and plymouth asks for passphrase which unlocks encrypted swap, and the kernel knows to resume from that device-mapper device?
I am pretty sure swap encryption really should be tied to the TPM. In fact, it's one of the very few cases where tying things to the TPM exclusively really makes sense.
So far no one has prepared convincing patches to do this though. If anyone wants to look into this, I'd be happy to review a patch for systemd-cryptsetup for example.
Lennart
-- Lennart Poettering, Berlin
On Tue, Aug 20, 2019 at 2:15 AM Lennart Poettering mzerqung@0pointer.de wrote:
On Mo, 19.08.19 13:58, Chris Murphy (lists@colorremedies.com) wrote:
I'm skeptical as well. But to further explore this:
- Does the kernel know better than to write a hibernation image (all
or part) to a /dev/zram device? e.g. a system with: 8GiB RAM, 8GiB swap on ZRAM, 8GiB swap partition. We can use swap priority to use the ZRAM device first, and conventional swap partition second. If the user, today, were to hibernate, what happens?
Userspace takes care of this. It tells the kernel which swap device to hibernate to, and it nowadays understands that zswap is not a candidate, and picks the largest swap with the highest prio these days:
For what it's worth, swap on /dev/zram is a totally different thing than zswap.
/dev/zram is just a compressed RAM disk. You can configure a size, but it only consumes memory as it actually gets used, dynamic allocation. This can be used for swap standalone, no conventional disk based swap partition is needed. But if there is one, and it's set to a lower priority than swap on /dev/zram, then it has the effect of spilling over (but spill over is uncompressed).
zswap basically always compresses all of swap, with a predefined size memory pool "cache", and requires a conventional disk based swap partition as the spill over. Spill over is also compressed.
They superficially sound very similar but the strategies differ in the details. I've been using both strategies (separately), but have the most experience with zswap, even though above I was referring to swap on a ZRAM device. I know, so many Z's. But the gist is, I can't really discern any differences from a user point of view.
zswap needs just a few kernel parameters to set it up. Whereas swap on zram requires a service unit file to set up the block device, mkswap, and then swapon.
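To make that concrete, the two setups look roughly like this (sizes and compression algorithm illustrative):

# zswap: kernel command line options, plus an existing disk based swap partition
zswap.enabled=1 zswap.compressor=lz4 zswap.max_pool_percent=20

# swap on zram: what the service unit effectively does
$ sudo modprobe zram
$ echo lz4 | sudo tee /sys/block/zram0/comp_algorithm    # must be set before disksize
$ echo 8G | sudo tee /sys/block/zram0/disksize
$ sudo mkswap /dev/zram0
$ sudo swapon -p 100 /dev/zram0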
The swap on ZRAM thing is further complicated by multiple implementations, and the preferred systemd zram-generator is apparently broken. https://github.com/systemd/zram-generator/issues/4
IoT folks are using swap on ZRAM now, via the Fedora zram package (systemd unit file to set everything up). Anaconda folks have their own built-in swap on ZRAM setup that runs on low memory systems when anaconda is launched. This happens on both Fedora netinstalls and LiveOS. And it makes sense for those use cases where a disk based swap partition doesn't exist, and maybe shouldn't.
Whereas for servers and workstations, zswap is well suited, as they're perhaps more likely to have a conventional swap partition and have use cases where spillover is likely.
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Docume...
And
https://www.mjmwired.net/kernel/Documentation/blockdev/zram.txt
So why not zswap? Well, kernel documentation still shows it as experimental, but upstream considers it stable enough for production use with the zbud allocator now, and with the z3fold allocator by the end of the summer, they think. https://bugzilla.kernel.org/show_bug.cgi?id=204563#c6
*shrug*
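For anyone experimenting, the allocator is a zswap module parameter, switchable at runtime or via zswap.zpool= on the kernel command line:

$ cat /sys/module/zswap/parameters/zpool
$ echo z3fold | sudo tee /sys/module/zswap/parameters/zpool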
On Mon, Aug 12, 2019 at 5:47 PM Emery Berger emery.berger@gmail.com wrote:
For what it's worth, my research group attacked basically exactly this problem some time ago. We built a modified Linux kernel that we called Redline that was utterly resilient to fork bombs, malloc bombs, and so on. No process could take down the system, much less unprivileged ones. I think some of the ideas we described back then would be worth adopting / adapting today (the code is of course hopelessly out of date: we published our paper on this at OSDI 2008).
I'm unable to find concurring or dissenting opinions on this. What kind of peer review has it received? Was it ever raised with upstream kernel developers? What were their responses?
I wonder if the question of interactivity is still just not a priority upstream: they see various competing user space solutions for this problem, which suggests a generic solution is either not practical to incorporate into the kernel, or maybe just isn't desired?
Chris Murphy lists@colorremedies.com writes:
On Mon, Aug 12, 2019 at 5:47 PM Emery Berger emery.berger@gmail.com wrote:
For what it's worth, my research group attacked basically exactly this problem some time ago. We built a modified Linux kernel that we called Redline that was utterly resilient to fork bombs, malloc bombs, and so on. No process could take down the system, much less unprivileged ones. I think some of the ideas we described back then would be worth adopting / adapting today (the code is of course hopelessly out of date: we published our paper on this at OSDI 2008).
I'm unable to find concurring or dissenting opinions on this. What kind of peer review has it received? Was it ever raised with upstream kernel developers? What were their responses?
I have only read parts of the Redline paper, so I do not know if anyone ever tried to submit it upstream.
Judging from the Redline webpage (https://emeryberger.com/research/redline/), it appears to have only ever been implemented on i386 and nowhere else (albeit that shouldn't be hard to fix). Furthermore it does not support NUMA, which might be a bigger blocker.
My guess is that Redline might clash with upstream Linux's general idea of how processes should be scheduled. Redline solves the problem of keeping interactive applications interactive, even under severe memory pressure, by changing the way they are scheduled, allocated memory, and how much data they are allowed to read from disks. If an application is classified as interactive (in contrast to best-effort tasks, which correspond to processes in the current Linux kernel), then it will get a requested amount of CPU time every x ms (e.g. to be able to run at 25 fps). Something comparable is done with memory and disk usage.
This is a pretty nice approach in my opinion but it has certain downsides:
- scheduling gets more complicated
- you need additional system calls to tell the kernel which processes are interactive (otherwise they are treated the "old" way and you gain nothing)
- you need a userspace component that has a database of interactive tasks (with a small set of configs, e.g. how often does your process need a chunk of the CPU time)
It could be that the kernel community would perceive that as a blocker and would instead prefer a different and more generic solution (this is just my personal guess). It could also very well be that no one had time to actually upstream this, as it was an academic project (no offense intended, I've been in academia myself and know how things go).
Unfortunately, Redline was developed more than a decade ago, so upstreaming it nowadays is probably equivalent to a full rewrite, given the kernel's development pace.
Cheers,
Dan
On Mo, 12.08.19 09:40, Chris Murphy (lists@colorremedies.com) wrote:
Ideally, GNOME would run all its apps as systemd --user services. We could then set DefaultMemoryHigh= globally for the systemd --user instance to some percentage value (which is taken relative to the physical RAM size). This would then mean every user app individually could use — let's say — 75% of the physical RAM size and when it wants more it would be penalized during reclaim compared to apps using less.
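Even today something along these lines can be sketched per user session with a slice drop-in, since MemoryHigh= accepts a percentage taken relative to physical RAM (path and value illustrative; this caps the whole session rather than each app):

# /etc/systemd/system/user-.slice.d/50-memory.conf
[Slice]
MemoryHigh=75%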
If GNOME ran all apps as user services, we could do various other nice things too. For example, it could dynamically assign the fg app more CPU/IO weight than the bg apps, if the system is starved of both.
Running each app as a systemd --user service is something we've been trying to encourage teams to do at FB. It lets us monitor things much better using the cgroup control files.
In addition, it lets us configure oomd ( https://github.com/facebookincubator/oomd ) to do much more intelligent things than kill the entire session. oomd is being proposed as a fedora package right now. I think the last missing piece for oomd to be really useful on desktop systems is the --user slice changes.
Our team at FB is working on a similar (but more generic) solution. All of our work is open source / upstreamed into the linux kernel and we're running it in production on quite a large scale already. Results are very promising. We'll be presenting about it at All Systems Go (multiple talks) this year.
We'd love to chat in-person if anyone is interested.