This subject matches a Fedora Workstation Working Group issue of the same name [1]. This post is intended as an independent summary of the findings so far, and as a call for additional testing and discussion, in particular from subject matter experts.
Problem and thesis statement: Certain workloads, such as building WebKitGTK from source, result in heavy swap usage, eventually leading to the system becoming totally unresponsive. The proposal is to look into switching from disk-based swap to swap on a zram device.
Summary of findings (restated, but basically the same as found at [2]):
Test system: MacBook Pro, Intel Core i7-2820QM (4/8 cores), 8GiB RAM, Samsung SSD 840 EVO, Fedora Rawhide Workstation.
Test case: build WebKitGTK from source.
$ cmake -DPORT=GTK -DCMAKE_BUILD_TYPE=RelWithDebInfo -GNinja
$ ninja
Case 1: 8GiB swap on SSD plain partition (not encrypted, not on LVM)
Case 2: 8GiB swap on /dev/zram0
In each case, that swap device is exclusive; there are no other swap devices. Within ~30 minutes in the first case, and ~10 minutes in the second case, the GUI is completely unresponsive: the mouse pointer has frozen and doesn't recover after more than 30 minutes of waiting. Over remote ssh, the first case is semi-responsive: updates should arrive every 5 seconds but are instead received every 2-5 minutes, and it wasn't possible to compel recovery by cancelling the build process, even after another 30 minutes. Over remote ssh, the second case is totally unresponsive, with no updates for 30 minutes.
The system was manually forced to power off at that point, in both cases. The oom killer never triggered.
NOTE: ninja, by default on this system, sets N concurrent jobs to nrcpus + 2, which is 10 on this system. If I reboot with nr_cpus=4, ninja sets N jobs to 6.
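The job counts in the note above can be reproduced with a small shell function. This is only a sketch of ninja's default heuristic as I understand it (2 jobs for 0-1 CPUs, 3 jobs for 2 CPUs, otherwise CPUs + 2); the function name `ninja_default_jobs` is made up for illustration and is not part of ninja.

```shell
# Hypothetical helper: estimate the -j value ninja picks when none is given.
# ASSUMPTION: ninja uses 2 jobs for 0-1 CPUs, 3 for 2 CPUs, otherwise CPUs + 2.
ninja_default_jobs() {
    # Use the first argument if given, else the number of online CPUs.
    cpus="${1:-$(nproc)}"
    case "$cpus" in
        0|1) echo 2 ;;
        2)   echo 3 ;;
        *)   echo $(( cpus + 2 )) ;;
    esac
}

ninja_default_jobs 8   # 8 logical CPUs, as on the test system
ninja_default_jobs 4   # after booting with nr_cpus=4
```

Called with no argument, the function falls back to `nproc`, so it reports what an unconfigured `ninja` would likely use on the current machine.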
Case 3: 2GiB swap on /dev/zram0
In one test this resulted in a system hang (no pointer movement) within 5 minutes of executing ninja, and within another 6 minutes the oom killer was invoked on a cc1plus process, which is fatal to the build. The remaining build-related processes quit on their own, and the system eventually recovered.
But in two subsequent tests in this same configuration, the oom killer wasn't invoked, and the system alternated between responsive for ~1 minute and totally frozen for 5-6 minutes, in a cycle lasting beyond 1 hour without ever triggering the oom killer.
Screenshot taken during one of the moments the remote ssh session updated:
https://drive.google.com/open?id=1IDboR1fzP4onu_tzyZxsx7M5cT_RJ7Iz
The state had not changed 45 minutes after the above screenshot, so I forced power off on that system. The point here is that this slightly different configuration has some non-determinism to it, even though in the end it's still a bad UX. The default, unprivileged build command is effectively taking down the system all the same.
Case 4: 8GiB swap on SSD plain partition, `ninja -j 4`
This is the same setup as Case 1, except I manually set N jobs to 4. The build succeeds and, except for a few mouse pointer stutters, the system remains responsive, even with Firefox running with multiple tabs open and a YouTube video playing. This is exactly the experience we'd like to see, albeit not all CPU resources are used for the build. Clearly the limiting factor is that this particular package requires more than ~14GiB to build successfully, and the system + shell + Firefox just doesn't have that.
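Case 4 suggests the job count should be bounded by memory as well as by CPU count. A minimal sketch of that idea follows; the helper `jobs_for_memory` is hypothetical, and the ~2GiB-per-job figure is purely an assumption for illustration, not something measured in these tests.

```shell
# Hypothetical helper: cap the job count by memory as well as by CPU count.
# Usage: jobs_for_memory <mem_kib> <per_job_kib> <cpus>
jobs_for_memory() {
    mem_kib="$1"; per_job_kib="$2"; cpus="$3"
    jobs=$(( mem_kib / per_job_kib ))
    # Never go below one job, and never above the CPU count.
    [ "$jobs" -lt 1 ] && jobs=1
    [ "$jobs" -gt "$cpus" ] && jobs="$cpus"
    echo "$jobs"
}

# MemTotal is reported in KiB in /proc/meminfo.
mem_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
# ASSUMPTION: ~2 GiB peak per compile job; a guess, not a measurement.
per_job_kib=$(( 2 * 1024 * 1024 ))
echo "ninja -j $(jobs_for_memory "$mem_kib" "$per_job_kib" "$(nproc)")"
```

The per-job figure would need to be tuned per project; the point is only that a memory-aware default would have avoided the hangs seen in Cases 1-3.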
Starter questions:
To what degree, and why, is this problem instigated by the build application (ninja in this example) or its supporting configuration files, including cmake? Or the kernel? Or the system configuration?
Is it a straightforward problem, or is this actually somewhat nuanced, with multiple components in suboptimal configuration coming together as the cause?
Is it expected that an unprivileged user can run a command whose defaults eventually lead to a totally unrecoverable system? From a security risk standpoint, the blame can't be entirely on the user or the application configuration, but how should application containment be enforced?
Other than containerizing the build programs, is there a practical way right now of enforcing CPU and memory limits on unprivileged applications? Other alternatives?
At the very least it seems like getting to the oom killer sooner would result in a better experience: fail the process before the GUI becomes unresponsive and hangs for 30+ minutes (possibly many hours).
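On the question of limits for unprivileged applications, one candidate worth testing is a transient systemd user scope with a memory ceiling. The sketch below only prints the command rather than running it. The wrapper name `limited_build_cmd` and the 6G value are made up for illustration, and it assumes cgroup v2 with the memory controller delegated to the user session; I have not verified that this prevents the hangs described above.

```shell
# Hypothetical wrapper: print a command that would run a build inside a
# transient user scope with a memory ceiling and no swap for that scope.
# ASSUMPTION: cgroup v2 with the memory controller delegated to the user
# session; without that, the limits are not enforced.
limited_build_cmd() {
    mem_max="$1"; shift
    echo systemd-run --user --scope \
        -p "MemoryMax=$mem_max" -p MemorySwapMax=0 "$@"
}

# Illustrative value only: 6G on an 8GiB machine.
limited_build_cmd 6G ninja
```

With the limit in place, the expectation is that the build's cgroup hits its own out-of-memory condition and fails, rather than pushing the whole system into swap. Whether that holds in practice is exactly the kind of testing being asked for here.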
[1] https://pagure.io/fedora-workstation/issue/98
[2] https://pagure.io/fedora-workstation/issue/98#comment-588713
Thanks,