https://bugzilla.redhat.com/show_bug.cgi?id=1069718
Bug ID: 1069718
Summary: Kernel container scalability issue in docker - VFS and mount path consuming too much CPU
Product: Red Hat Enterprise Linux 7
Version: 7.0
Component: systemd
Severity: urgent
Priority: urgent
Assignee: systemd-maint@redhat.com
Reporter: sct@redhat.com
QA Contact: qe-baseos-daemons@redhat.com
CC: agk@redhat.com, alexl@redhat.com, fs-maint@redhat.com, fweimer@redhat.com, golang@lists.fedoraproject.org, ikent@redhat.com, jeder@redhat.com, jpoimboe@redhat.com, lsm5@redhat.com, lvm-team@redhat.com, mattdm@redhat.com, mgoldman@redhat.com, mjenner@redhat.com, perfbz@redhat.com, rwheeler@redhat.com, skottler@redhat.com, vbatts@redhat.com
Depends On: 1061359, 1064929
Group: private
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1061359
[Bug 1061359] improve docker devicemapper backend scalability
https://bugzilla.redhat.com/show_bug.cgi?id=1064929
[Bug 1064929] Kernel container scalability issue in docker - VFS and mount path consuming too much CPU
--- Comment #1 from Stephen Tweedie sct@redhat.com ---
Simple test, replicating the results seen with docker but just using simple regular file bind mounts.

Using private mounts (because we're already looking into the kernel scaling problems with rshared):
# mount --make-rprivate /
# mkdir -p /tmp/mnt
# cd /tmp/mnt
# touch foo
# time for f in `seq 1 4000`; do echo $f; touch $f; mount --bind foo $f; done
...
real 0m28.960s
user 0m1.733s
sys 0m5.690s
and now waiting for dbus to settle:

# ps augx --sort=cputime | tail -3
root   654  7.0  0.0  34680   1656 ? Ss  10:48 0:22 /usr/lib/systemd/systemd-logind
root     1  8.7 22.4 930212 873216 ? Ss  10:48 0:27 /usr/lib/systemd/systemd
dbus   655 13.6  0.0  37072   2016 ? Ssl 10:48 0:43 /bin/dbus-daemon
So after 4000 bind mounts we're at 0.87GB RSS for systemd and 1m32s CPU time for systemd, systemd-logind and dbus combined.
Next, unmount:

# time umount *
...
real 1m26.550s
user 0m28.378s
sys 0m25.493s
and once dbus activity settles again:

# ps augx --sort=cputime | tail -3
root     1 12.4 27.9 1178420 1084420 ? Ss  10:48 1:12 /usr/lib/systemd/systemd
dbus   655 14.9  8.0  372220  314008 ? Ssl 10:48 1:26 /bin/dbus-daemon
(systemd-logind had despawned by the time I captured this)
so 2m38s CPU time and >1GB RSS now, even ignoring systemd-logind. And repeat the mounts...
# ps augx --sort=cputime | tail -3
root 12075 17.4  2.2  128180   89052 ? Ss  10:57 0:42 /usr/lib/systemd/systemd-logind
root     1 13.3 27.9 1178420 1084444 ? Ss  10:48 1:42 /usr/lib/systemd/systemd
dbus   655 17.5  8.0  372220  314008 ? Ssl 10:48 2:14 /bin/dbus-daemon
and unmount...
# ps augx --sort=cputime | tail -3
root 12075 19.5  2.2  128180   89052 ? Rs  10:57 1:09 /usr/lib/systemd/systemd-logind
root     1 17.2 38.2 1558316 1484712 ? Ss  10:48 2:32 /usr/lib/systemd/systemd
dbus   655 21.5  8.0  372220  314008 ? Rsl 10:48 3:09 /bin/dbus-daemon
So after just two mount/umount cycles (of a trivial bind mount), I've accumulated 6m50s of CPU time in dbus/systemd, and systemd itself has grown to an RSS of > 1.4GB. After a short while it reduced back down to ~900MB, but no lower.
This seems to be reasonably stable: after a third cycle it's again 1.4GB peak systemd RSS, and 11m48s cpu time.
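The accumulated CPU figures quoted in this comment are sums of the TIME column in the ps listings. As a quick cross-check, here is a small shell sketch. `sum_cputime` is a hypothetical helper name (not part of ps or systemd), and it assumes every value is in plain M:SS form, as in the listings here.

```shell
# Sum ps-style CPU times given as M:SS and print the total as M:SS.
# Assumption: no H:MM:SS or day-prefixed values, which ps can emit for
# long-running processes.
sum_cputime() {
    printf '%s\n' "$@" |
        awk -F: '{ total += $1 * 60 + $2 }
                 END { printf "%d:%02d\n", int(total / 60), total % 60 }'
}

# After the first round of mounts: systemd-logind + systemd + dbus-daemon
sum_cputime 0:22 0:27 0:43

# After the second umount
sum_cputime 1:09 2:32 3:09
```

The two calls print 1:32 and 6:50, which match the 1m32s and 6m50s totals given in the comment.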
Bind mounts on regular files are pretty much standard technology for doing anything stateless, and the numbers here are not particularly unrealistic; 1.4GB RSS for systemd is fairly excessive.
--Stephen
--- Comment #2 from Stephen Tweedie sct@redhat.com ---
Additional data forwarded from Al Viro:
On Mon, Feb 24, 2014 at 10:13:43PM +0100, Lennart Poettering wrote:
> Heya,
>
> There are some O(n^2) issues when we match mounts against each other in systemd, but so far we never had an issue with that, because the number of mounts stayed reasonably low. These are things that should be fixable though.
>
> I am not too concerned about the bus messages issue though I must say, as they only grow O(n), but they are of course ugly to see on the bus.
No, they actually grow O(n) *per* *change*. So umount `seq 5000` will end up with ~12 million messages sent. All buffered in PID 1 memory - just watch the RSS.
> The kernel API that requires to linearly reread the entire mount table each time and compare it with what was before is also a source of O(n^2) behaviour which we probably should look into first?
Nowhere near the top. And why would that be O(n^2), anyway? For pity's sake, they have unique numeric IDs, so you don't even need to sort anything.
> Before we start masking mounts for view in systemd we really should see if we can't mask them from the entire system. Because these problems won't be specific to systemd (well, the O(n^2) matching issue is), but to any software that watches mountinfo, like udisks and whatnot.
Matching is a non-issue. Here's a trivial test for you:
cd /tmp; for i in `seq 100`; do touch $i; mount --bind /proc/version $i; done
dbus-monitor --system >log &
umount `seq 100`
kill %1
now look into log. You'll see ~5000 messages in there. And yes, it is quadratic by number of mounts. s/100/1000/ in the above and you'll see about half a million messages. 10000 will almost certainly OOM the box.
Matching is actually nowhere near quadratic, unless you have a really sucky implementation of manager_get_unit(). Could've been done better (there's a reasonably dense constant integer in every line, so you could've used an array), but unless your manager_get_unit() does something on the scale of "let's hold them all in a list and walk it", you are not going to see O(n^2) there.
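The message counts quoted in this comment (about 5000 for 100 mounts, about half a million for 1000, roughly 12 million for 5000) are consistent with a simple model in which each umount triggers on the order of one bus message per mount still present, for a total of about n^2/2. The sketch below only evaluates that estimate. Both the model and the helper name `estimate_messages` are assumptions for illustration, not something measured or taken from systemd.

```shell
# Rough estimate of bus messages produced by unmounting n mounts,
# assuming about n^2/2 messages in total (illustrative model only).
estimate_messages() {
    n=$1
    echo $(( n * n / 2 ))
}

for n in 100 1000 5000; do
    printf '%5d mounts -> ~%d messages\n' "$n" "$(estimate_messages "$n")"
done
```

Under this model the three cases come out at 5000, 500000 and 12500000 messages.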
Stephen Tweedie sct@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Kernel container            |systemd container
                   |scalability issue in docker |scalability issue in docker
                   |- VFS and mount path        |- mount/umount handling
                   |consuming too much CPU      |consuming too much CPU and
                   |                            |memory
Bhavna Sarathy bsarathy@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |1069814
--- Comment #3 from Lennart Poettering lpoetter@redhat.com ---
The O(n^2) PropertiesChanged messages will not be generated anymore with systemd git:
http://cgit.freedesktop.org/systemd/systemd/commit/?id=ff5f34d08c191c326c41a...
But I am working on a couple of more fixes to follow.
And again, the kernel interface of /proc/self/mountinfo forces us to do some O(n^2) things here, so we cannot really fix this without kernel support that scales better than that: when you mount a thousand mounts, this will in the worst case generate a thousand POLLURG events to systemd, which will then in the worst case read a thousand entries from /proc/self/mountinfo each time. The kernel interface would have to tell us about individual mounts coming/going if we want this to scale better than O(n^2) in the long run...
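Comment #2 points out that every mountinfo line carries a unique numeric mount ID, so comparing two snapshots of the table needs neither sorting nor pairwise matching. Here is a minimal sketch of one such single-pass comparison. It assumes snapshots of /proc/self/mountinfo have been saved to files and that the mount ID is the first whitespace-separated field. The function names are hypothetical, and this does not describe systemd's implementation.

```shell
# Print mount IDs that appear in snapshot "$2" but not in snapshot "$1".
# Each file is read once; membership is checked through an awk
# associative array keyed by mount ID (field 1).
added_mount_ids() {
    awk 'FILENAME == ARGV[1] { seen[$1] = 1; next }
         !($1 in seen)       { print $1 }' "$1" "$2"
}

# Mounts that went away: the same comparison with the snapshots swapped.
removed_mount_ids() {
    added_mount_ids "$2" "$1"
}
```

For example, copy /proc/self/mountinfo to a file before and after a batch of bind mounts and pass the two files to `added_mount_ids`. This keeps each reread linear in the table size, although it does not reduce the number of rereads the kernel interface triggers.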
--- Comment #4 from Stephen Tweedie sct@redhat.com ---
(In reply to Lennart Poettering from comment #3)
> The O(n^2) PropertiesChanged messages will not be generated anymore with systemd git:
>
> http://cgit.freedesktop.org/systemd/systemd/commit/?id=ff5f34d08c191c326c41a083745522383ac86cae
>
> But I am working on a couple of more fixes to follow.
Thanks --- I've built a rhel7 systemd with this plus aef831369cd2a7a1bd4a58dd96ff8628ed6a85f9, and things are *vastly* improved.
The 4000 bind mounts which used to take ~30sec and balloon systemd up to ~1GB RSS now complete in about 8sec total, systemd accumulated time remains low and its RSS is currently 36MB. dbus itself doesn't even show up any more (~1sec accumulated cpu time after several cycles of mount/umount.)
> And again, the kernel interface of /proc/self/mountinfo forces us to do some O(n^2) things here, so we cannot really fix this without kernel support that scales better than that: when you mount a thousand mounts, this will in the worst case generate a thousand POLLURG events to systemd, which will then in the worst case read a thousand entries from /proc/self/mountinfo each time. The kernel interface would have to tell us about individual mounts coming/going if we want this to scale better than O(n^2) in the long run...
Right; but the constant is low both for the read and the match, so I don't think this is a huge problem, and it's certainly not urgent. The dbus O(n^2) was the dominating factor and that seems solved here.
Harald Hoyer harald@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |harald@redhat.com

--- Comment #5 from Harald Hoyer harald@redhat.com ---
Your container PID 1 might be missing this:

http://cgit.freedesktop.org/systemd/systemd/tree/src/nspawn/nspawn.c#n2018

        /* Mark everything as slave, so that we still
         * receive mounts from the real root, but don't
         * propagate mounts to the real root. */
        if (mount(NULL, "/", NULL, MS_SLAVE|MS_REC, NULL) < 0) {
                log_error("MS_SLAVE|MS_REC failed: %m");
                goto child_fail;
        }
Lukáš Nykrýn lnykryn@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
                 CC|                            |lnykryn@redhat.com
Jeremy Eder jeder@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sct@redhat.com
              Flags|                            |needinfo?(sct@redhat.com)
Stephen Tweedie sct@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|needinfo?(sct@redhat.com)   |
Jan Ščotka jscotka@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jscotka@redhat.com,
                   |                            |systemd-maint@redhat.com
              Flags|                            |needinfo?(systemd-maint@redhat.com)
Stephen Tweedie sct@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends On|                            |1072446
Stephen Tweedie sct@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends On|                            |1072451
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1072451 [Bug 1072451] Kernel container scalability issue in docker - [2/4] Expand size of mount hash table
Stephen Tweedie sct@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends On|                            |1072457
Stephen Tweedie sct@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends On|                            |1072461
Stephen Tweedie sct@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|needinfo?(systemd-maint@redhat.com) |
Jack Rieden jrieden@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jrieden@redhat.com
Lukáš Nykrýn lnykryn@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |MODIFIED
   Fixed In Version|                            |systemd-208-7.el7
errata-xmlrpc errata-xmlrpc@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|MODIFIED                    |ON_QA
Jan Ščotka jscotka@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         QA Contact|qe-baseos-daemons@redhat.com|psklenar@redhat.com
Petr Sklenar psklenar@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |needinfo?
Lukáš Nykrýn lnykryn@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|needinfo?                   |needinfo?(sct@redhat.com)
Stephen Tweedie sct@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|needinfo?(sct@redhat.com)   |
Petr Sklenar psklenar@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ON_QA                       |VERIFIED
Ludek Smid lsmid@redhat.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|VERIFIED                    |CLOSED
         Resolution|---                         |CURRENTRELEASE
        Last Closed|                            |2014-06-13 05:50:24

--- Comment #17 from Ludek Smid lsmid@redhat.com ---
This request was resolved in Red Hat Enterprise Linux 7.0.
Contact your manager or support representative in case you have further questions about the request.
https://bugzilla.redhat.com/show_bug.cgi?id=1069718
Bug 1069718 depends on bug 1064929, which changed state.

Bug 1064929 Summary: Kernel container scalability issue in docker - VFS and mount path consuming too much CPU
https://bugzilla.redhat.com/show_bug.cgi?id=1064929

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|VERIFIED                    |CLOSED
         Resolution|---                         |CURRENTRELEASE
https://bugzilla.redhat.com/show_bug.cgi?id=1069718
Bug 1069718 depends on bug 1061359, which changed state.

Bug 1061359 Summary: improve docker devicemapper backend scalability
https://bugzilla.redhat.com/show_bug.cgi?id=1061359

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |CLOSED
         Resolution|---                         |NOTABUG