systemd (Was Re: tmpfs for strategic directories)

Wed May 26 11:26:48 UTC 2010

On Tue, 25.05.10 23:02, Casey Dahlin (cdahlin at redhat.com) wrote:

> > Why do you say "cgroups are a dead end"? Sure, Scott claims that, but
> > uh, it's not the only place where he is simply wrong and his claims
> > baseless. In fact it works really well, and is one of the strong points
> > in systemd. I simply see no alternative for it. The points Scott raised
> > kinda showed that he never really played around with them.
> > 
> > Please read up on cgroups, before you just dismiss technology like that
> > as "dead end".
> > 
> 
> I did. When upstart was about to use them. 2 years ago. We chucked them by the
> following LPC.

Who is "we"?

> The problem we've found is that cgroups are too aggressive. They don't have a
> notion of sessions and count too much as being part of your service, so you end
> up with your screen session being counted as part of gdm.

Well, how exactly you set up the groups is up to you, but the way we do
it in systemd is to stick every service in a separate cgroup, plus every
user in a separate one, too. Some examples:

 /systemd-1/avahi.service
 /systemd-1/getty@tty1.service
 /systemd-1/gdm.service
 /systemd-1/apache.service
 /user/lennart
 /user/cjd

The per-user cgroups are controlled via a PAM module. That way there's
finally a nice way to reliably clean up after a user when he logs out:
we just kill his complete cgroup and he's gone.
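
To make the "kill the complete cgroup" idea concrete, here is a minimal
sketch, not the code systemd or the PAM module actually uses. It assumes
the cgroup is visible as a directory containing a "cgroup.procs" file
with one PID per line; the helper names and the retry count are
hypothetical.

```python
import os
import signal


def parse_pids(procs_text):
    """Turn the contents of a cgroup process list into a list of PIDs."""
    return [int(token) for token in procs_text.split()]


def kill_cgroup(cgroup_dir, sig=signal.SIGKILL, max_rounds=10):
    """Signal every process in the cgroup; return True once it is empty.

    The list is re-read on every round because members may fork between
    the read and the kill. The children stay in the same cgroup, so they
    show up on the next pass.
    """
    procs_file = os.path.join(cgroup_dir, "cgroup.procs")  # assumed file name
    for _ in range(max_rounds):
        with open(procs_file) as f:
            pids = parse_pids(f.read())
        if not pids:
            return True
        for pid in pids:
            try:
                os.kill(pid, sig)
            except ProcessLookupError:
                pass  # already gone
    return False
```

The point of the sketch is only that membership is read from the kernel,
so a process cannot leave the group simply by forking.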

In addition we can easily apply all kinds of cgroup-based resource
enforcement to these groups, e.g. force user "lennart" to CPU 1, and say
that apache and all the cgi scripts it creates can only get up to 20%
CPU. And avahi-daemon could be forced to get a quarter of the available
RAM at max -- with all its processes summed up, regardless of how often
it might fork.
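
As a rough illustration of two of these limits, the sketch below only
computes which values one might write into cgroup control files; it does
not touch the filesystem. The file names "cpuset.cpus" and
"memory.limit_in_bytes" are the ones used by the cpuset and memory
controllers of the original cgroup interface. Reusing the group paths
from the examples above under those controllers is an assumption, and
plan_limits is a hypothetical helper. The 20% CPU cap is left out because
the sketch makes no claim about which control file would express it.

```python
def plan_limits(total_ram_bytes):
    """Return {cgroup path: {control file: value}} for the example limits."""
    return {
        # Pin everything user "lennart" runs to CPU 1.
        "/user/lennart": {"cpuset.cpus": "1"},
        # Cap avahi, summed over all its processes, at a quarter of RAM.
        "/systemd-1/avahi.service": {
            "memory.limit_in_bytes": str(total_ram_bytes // 4),
        },
    }


if __name__ == "__main__":
    for group, settings in plan_limits(4 * 1024**3).items():
        for control_file, value in settings.items():
            print(group, control_file, value)
```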

And the whole thing is even recursive. If you run a per-user systemd as
user lennart, you will end up with a sub-hierarchy like this:

 /user/lennart/systemd-4711/dbus-daemon.service
 /user/lennart/systemd-4711/dconfd.service

And so on.

And the nice thing is that these cgroups are shown when you do "ps -eo
cgroup,...". You can always figure out from "ps" to which service a
process belongs, even if it fork()ed a gazillion times. And all
the keeping track is done by the kernel, basically for free. No
involvement from userspace.
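
The same information is available per process in /proc/<pid>/cgroup,
where each line has the form "hierarchy-ID:controller-list:path". The
sketch below parses that format and picks out the service name. The
sample line and the helper names are hypothetical, and treating the last
path component as the service name is an assumption based on the layout
shown above.

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup content into (id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        # The path may itself contain ':', so split at most twice.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


def service_of(text):
    """Return the last path component of the first entry, or None."""
    entries = parse_proc_cgroup(text)
    if not entries:
        return None
    return entries[0][2].rsplit("/", 1)[-1]


# Hypothetical content for one of apache's worker processes.
sample = "1:name=systemd:/systemd-1/apache.service\n"
print(service_of(sample))
```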

> This is why setsid was added to the netlink connector.

Well, this is just flawed, on so many levels, that it
hurts. Asynchronously trying to follow how daemons fork/exit is just
inherently broken because they can do an exponential amount of forks in
the time you can (realistically) collect them linearly only. Also, if a
daemon forks too often, netlink will drop your messages, which makes an
easy-peasy way for processes to escape your supervision, using only
unprivileged operations. You have constant userspace wakeups. Everything
you apply on the processes is done asynchronously and hence is racy
(killing, renice, yadda yadda). The problem is simply that your grip on
the processes can never work, because you are scheduled at the same
priority as the daemons you supervise and you get all notifications
asynchronously. You *really* want to leave process tracking to the
kernel, and not try to emulate that in userspace. Everything else is
just unsafe and hence a joke in the context of process
babysitting. It's as if you employed a babysitter and gave the kid a bike
it can escape on while chaining the babysitter to the couch.
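
The "exponential forks versus linear collection" argument can be put
into numbers with a toy model. Everything in it is an assumption for
illustration only: every live process forks once per step, and the
supervisor handles a fixed number of fork notifications per step.

```python
def backlog_after(generations, handled_per_generation):
    """Return (process count, unhandled fork events) after the given steps."""
    processes = 1
    backlog = 0
    for _ in range(generations):
        new_forks = processes      # every live process forks once
        processes += new_forks     # so the population doubles
        backlog += new_forks       # one notification per fork
        backlog -= min(backlog, handled_per_generation)
    return processes, backlog


if __name__ == "__main__":
    print(backlog_after(10, 50))
```

In this model the supervisor keeps up for the first few steps and then
falls behind for good, because the number of new events per step keeps
doubling while its capacity stays fixed.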

So, it's just not safe, processes can *easily* escape your
supervision. On top of that it is just ugly. And finally, you cannot
nicely show the service something belongs to in "ps" the way cgroups
give it to you for free. Then, you cannot set cgroup resource
enforcement for your services, since you simply have no cgroups. And no
nice interfacing with the other libcgroup tools either.

Meh, and this list goes on like this.

People should not use cn_proc. It's evil. And if you are using it for
anything except logging you are doomed. And even for logging it isn't
really useful.

> 1) Socket activation. Part of Upstart's roadmap. Would happen sooner if you
> cared to submit the patch. We don't think its good enough by itself, hence the
> rest of Upstart, but a socket activation subsystem that could reach as far as
> systemd and even work standalone in settings where systemd can work is
> perfectly within Upstart's scope. I'd be happy to firm up the design details
> with you if you wanted to contribute patches.

Well, for one, it would be nice to judge things by actually
existing features, not by big plans nobody is working on, as you
apparently admit outright.

And then, socket activation is nice for various reasons, and
lazy-loading is just one of them. The bigger advantage is that it does
automatic dependency handling -- which of course is nothing that really
fits into upstart's design, since that is based on "events" not
dependencies -- events are just broken, as I might note. And adding
dependencies would turn around upstart's design, making it a completely
different beast.

I mean, you called socket activation "xinetd-style activation" in
your earlier mail -- that is just completely beside the point,
because this all is not so much about doing on-demand starting of
internet services. systemd-style activation is about parallelizing
startup of (mostly) local services and making fully written dependencies
obsolete. And that's what is so magic about it. And an init system
should be designed around this at its core! And systemd is.
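
Why this removes the need for explicit ordering can be shown with a
small self-contained sketch; it is not systemd's implementation. The
listening socket is created first, a client connects and writes before
any service code has run, and the late-started "service" still finds the
request queued by the kernel. The socket path and the function name are
hypothetical.

```python
import os
import socket
import tempfile


def demo_socket_activation():
    """Return what a late-started 'service' reads from an early client."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sock")

        # 1. The "init system" creates the listening socket itself.
        listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        listener.bind(path)
        listener.listen(8)

        # 2. A client connects and sends data before the service exists.
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.connect(path)
        client.sendall(b"hello")

        # 3. Only now does the "service" start and take over the socket.
        conn, _ = listener.accept()
        data = conn.recv(1024)

        conn.close()
        client.close()
        listener.close()
        return data


if __name__ == "__main__":
    print(demo_socket_activation())
```

Because the client in step 2 never has to wait for the service to be up,
both can be started at the same time without declaring a dependency
between them.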

> 2) Bus activation. Missed opportunity here to actually become the launchpoint
> for activated services. I won't criticize that too much though, as its
> usefulness is largely dependent on kernelspace DBus, which I've been trying to
> bludgeon Marcel Holtmann into turning over to the public for a year
> now.

Not sure what this has to do with kernel space D-Bus, except that that
is practically dead.

If people want to reinvestigate the kernel/dbus issue they should not
focus on an AF_DBUS, but instead just use netlink and use BSD socket
filters for minimizing wakeups, plus come up with something inspired by
iptables/ebtables to filter netlink traffic, for the permission
problems. But that's a completely different story.

> 3) Cutting down on the forking by replacing some of the shell scripts... cool
>    3a) With C code... really?

Yes, really. MacOS could do it, and so can we. It's not that hard. And as
I might add here, I already hacked up a big part of it now for the
services we start by default.

> 4) Process environment control. No complaints, and also nothing Upstart doesn't
> want to do.

Well, have you seen the functionality we provide there, and the limited
stuff upstart has there?

And what about the other features? The automounter, a full-blown
dependency system, dependencies between mounts, automounts, devices,
sockets, services, timers, paths, swaps? The snapshot system? The fact
that we take advantage of the LSB stuff right now, to parallelize the
boot, without changing a single file? All the process control stuff,
e.g. the syslog/kmsg hookup of stdout/stderr of every process we spawn?
The FS namespace stuff we do? The TTY stuff we do? The IO/CPU scheduler
tricks we do? The capability stuff we do? The CPU affinity stuff we do?
The full integration of /etc/fstab and friends? The transaction logic?
The crosslinking between systemd and syslog/abrt? The almost complete
D-Bus coverage? All the small bits we already took out of the shell boot
process and already moved into proper C code in systemd (and some of it
into udevd)? The fact that systemd works for normal users as a session
supervisor too? The fact that we support different boot targets? The
fact we have a UI? The sane copyright? The fact that we don't use bloody
bzr? And so on and so forth, I could repeat the whole blog story
here. Please just read that instead.

> > We did systemd because we thought that technically Upstart is
> > fundamentally flawed and misses out on so many opportunities.
> 
> And we think the same of systemd.

We?

OK, I am listening. What's flawed about systemd? I wrote a very detailed
explanation of why Upstart is just wrong in its core design. Please be
so kind and be more specific if you claim the same about systemd, because
otherwise all you are doing is spreading FUD.

> And mail it to upstart-devel-list. I'm being a bit hyperbolic, but actually not
> much. That would have been a valid move.

Please read my blog story. Thank you. It actually answers that very
question:

"Why didn't you just add this to Upstart, why did you invent something
new?

    Well, the point of the part about Upstart above was to show that the
    core design of Upstart is flawed, in our opinion. Starting
    completely from scratch suggests itself if the existing solution
    appears flawed in its core. However, note that we took a lot of
    inspiration from Upstart's code-base otherwise."

If there is a valid reason for starting anew if something exists
already, then it is that the basic design of the existing system is just
flawed. And hence we started anew.

> You should really come work with us. We're fun guys.

Man, you are suggesting I didn't talk to the Upstart folks. That's a
ridiculous claim, and quite insulting that you imply that. Kay and I
have had long discussions with Scott, during the last three years or so,
at various conferences. And as far as I can see Upstart is Scott and
Scott is Upstart, and the bzr repo commits underline that. And I am
sorry if you weren't part of those discussions.

You should really come work with us. We're even funnier guys.

Lennart

-- 
Lennart Poettering                        Red Hat, Inc.
lennart [at] poettering [dot] net
http://0pointer.net/lennart/           GnuPG 0x1A015CC4