A few weeks ago, I received a bug report about one of our Fedora packages: a request to have its init configuration migrated to systemd. A quick search within our Fedora repo showed that systemd has been available since FC14, so it was about time we adapted our package, and we did. Service definition is simple enough and the documentation is well done; it was really easy to use systemd to start our application daemon. Service definitions lack a small piece of functionality we needed at the installation/configuration phase, but we found a workaround within systemd (which, while not perfect, works).
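For anyone who has not written one yet, a minimal sketch of what such a service definition might look like (daemon name and paths are hypothetical, not our actual package):

    # /usr/lib/systemd/system/ourdaemon.service (hypothetical)
    [Unit]
    Description=Our application daemon
    After=network.target

    [Service]
    Type=forking
    ExecStart=/usr/sbin/ourdaemon
    PIDFile=/var/run/ourdaemon.pid

    [Install]
    WantedBy=multi-user.target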
We now have our main RPM requiring a secondary sysvinit or systemd RPM, depending on the distribution flavor. Nice and easy.
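For illustration, the spec-file conditional amounts to something like this (package names hypothetical):

    %if 0%{?fedora} >= 14
    Requires: ourapp-systemd
    %else
    Requires: ourapp-sysvinit
    %endif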
Reading about systemd's features, I told myself it could be the right tool to revive an old project of mine that exploits the kernel's container features, and to get the latest Fedora (FC18) running within a container under a fresh kernel (3.9.4).
This little project gave satisfactory results with various distributions when I designed and tested it two years ago. First I checked it with a standard EL6.4 template (400 MB) under this new kernel (3.9.4, HOST EL6.4) to see whether my tool was still operational. Everything went perfectly, so I was ready to test FC18. The selected FC18 template is a very standard one (a 939 MB tgz file) which, and this is a key factor, was proven to work fully "as is" in an OpenVZ container (kernel 2.6.32-042stab076.8). "As is" means the template was never tailored for an OpenVZ container (it is used out of the box there) and could be used to seed a working HOST too.
I was not expecting it to work fully on the first attempt in my own container design, but I was expecting systemd (via systemctl's very detailed status) to give me very good insight into whatever issues occurred.
The real goal was to learn how to use systemd components to diagnose an "in trouble" real system, a kind of flight-simulator exercise, so that we would be ready in the future to make a quick diagnosis if one of our servers in a rack had trouble booting or rebooting with EL7.
If the result of this exercise was positive enough, why not try installing systemd within our current deployment, since systemd is sysvinit-compatible?
The exercise would be considered a success if I were able to log in to the FC18 container from a remote location via SSH, with the SSH port protected by the container's own iptables (a very minimal number of services started: a "safe haven" mode for recovering a system from trouble).
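The "safe haven" ruleset I had in mind is sketched below (illustrative only, not the exact rules used):

    # drop everything inbound except loopback, established traffic and SSH
    iptables -P INPUT DROP
    iptables -A INPUT -i lo -j ACCEPT
    iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
    iptables -A INPUT -p tcp --dport 22 -j ACCEPT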
This small exercise turned very ugly very quickly. I worked very hard, trying every trick and bypass I could think of to collect data. To my dismay, I was unable to get predictable behaviour, or reliable data, out of systemd, even in emergency.service mode. After a while I was forced to face it: systemd would not help me, would not even start the system in a minimal mode. I was not able to get beyond the kernel level with systemd in control; the services that started were a total mess, the container was totally locked up, and no exploitable data was provided. (Quickly: in the noisy and cold server room, the emergency.service console gave us interesting situations such as:
$ systemctl start systemd-journald.service
--> "Unable to comply!" A dependency job for systemd-journald.service failed; see journalctl -xn.
$ journalctl -xn
--> "Unable to comply!" No journal files were found.)
Let's be blunt. From what I have seen: in a perfect world, systemd is obviously a nice gadget; in the real world, systemd is the perfect tool to transform a small problem into a terminal "cascading failure" event.
I sent a private email to Lennart about my 'little concern', giving more details, trying to explain as well as I could, and suggesting solutions (mainly for brainstorming purposes).
Lennart answered quickly, and rejected my "worries" with a wave of his hand. To summarize, his answer was: "systemd can work only as PID1, you are out of spec, we do not support openvz, good luck".
Obviously, he didn't understand that I was NOT trying to run systemd on an OpenVZ kernel but on a plain 3.9.4 kernel, nor was I requesting help to get FC18 running inside the container; rather, I was pointing out systemd's difficulties in coping with "hostile conditions" while on init-process duty. Trouble is, by definition, always 'out of spec'.
The part about "systemd can only work as PID1" increased my concerns by an order of magnitude. I ended up asking myself, 'What part of this puzzle am I missing?' I dug around Google about systemd and was stunned by the results: I found my concerns had already been expressed multiple times, in more talented words than mine, as early as 2010. Since that time, my understanding is that systemd has continuously tried to resolve problems by increasing its complexity and extending its dependencies and its centrality.
This is wrong, very, very wrong. A program as complex as systemd cannot be a mandatory PID1 in an environment as open as UNIX.
We just defined a new oxymoron: "PID1 systemd".
The next paragraph of this email is dedicated to the Red Hat person reading this mailing list as part of their "technology watch" duty.
===-- It is my understanding that EL7 will include systemd as its init process. In systemd's current working state, if it is included within EL7 as a mandatory PID1, we will not deploy EL7 within our server racks. Either we will stay with EL6 or we will move to another distribution (or another OS). Adding a kernel-type program on top of the kernel just moves the big-trouble troubleshooting process from a 4-combination matrix (hardware+kernel) to an 8-combination matrix (hardware+kernel+systemd) that must be worked through before you are able to access and work on the system. Reading around via Google tells me I am not the only one contemplating this very dilemma. --===
BTW, and to go a little beyond the systemd case: since 1991, FC18 is the very first distribution I have NOT been able to install on plain hardware (not speaking about containers here; this is very plain hardware with software-RAID disks. On the same hardware, with the same configuration parameters, EL6.X, Mageia-3 and Slackware-14.0 installs are A-OK).
I am starting to wonder whether we (and this "we" includes dev contributors and myself) could be on the wrong path in the way we implement software in Fedora. To summarize: it is very easy to write code for an open platform, far more difficult to write code that keeps the platform open. (But this is another story, maybe for another time... :-}})
That's all folks... :-} Did I say "Nay" to systemd?
On 07/02/2013 04:08 PM, Jean-Marc Pigeon wrote:
I was not expecting it to work fully on the first attempt in my own container design,
Would you be willing to provide some details about your container design? Ideally including the code to allow others to reproduce the problems you saw.
Have you seen these recommendations?: http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface/
but I was expecting systemd (via systemctl's very detailed status) to give me very good insight into whatever issues occurred.
The real goal was to learn how to use systemd components to diagnose an "in trouble" real system, a kind of flight-simulator exercise, so that we would be ready in the future to make a quick diagnosis if one of our servers in a rack had trouble booting or rebooting with EL7.
Interesting exercise, but I am afraid that by running it in a custom container design, under a host that itself is not using systemd, you uncovered an entirely different class of problems than what can happen when running it on the host.
This small exercise turned very ugly very quickly. I worked very hard, trying every trick and bypass I could think of to collect data. To my dismay, I was unable to get predictable behaviour, or reliable data, out of systemd, even in emergency.service mode. After a while I was forced to face it: systemd would not help me, would not even start the system in a minimal mode. I was not able to get beyond the kernel level with systemd in control; the services that started were a total mess, the container was totally locked up, and no exploitable data was provided.
Not sure how much of it relates to container environments, but have you seen this?: http://freedesktop.org/wiki/Software/systemd/Debugging/
My first goal when debugging issues like this would be to make sure I can see the debugging output of systemd itself (i.e. with log_level set to debug and log_target set to something I can read, probably "console" in the case of a container).
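Concretely, that means booting with something like:

    # on the kernel command line (or the container's init arguments):
    systemd.log_level=debug systemd.log_target=console
    # or, to log to the kernel ring buffer instead:
    systemd.log_level=debug systemd.log_target=kmsg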
(Quickly: in the noisy and cold server room, the emergency.service console gave us interesting situations such as:
$ systemctl start systemd-journald.service
--> "Unable to comply!" A dependency job for systemd-journald.service failed; see journalctl -xn.
This is when logging to "kmsg" (the dmesg buffer) or "console" can really help find out the problem.
I ended up asking myself, 'What part of this puzzle am I missing?' I dug around Google about systemd and was stunned by the results: I found my concerns had already been expressed multiple times, in more talented words than mine, as early as 2010. Since that time, my understanding is that systemd has continuously tried to resolve problems by increasing its complexity and extending its dependencies and its centrality.
This is wrong, very, very wrong. A program as complex as systemd cannot be a mandatory PID1 in an environment as open as UNIX.
From the above paragraphs I get the feeling you may be missing the fact that not all of "systemd" runs in PID1. There are more components in the "systemd" project, such as journald, logind, and others; they run as separate processes. There is some ambiguity when talking about "systemd": sometimes it refers only to the service manager (PID1), and sometimes to the whole suite.
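You can see the split directly on any running systemd machine (illustrative, abridged output; note that ps truncates comm to 15 characters):

    $ ps -eo pid,comm | grep systemd
        1 systemd           <- the service manager, PID 1
      ... systemd-journal   <- journald, a separate process
      ... systemd-logind    <- logind, a separate process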
BTW, and to go a little beyond the systemd case: since 1991, FC18 is the very first distribution I have NOT been able to install on plain hardware
I heard F19 was released today with an improved Anaconda :-)
Michal
Thanks, Michal; your answer was really positive and encouraged me to proceed further.
So I now have FC18 running within a container under an EL6.4 HOST with kernel 3.9.4 (big smile).
Problems started to unlock themselves once I decided to bypass network.service altogether, starting the network and sshd manually (ifup lo; ifup eth0; /usr/sbin/sshd). I was now able to work in a quiet room with multiple screens available, to poke around and catch fast-scrolling log messages. (You should never forget about the poor sysadmin freezing in front of the server-room console when your software is reporting a problem and not able to run :-})
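For the record, the bypass amounts to running, from the container's console:

    ifup lo          # bring up the loopback interface
    ifup eth0        # bring up the primary interface
    /usr/sbin/sshd   # start sshd directly, outside systemd's control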
As expected, the problem came down to a very small detail (within /etc/fstab):
Not working:

    /vzgot    /          ext4    defaults  0 0
    proc      /proc      proc    defaults  0 0
    sysfs     /sys       sysfs   defaults  0 0
    devpts    /dev/pts   devpts  defaults  0 0
    tmpfs     /dev/shm   tmpfs   defaults  0 0
Working:

    #/vzgot   /          ext4    defaults  0 0
    proc      /proc      proc    defaults  0 0
    sysfs     /sys       sysfs   defaults  0 0
    devpts    /dev/pts   devpts  defaults  0 0
    tmpfs     /dev/shm   tmpfs   defaults  0 0
The fact that systemd was not able to cope with this /etc/fstab is quite acceptable (even if upstart and init have no problem with it). The fact that such small trouble drives systemd into an emergency state without reporting it clearly is another question. When the last prominent line before the maintenance-password prompt is something like "Not able to exec /bin/plymouth, <no such file>", you ask yourself what mess you are in. The fact that the line just below says "Please see journal" while the journal is not available (empty) just compounds the effect.
Once I was able to log in via remote SSH in emergency.service mode, I played with different services, trying "--ignore-dependencies", but never got a clear message about what was missing. Success was more a lucky guess than the result of a structured approach.
So, no, sorry, systemd doesn't grade as "production level" (not yet? or never?).
May I propose some ways to improve it:
- The journal should be accessible regardless of systemd's status or trouble.
- When list-dependencies output is displayed, dependencies already running (or not successfully started?) should be marked; think about the poor sysadmin!
- There should be a way to proceed in a 'step by step' boot mode (avoiding the fast-scrolling parallel report).
- On a more philosophical side:
  * Linking PID1 and systemd seems to me a problem (why it is mandatory still escapes me); you are limiting your troubleshooting context (double-check your design).
  * The fact that systemd keeps absorbing more and more functionality in order to work should trigger a loud alarm signal about your design (did I understand today's mail correctly that you can't use logrotate to expire/archive the journal... :/ )
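For what it is worth, journald does have size management of its own in journald.conf (a sketch; see journald.conf(5) for the authoritative options):

    [Journal]
    SystemMaxUse=200M    # cap the persistent journal on disk
    RuntimeMaxUse=50M    # cap the volatile journal in /run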
Bug:
- After a very quick check, there may be a bug in the way systemd handles 'int reboot(int cmd);'. I have the strong feeling systemd is not feeding WTERMSIG(status), but this is very preliminary; I could be wrong...
As you requested, I can provide you with "vzgot", my container application (which flavor/distribution RPM do you want? The src.rpm is available too). While not a fork of LXC, I think vzgot is very close to LXC in the way the container is started; the difference is more about container definition. With vzgot, you just need DNS resolution (for the container's IPs) and a config_list linking a container name to a distribution name, a template name and an architecture. With that data, vzgot is able to create a running container by itself. I tried to make the container setup as lean, simple and flexible as possible.
I had put that project into sleep mode because a problem I reported 3 years ago (a syslog+printk cross-leakage between the HOST and containers) seemed very difficult to address within the kernel. But!... very good news yesterday: the problem is fixed in kernel 3.10.0. Maybe it is time to work on vzgot again?
Quoting Michal Schmidt mschmidt@redhat.com:
On 07/02/2013 04:08 PM, Jean-Marc Pigeon wrote:
I was not expecting it to work fully on the first attempt in my own container design,
Would you be willing to provide some details about your container design? Ideally including the code to allow others to reproduce the problems you saw.
Have you seen these recommendations?: http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface/
but I was expecting systemd (via systemctl's very detailed status) to give me very good insight into whatever issues occurred.
The real goal was to learn how to use systemd components to diagnose an "in trouble" real system, a kind of flight-simulator exercise, so that we would be ready in the future to make a quick diagnosis if one of our servers in a rack had trouble booting or rebooting with EL7.
Interesting exercise, but I am afraid that by running it in a custom container design, under a host that itself is not using systemd, you uncovered an entirely different class of problems than what can happen when running it on the host.
This small exercise turned very ugly very quickly. I worked very hard, trying every trick and bypass I could think of to collect data. To my dismay, I was unable to get predictable behaviour, or reliable data, out of systemd, even in emergency.service mode. After a while I was forced to face it: systemd would not help me, would not even start the system in a minimal mode. I was not able to get beyond the kernel level with systemd in control; the services that started were a total mess, the container was totally locked up, and no exploitable data was provided.
Not sure how much of it relates to container environments, but have you seen this?: http://freedesktop.org/wiki/Software/systemd/Debugging/
My first goal when debugging issues like this would be to make sure I can see the debugging output of systemd itself (i.e. with log_level set to debug and log_target set to something I can read, probably "console" in the case of a container).
(Quickly: in the noisy and cold server room, the emergency.service console gave us interesting situations such as:
$ systemctl start systemd-journald.service
--> "Unable to comply!" A dependency job for systemd-journald.service failed; see journalctl -xn.
This is when logging to "kmsg" (the dmesg buffer) or "console" can really help find out the problem.
I ended up asking myself, 'What part of this puzzle am I missing?' I dug around Google about systemd and was stunned by the results: I found my concerns had already been expressed multiple times, in more talented words than mine, as early as 2010. Since that time, my understanding is that systemd has continuously tried to resolve problems by increasing its complexity and extending its dependencies and its centrality.
This is wrong, very, very wrong. A program as complex as systemd cannot be a mandatory PID1 in an environment as open as UNIX.
From the above paragraphs I get the feeling you may be missing the fact that not all of "systemd" runs in PID1. There are more components in the "systemd" project, such as journald, logind, and others; they run as separate processes. There is some ambiguity when talking about "systemd": sometimes it refers only to the service manager (PID1), and sometimes to the whole suite.
BTW, and to go a little beyond the systemd case: since 1991, FC18 is the very first distribution I have NOT been able to install on plain hardware
I heard F19 was released today with an improved Anaconda :-)
Michal
On Tue, 02.07.13 16:57, Jean-Marc Pigeon (jmp@safe.ca) wrote:
As expected, the problem came down to a very small detail (within /etc/fstab):
Not working:

    /vzgot    /          ext4    defaults  0 0
    proc      /proc      proc    defaults  0 0
    sysfs     /sys       sysfs   defaults  0 0
    devpts    /dev/pts   devpts  defaults  0 0
    tmpfs     /dev/shm   tmpfs   defaults  0 0
Working:

    #/vzgot   /          ext4    defaults  0 0
    proc      /proc      proc    defaults  0 0
    sysfs     /sys       sysfs   defaults  0 0
    devpts    /dev/pts   devpts  defaults  0 0
    tmpfs     /dev/shm   tmpfs   defaults  0 0
Note that in a systemd world fstab shouldn't really list any of the virtual file systems like procfs, sysfs, devpts, /dev/shm, unless you have specific mount options that need to override the defaults. Also, the root file system doesn't need to be listed. It is hence a good idea to leave fstab out entirely if you run things in a container.
The fact that the line just below says "Please see journal" while the journal is not available (empty) just compounds the effect.
How did you access the journal? The journal is actually available pretty much all the time. It logs to /run as long as /var is not there, to make this work (very much unlike classic syslog, btw).
So, no, sorry, systemd doesn't grade as "production level" (not yet? or never?).
Well, as mentioned you altered the most low-level parts of the unit dep tree. So yeah, a setup like that certainly is not "production level", but that's hardly our fault.
May I propose some ways to improve it:
- The journal should be accessible regardless of systemd's status or trouble.
It is. journalctl directly accesses all journal files and starts very early in the boot process, including in the initrd (hint: this is *much* earlier than classic syslog). And for the time even before the journal is around, we log to kmsg (i.e. dmesg).
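Concretely, the volatile journal can be read directly even when journald itself is in trouble:

    # read the early-boot journal stored under /run (before /var is up)
    journalctl -D /run/log/journal
    # from the host, point journalctl at the container's journal directory
    journalctl -D /path/to/container/rootfs/run/log/journal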
- There should be a way to proceed in a 'step by step' boot mode (avoiding the fast-scrolling parallel report).
systemd.confirm_spawn=yes
But disabling the parallelization doesn't really work. If a service foo triggers starting of a service bar while it is starting up, and needs an answer from bar before it can proceed, how do you want to ever solve this? You need to start both foo and bar at the same time.
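A sketch of the kind of situation meant here (unit names hypothetical): foo connects to bar's socket while foo itself is still starting, so the two startups must overlap.

    # bar.socket -- the activation point for bar.service
    [Socket]
    ListenStream=/run/bar.sock

    # foo.service -- talks to /run/bar.sock during its own startup;
    # the first connection is what triggers bar.service to start
    [Unit]
    Requires=bar.socket
    After=bar.socket

    [Service]
    ExecStart=/usr/bin/foo --backend /run/bar.sock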
- On a more philosophical side:
  * Linking PID1 and systemd seems to me a problem (why it is mandatory still escapes me),
systemd is an init system. Init systems run as PID 1. This is how Unix works.
Bug:
- After a very quick check, there may be a bug in the way systemd handles 'int reboot(int cmd);'. I have the strong feeling systemd is not feeding WTERMSIG(status), but this is very preliminary; I could be wrong...
Hmm? I cannot parse this.
Lennart
Quoting Lennart Poettering mzerqung@0pointer.de:
On Tue, 02.07.13 16:57, Jean-Marc Pigeon (jmp@safe.ca) wrote:
As expected the problem stand on a very small detail (within /etc/fstab)
Note that in a systemd world fstab shouldn't really list any of the virtual file systems like procfs, sysfs, devpts, /dev/shm, unless you have specific mount options that need to override the defaults. Also, the root file system doesn't need to be listed. It is hence a good idea to leave fstab out entirely if you run things in a container.
I beg to disagree. As I want to keep the container as close as possible to a pristinely installed distribution, and as /etc/fstab is part of it once installed, fstab must be present in the container.
The fact that the line just below says "Please see journal" while the journal is not available (empty) just compounds the effect.
How did you access the journal? The journal is actually available pretty much all the time. It logs to /run as long as /var is not there, to make this work (very much unlike classic syslog, btw).
I am just reporting an observation: there are conditions where you get "Please see journal" but the journal is empty.
So, no, sorry, systemd doesn't grade as "production level" (not yet? or never?).
Well, as mentioned you altered the most low-level parts of the unit dep tree. So yeah, a setup like that certainly is not "production level", but that's hardly our fault.
This issue was already covered in a previous email.
May I propose some ways to improve it:
- The journal should be accessible regardless of systemd's status or trouble.
It is. journalctl directly accesses all journal files and starts very early in the boot process, including in the initrd (hint: this is *much* earlier than classic syslog). And for the time even before the journal is around, we log to kmsg (i.e. dmesg).
- There should be a way to proceed in a 'step by step' boot mode (avoiding the fast-scrolling parallel report).
systemd.confirm_spawn=yes
I hadn't noticed this; I am not sure it is what I think we need. I need to check.
But disabling the parallelization doesn't really work. If a service foo triggers starting of a service bar while it is starting up, and needs an answer from bar before it can proceed, how do you want to ever solve this? You need to start both foo and bar at the same time.
You need a step-by-step mode to sort out problems: instead of flushing all the data to the console, it gives the sysadmin time to see what is happening (very good when the problem occurs very early in the process). In step-by-step mode, I would LOCK in a situation such as "bar needs foo AND foo needs bar"; that seems to me the definition of a deadly embrace, and prone to subtle "timing issue" problems. systemd should detect such a situation and complain about it. STEP-by-STEP mode is obviously a debug mode.
- On a more philosophical side:
  * Linking PID1 and systemd seems to me a problem (why it is mandatory still escapes me),
systemd is an init system. Init systems run as PID 1. This is how Unix works.
Yes and I agree, But according my understanding of systemd, many function done by systemd do not need to be PID1. In fact complexity and smart action could be moved away from PID1 process keeping PID1 part, lean and simple. Such way, you could start systemd "smart and interesting" part from a shell script (which open a lot flexibility an robustness).
Bug:
- After a very quick check, there may be a bug in the way systemd handles 'int reboot(int cmd);'. I have the strong feeling systemd is not feeding WTERMSIG(status), but this is very preliminary; I could be wrong...
Hmm? I cannot parse this.
OK... when a reboot is issued by the admin within the container, there is a way to advise the container supervisor (the process above systemd): it is done by reporting a signal status. 'Plain init' and 'upstart' do that properly; it seems to me systemd does not... I didn't double-check this, so it is a very preliminary report. (I hope this time you can parse my explanation.)
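For what it is worth, a minimal sketch of the supervisor side in C (not vzgot's actual code), assuming the PID-namespace reboot semantics the kernel provides since 3.4: when the container's init calls reboot(), that init is terminated, and the supervisor's wait status carries SIGHUP for a restart or SIGINT for halt/poweroff:

    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    /* wait for the container's init (child_pid) and decode why it died */
    static void supervise(pid_t child_pid) {
        int status;
        if (waitpid(child_pid, &status, 0) < 0) {
            perror("waitpid");
            return;
        }
        if (WIFSIGNALED(status) && WTERMSIG(status) == SIGHUP)
            printf("container requested a reboot\n");
        else if (WIFSIGNALED(status) && WTERMSIG(status) == SIGINT)
            printf("container requested halt/poweroff\n");
        else if (WIFEXITED(status))
            printf("container init exited with status %d\n", WEXITSTATUS(status));
    }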
Many thanks, Lennart.
Lennart
-- Lennart Poettering - Red Hat, Inc.
On Tue, 02.07.13 10:08, Jean-Marc Pigeon (jmp@safe.ca) wrote:
This little project gave satisfactory results with various distributions when I designed and tested it two years ago. First I checked it with a standard EL6.4 template (400 MB) under this new kernel (3.9.4, HOST EL6.4) to see whether my tool was still operational. Everything went perfectly, so I was ready to test FC18. The selected FC18 template is a very standard one (a 939 MB tgz file) which, and this is a key factor, was proven to work fully "as is" in an OpenVZ container (kernel 2.6.32-042stab076.8). "As is" means the template was never tailored for an OpenVZ container (it is used out of the box there) and could be used to seed a working HOST too.
So, as you mentioned earlier, you used this recipe to set up your image:
https://gist.github.com/fabaff/5512671
This recipe is very broken (which I already told you), and by using this, you create an OS image that changes a number of early boot things in a way that will break things if you then try to boot the same image on bare metal.
The things it changes are early-boot things, very low-level stuff. You interfered with much of the most basic OS initialization stuff there (masking sysinit.target!), and if you do this then you really need to know what you are doing. And you should not expect that the same image will then continue to boot on normal hardware.
I understand you are new to systemd, but you chose to alter the lowest-level bits of the OS, and were subsequently lost. This is certainly very understandable, but please accept that this is not our primary use case. We assume that if you touch that kind of low-level stuff, and alter the early-boot dep tree, then you know how to help yourself. The more low-level you go, the more expertise you need.
We are working on making systemd work out-of-the-box on container managers. libvirt-lxc and systemd-nspawn are relatively nicely integrated with systemd so that things just work without any manual recipe. OpenVZ is not something we test against, and we certainly do not test systems that have been modified to work with OpenVZ but then are attempted to be booted on bare-metal hardware again.
From the systemd side, it is our goal to ensure that systemd will work fine on bare metal, inside a VM and in a container, with the exact same image, without any alteration. With nspawn/libvirt-lxc we are very close to that. However, all that work will be incomplete unless Fedora as a whole starts to care about this (for example, by giving the various container managers a role like a "release architecture", to ensure they just work with unmodified future Fedora releases), and unless the various container managers actually start to implement the most basic common interfaces like the ones we documented:
http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface/
So, please work with Fedora and with the container vendor of your choice to make all this stuff just work out-of-the-box. And please don't use recipes like the one you linked; they made things worse, not better. You don't need any recipes at all if your container manager just implements the ContainerInterface mini-spec...
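For reference, the most basic part of that mini-spec is that the container manager identifies itself to the container's PID 1 through the $container environment variable, roughly like this (manager name illustrative):

    # when spawning the container's init:
    exec env container=vzgot /usr/lib/systemd/systemd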
Being able to migrate OS images between VMs, bare metal, containers, in all directions without alteration is absolutely a worthy goal, and not far off, if people actually start caring.
Lennart
The FC18 template is a very standard one (a 939 MB tgz file) which, and this is a key factor, was proven to work fully "as is" in an OpenVZ container (kernel 2.6.32-042stab076.8). "As is" means the template was never tailored for an OpenVZ container (it is used out of the box there) and could be used to seed a working HOST too.
So, as you mentioned earlier, you used this recipe to set up your image:
How many times need I say that I did NOT use that solution (it was a bad solution)? I sent you that reference to tell you that others must certainly have had the problem too, and found what I think is an unsatisfactory solution... Here is the exact wording I used in my personal email to you about my trouble:
"""" 25/06/2013 Via google, I was able to find a way to have fc18 working under LXC. The proposed way was to eviscerate many services unit from systemd (ln -s /dev/null /etc/systemd/system/udev-settle.service, etc..), see https://gist.github.com/fabaff/5512671. Applying such dramatique measure make fc18 container indeed working, but such solution are not good for both systemd and LXC and prove I am not the only one not able to figure out systemd setup needs. """"
It seems at least we agree it was not a good solution. I want to keep the release as pristine as possible within the container (mine is of the IaaS kind; see yesterday's email from James Bottomley about the different meanings of the word 'container').
Being able to migrate OS images between VMs, bare metal, containers, in all directions without alteration is absolutely a worthy goal, and not far off, if people actually start caring.
I fully agree. The problem is not there.
Lennart
-- Lennart Poettering - Red Hat, Inc.