On Mon, 2012-10-22 at 09:06 +0100, John wrote:
On 22/10/12 03:06, Michael H. Warfield wrote:
On Mon, 2012-10-22 at 02:53 +0200, Kay Sievers wrote:
On Sun, Oct 21, 2012 at 11:25 PM, Michael H. Warfield mhw@wittsend.com wrote:
This is being directed to the systemd-devel community but I'm cc'ing the lxc-users community and the Fedora community on this for their input as well. I know it's not always good to cross post between multiple lists but this is of interest to all three communities who may have valuable input.
I'm new to this particular list, just having joined after tracking a problem down to some systemd internals...
Several people over the last year or two on the lxc-users list have been discussions trying to run certain distros (notably Fedora 16 and above, recent Arch Linux and possibly others) in LXC containers, virualizing entire servers this way. This is very similar to Virtuoso / OpenVZ only it's using the native Linux cgroups for the containers (primary reason I dumped OpenVZ was to avoid their custom patched kernels). These recent distros have switched to systemd for the main init process and this has proven to be disastrous for those of us using LXC and trying to install or update our containers.
To put it bluntly, it doesn't work and causes all sorts of problems on the host.
To summarize the problem... The LXC startup binary sets up various things for /dev and /dev/pts for the container to run properly and this works perfectly fine for SystemV start-up scripts and/or Upstart. Unfortunately, systemd has mounts of devtmpfs on /dev and devpts on /dev/pts which then break things horribly. This is because the kernel currently lacks namespaces for devices and won't for some time to come (in design). When devtmpfs gets mounted over top of /dev in the container, it then hijacks the hosts console tty and several other devices which had been set up through bind mounts by LXC and should have been LEFT ALONE.
Yes! I recognize that this problem with devtmpfs and lack of namespaces is a potential security problem anyways that could (and does) cause serious container-to-host problems. We're just not going to get that fixed right away in the linux cgroups and namespaces.
How do we work around this problem in systemd where it has hard coded mounts in the binary that we can't override or configure? Or is it there and I'm just missing it trying to examine the sources? That's how I found where the problem lay.
As a first step, this probably explains most of it: http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface
A very long ways, yeah. That looks like it could be just what we've been looking for. Just gotta figure out how to set that environment variable but that's up to a couple of others to comment on in the lxc-users list. Then we'll see where we go from there.
Many thanks!
Kay
Regards, Mike
I've just performed a very quick check on my Arch Linux system here.
on host (running systemd): # cat /proc/1/environ TERM=linuxRD_TIMESTAMP=
In a container (running sysvinit): # cat /proc/1/environ STY=623.systemd-lithiumTERM=screenTERMCAP=SC|screen|VT 100/ANSI X3.64 virtual terminal:\ :DO=\E[%dB:LE=\E[%dD:RI=\E[%dC:UP=\E[%dA:bs:bt=\E[Z:\ :cd=\E[J:ce=\E[K:cl=\E[H\E[J:cm=\E[%i%d;%dH:ct=\E[3g:\ :do=^J:nd=\E[C:pt:rc=\E8:rs=\Ec:sc=\E7:st=\EH:up=\EM:\ :le=^H:bl=^G:cr=^M:it#8:ho=\E[H:nw=\EE:ta=^I:is=\E)0:\ :li#24:co#80:am:xn:xv:LP:sr=\EM:al=\E[L:AL=\E[%dL:\ :cs=\E[%i%d;%dr:dl=\E[M:DL=\E[%dM:dc=\E[P:DC=\E[%dP:\ :im=\E[4h:ei=\E[4l:mi:IC=\E[%d@:ks=\E[?1h\E=:\ :ke=\E[?1l\E>:vi=\E[?25l:ve=\E[34h\E[?25h:vs=\E[34l:\ :ti=\E[?1049h:te=\E[?1049l:k0=\E[10~:k1=\EOP:k2=\EOQ:\ :k3=\EOR:k4=\EOS:k5=\E[15~:k6=\E[17~:k7=\E[18~:\ :k8=\E[19~:k9=\E[20~:k;=\E[21~:F1=\E[23~:F2=\E[24~:\ :kh=\E[1~:@1=\E[1~:kH=\E[4~:@7=\E[4~:kN=\E[6~:kP=\E[5~:\ :kI=\E[2~:kD=\E[3~:ku=\EOA:kd=\EOB:kr=\EOC:kl=\EOD:WINDOW=0SHELL=/bin/shPATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/binLANG=en_GB.UTF-8container=lxc
So it looks like that "container" environment variable is already set on PID1
Yeah, I saw that myself last night. Testing that out and it's still not working here (although it doesn't seem to be grabbing the host console now) if I use systemd but upstart fires right up and I see that container variable set. Looked like a number of mounts listed on that wiki page. Maybe something is missing. Right now it's just hanging trying to start the container and, when I subsequently try to shut the container down it results in a hung resource and it can't delete the cgroups directory because it's busy. Only thing I did was change the link to /sbin/init from upstart to systemd and it's now dead and I'll have to reboot the host to free the resource. :-P
Regards, John
Regards, Mike