After the last systemd update, systemd appears to start jobs that depend on network availability before the network is actually up.
After booting, a crapload of services reliably fail to start: httpd, dhcpd, named, and others. All of them come up fine if I start them manually after boot completes.
Sifting through /var/log/messages, the common thread is that the network is NOT up when the jobs start:
Jul 5 07:16:41 shorty httpd: (99)Cannot assign requested address: AH00072: make_sock: could not bind to address 216.27.136.223:8443
Jul 5 07:16:41 shorty httpd: no listening sockets available, shutting down
Jul 5 07:16:41 shorty httpd: AH00015: Unable to open logs
And:
Jul 5 07:16:40 shorty dhcpd:
Jul 5 07:16:40 shorty dhcpd: No subnet declaration for lan0 (no IPv4 addresses).
Jul 5 07:16:40 shorty dhcpd: ** Ignoring requests on lan0. If this is not what
Jul 5 07:16:40 shorty dhcpd: you want, please write a subnet declaration
Jul 5 07:16:40 shorty dhcpd: in your dhcpd.conf file for the network segment
Jul 5 07:16:40 shorty dhcpd: to which interface lan0 is attached. **
Jul 5 07:16:40 shorty dhcpd:
Jul 5 07:16:40 shorty dhcpd:
Jul 5 07:16:40 shorty dhcpd: Not configured to listen on any interfaces!
Here's bind shutting down before the server reboot:
Jul 5 07:14:30 shorty named[1072]: no longer listening on 127.0.0.1#53
Jul 5 07:14:30 shorty named[1072]: no longer listening on 192.168.0.1#53
Jul 5 07:14:30 shorty named[1072]: no longer listening on 216.254.115.190#53
Note all the IP addresses that my bind is listening on. Now, here's bind again, when the system comes up after a reboot:
Jul 5 07:16:39 shorty named[1071]: loading configuration from '/etc/named.conf'
Jul 5 07:16:39 shorty named[1071]: using default UDP/IPv4 port range: [1024, 65535]
Jul 5 07:16:39 shorty named[1071]: using default UDP/IPv6 port range: [1024, 65535]
Jul 5 07:16:39 shorty named[1071]: listening on IPv4 interface lo, 127.0.0.1#53
Jul 5 07:16:39 shorty named[1071]: generating session key for dynamic DNS …
That's it. Only 127.0.0.1 was up for bind to listen on. Even lan0, 192.168.0.1, isn't up yet.
Sifting through systemd docs, it appears that services that require network availability should state an After dependency on "network-online.target". Wonderful. Can someone explain why, if that's true, only kdump.service appears to explicitly state that dependency?
[root@shorty system]# fgrep -- network-online.target *.target
[root@shorty system]# fgrep -- network-online.target *.service
kdump.service:After=network.target network-online.target remote-fs.target
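The same check can be turned around to list which service files order themselves after network.target without ever mentioning network-online.target. What follows is only a minimal sketch: the function name and the default directory are assumptions made for illustration, and it looks at the unit files as plain text rather than asking systemd what it actually resolved.

```shell
#!/bin/sh
# Sketch: list *.service files in a directory that order themselves after
# network.target but never mention network-online.target.
# The function name and the default directory are illustrative assumptions.
units_missing_network_online() {
    dir=${1:-/lib/systemd/system}
    for unit in "$dir"/*.service; do
        # Skip the literal pattern if the glob matched nothing.
        [ -f "$unit" ] || continue
        # Keep only files with an After= line naming network.target as a whole word...
        grep -Eq '^After=(.*[[:space:]])?network\.target([[:space:]]|$)' "$unit" || continue
        # ...and drop those that reference network-online.target anywhere.
        grep -q 'network-online\.target' "$unit" && continue
        basename "$unit"
    done
}
```

Calling `units_missing_network_online /lib/systemd/system` would then print one candidate service per line; kdump.service, as shown above, would not be among them.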
So, I see two possible explanations:
1) network-online.target was introduced in the most recent systemd update. Prior to that, its equivalent functionality was done in some other way, possibly integrated with network.target. The last update unceremoniously introduced network-online.target, with no advance notice, and without any coordination with packages that need to use it, thus breaking everything.
2) everything was always like that. Everything was always broken, but until the recent systemd update, its internal logic happened to kick things off in the right order. The last update changed some internal scheduling logic, and now everything is broken.
So, how should this mess get fixed? Start filing bugs against all these packages, requesting a change to their systemd service file, to state a dependency on network-online.target?
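For reference, such a dependency can also be declared locally with a drop-in, without waiting for each package to change its service file. The sketch below only prints the drop-in text and the conventional path for it; the function names and the file name network-online.conf are assumptions chosen for illustration. The Wants=/After= pair is what the NetworkTarget documentation suggests for services that need a configured network, and it only helps if something on the system actually orders itself before network-online.target.

```shell
#!/bin/sh
# network_online_dropin: print a drop-in that makes a unit pull in and wait
# for network-online.target. (Hypothetical helper name.)
network_online_dropin() {
    cat <<'EOF'
[Unit]
Wants=network-online.target
After=network-online.target
EOF
}

# dropin_path UNIT: print where a drop-in for UNIT would conventionally live.
# The file name "network-online.conf" is arbitrary.
dropin_path() {
    printf '/etc/systemd/system/%s.d/network-online.conf\n' "$1"
}
```

As root, something like `mkdir -p "$(dirname "$(dropin_path httpd.service)")" && network_online_dropin > "$(dropin_path httpd.service)"` followed by `systemctl daemon-reload` would apply it to httpd.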
On Sat, 05 Jul 2014 08:13:45 -0400 Sam Varshavchik wrote:
Everything was always broken
I'm pretty sure everything was always broken. I never had the combination of postfix, dovecot, and stunnel operational more than about 10% of the time with pure systemd.
I just took a more practical approach and made an rc.local script that restarted services I found not always coming up right after a short delay. This sort of thing:
/bin/bash -c 'sleep 5 ; service stunnel restart' > /dev/null 2>&1 < /dev/null &
/bin/bash -c 'sleep 7 ; service postfix restart' > /dev/null 2>&1 < /dev/null &
That at least works up to the day systemd decides no one needs rc.local and they drop support for it (a day that is sure to come :-).
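A variation on the rc.local approach is to poll for the address a service needs instead of sleeping for a fixed time. This is a rough sketch, not a tested recipe: the helper names are hypothetical, and it assumes the `ip` tool from iproute2 is installed and that its one-line output puts "inet" in the third field and the address/prefix in the fourth.

```shell
#!/bin/sh
# has_addr ADDR: read `ip -o -4 addr show` style output on stdin and succeed
# if ADDR is assigned to some interface. (Hypothetical helper name.)
has_addr() {
    awk -v want="$1" '
        $3 == "inet" { split($4, a, "/"); if (a[1] == want) found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# wait_for_addr ADDR [TRIES]: poll about once per second, up to TRIES times
# (default 30), until ADDR shows up. Returns non-zero if it never does.
wait_for_addr() {
    addr=$1
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        ip -o -4 addr show | has_addr "$addr" && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

In rc.local this could be used as `( wait_for_addr 192.168.0.1 && service stunnel restart ) > /dev/null 2>&1 < /dev/null &`, with the address replaced by whatever the service binds to.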
Tom Horsley writes:
I'm pretty sure everything was always broken. I never had the combination of postfix, dovecot, and stunnel operational more than about 10% of the time with pure systemd.
For me, everything seemed to work fairly reliably until about a week ago, which, coincidentally, is when the last systemd update got pushed out. I'm pretty sure that entropy is in full control here. For a particular system, the set of installed and enabled services determines the order in which systemd kicks things off; together with some additional entropy-driven factors, like how long each start script runs before systemd kicks off the next one, that determines whether things get started in an order that results in a working server or a hopelessly bolloxed one. Some minor change in systemd's internal logic flipped my entropy effect from "always happens to work" to "always happens not to work". Everything was fine until a week or so ago, maybe even a little longer; perhaps it was the previous, rather than the most recent, systemd update. I don't recall. Everything was hunky dory, and now:
named only listens on 127.0.0.1, at boot
innd only listens on 127.0.0.1, at boot
httpd gives up and dies
dhcpd gives up and dies (lots of fun on my LAN)
privoxy gives up and dies (more fun)
Perhaps, technically, it's all these packages' fault, for not installing the correct service file. Still, I'll point the finger at systemd. It's a direct consequence of its buzzword-compliant, but fundamentally broken design.
You could look at the old rc?.d directory, and at a glance see exactly what gets started at system boot, and in which order. Now, it's a big mystery, shrouded in a dark cloud. One of my servers does not use plymouth, and has boot messages turned on, and the number of errors spewed by systemd at boot time is quite impressive. The number of packages that have a systemd file that's broken in some way is staggering. So, is it all their fault, or systemd's?
That at least works up to the day systemd decides no one needs rc.local and they drop support for it (a day that is sure to come :-).
You can bet on it.
Actually, I think we do have a sliver of hope, now that systemd has infested RHEL. As RHEL 7 rolls out, complete with the systemd clusterfrak, it's now going to get some exposure to Red Hat's paying customers. Expect some noise to slowly increase, in volume, over time. So, as RHEL 7 roll-out progresses over time, I think that it's going to end up dooming systemd. It's just a matter of time. I'm going to wager a 100 quatloos that we'll start hearing some talk about replacing systemd with something that actually works in, maybe, 3-4 years' time.
On Sat, 05 Jul 2014 08:58:23 -0400 Sam Varshavchik wrote:
it's now going to get some exposure to Red Hat's paying customers
And since the 'E' in RHEL is "Enterprise", I suspect many of those customers are going to wonder why systemd needs to hang around and consume vast amounts of resources when they don't tote their computer racks around from one starbucks to another and don't need huge resources dedicated to dynamically responding to changes that are never gonna happen once the system is up.
On 07/05/2014 02:58 PM, Sam Varshavchik wrote: ...
Actually, I think we do have a sliver of hope, now that systemd has infested RHEL. As RHEL 7 rolls out, complete with the systemd clusterfrak, it's now going to get some exposure to Red Hat's paying customers. Expect some noise to slowly increase, in volume, over time. So, as RHEL 7 roll-out progresses over time, I think that it's going to end up dooming systemd. It's just a matter of time. I'm going to wager a 100 quatloos that we'll start hearing some talk about replacing systemd with something that actually works in, maybe, 3-4 years' time.
You certainly have a vivid imagination, capt'n James Tiberius Kirk. :)
poma
On 07/05/14 20:13, Sam Varshavchik wrote:
So, how should this mess get fixed? Start filing bugs against all these packages, requesting a change to their systemd service file, to state a dependency on network-online.target?
FWIW, I'm running a fully updated F20 system and not seeing any problems for httpd and named
Last login: Sat Jul 5 21:46:26 CST 2014 on pts/0
[root@f20f ~]# uptime
21:59:25 up 0 min, 1 user, load average: 1.33, 0.44, 0.15

[root@f20f ~]# journalctl -b -u httpd.service
-- Logs begin at Thu 2013-12-19 06:42:00 CST, end at Sat 2014-07-05 21:59:21 CST.
Jul 05 21:59:00 f20f.greshko.com systemd[1]: Starting The Apache HTTP Server...
Jul 05 21:59:02 f20f.greshko.com systemd[1]: Started The Apache HTTP Server.

[root@f20f ~]# journalctl -b -u named-chroot.service
-- Logs begin at Thu 2013-12-19 06:42:00 CST, end at Sat 2014-07-05 21:59:49 CST.
Jul 05 21:58:57 f20f.greshko.com systemd[1]: Starting Berkeley Internet Name Domai
Jul 05 21:58:59 f20f.greshko.com named-checkconf[1270]: zone 0.0.127.in-addr.arpa/
Jul 05 21:58:59 f20f.greshko.com named-checkconf[1270]: zone 1.168.192.in-addr.arp
Jul 05 21:58:59 f20f.greshko.com named-checkconf[1270]: zone 128.75.211.in-addr.ar
Jul 05 21:58:59 f20f.greshko.com named-checkconf[1270]: zone greshko.com/IN: loade
I also run with NetworkManager-wait-online.service enabled. There was a specific reason I started running with that enabled....don't remember why. But, you may want to check that.
-- If you can't laugh at yourself, others will gladly oblige.
Ed Greshko writes:
FWIW, I'm running a fully updated F20 system and not seeing any problems for httpd and named
Neither did I, until either the last, or the next to last, systemd update.
I also run with NetworkManager-wait-online.service enabled. There was a specific reason I started running with that enabled....don't remember why. But, you may want to check that.
The server with dhcp, httpd, named, and privoxy does not have NetworkManager installed. Both the WAN and the LAN ports are configured as static IPs.
The server with innd installed has NetworkManager, so I could theoretically enable it there.
http://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ documents an alternative target, systemd-networkd-wait-online.service, which does not appear to actually exist anywhere, and is not installed by any package.
The more I dig into the config files, the bigger the clusterfark this appears to be.
The starting point is the above documentation for network.target and network-online.target. The above is supposed to be the authoritative documentation, directly referenced from the man pages. Starting with that, I look at what network-online.target actually says:
[Unit]
Description=Network is Online
Documentation=man:systemd.special(7)
Documentation=http://www.freedesktop.org/wiki/Software/systemd/NetworkTarget
After=network.target
It doesn't do anything; it's just a symbolic target. That's fine, so the intent is that stuff that actually needs network connections should declare "After=network-online.target". Then, whatever system service is responsible for initializing the static network connections would declare both "After=network.target" and "Before=network-online.target", so it runs after basic networking is up. Once it succeeds in initializing the network connections, it terminates, network-online.target gets reached, and all the services that depend on established network connections can now run. That seems to be the desired strategy.
Sounds great. This is actually not such a bad plan of action. It might actually make sense, presuming that all servers that depend on established network connections would specify "After=network-online.target", and not "After=network.target", as they do now. Of course, as I discovered, only kdump.service actually does this. So, this is the first thing that goes off the rails. But the rest of the train quickly follows:
Now, given the initial design, one would automatically assume that NetworkManager-wait-online.service would follow the master plan, and specify "After=network.target" and "Before=network-online.target", putting all the jigsaw pieces in the correct order. But no, this is what NetworkManager-wait-online.service actually says:
[Unit]
Description=Network Manager Wait Online
Requisite=NetworkManager.service
After=NetworkManager.service
Wants=network.target
Before=network.target network-online.target
It specifies that it should be reached /before/ *both* network.target and network-online.target, rather than after network.target, and before network-online.target.
This really looks like somebody just said "eh, I'm just too lazy to fix all services that should really be executed after reaching network-online.target, I'm just going to fix this by executing NetworkManager-wait-online.service before network.target is reached, and before all the servers that currently require network.target get forked off".
Brilliant.
So, enabling NetworkManager-wait-online.service is required on servers that run dhcp, named, httpd, and other servers. If it's not enabled, a roll of the dice will determine whether any of them will come up properly. And I'll bet none of these RPMs enable it, which is needed for this hack to work. And, if NetworkManager is not enabled, with all network interfaces being initialized to static IPs in /etc/sysconfig/network-scripts, I don't see a way to get this right. It may or may not work, depending on the order systemd chooses to execute scripts, and how long they take. Even the kernel version could be a factor – how long the kernel takes to initialize each network interface.
And the documented alternative, "systemd-networkd-wait-online.service", is still nowhere to be found. yum whatprovides comes up empty.
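For a static-IP server without NetworkManager, the plan described above could in principle be filled in by hand with a small oneshot unit ordered between the two targets. The sketch below just prints such a unit. Everything in it is hypothetical: the function name, the unit description, and above all the helper program passed as the first argument, which would have to block until the listed addresses are assigned. The unit only takes effect if it is enabled and some service actually pulls in network-online.target.

```shell
#!/bin/sh
# wait_online_unit HELPER ADDR...: print a oneshot unit that runs HELPER with
# the given addresses, ordered after network.target and before
# network-online.target. HELPER is a hypothetical program that returns only
# once the addresses are configured.
wait_online_unit() {
    helper=$1
    shift
    cat <<EOF
[Unit]
Description=Wait for static addresses (illustrative helper unit)
After=network.target
Before=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=$helper $*

[Install]
WantedBy=network-online.target
EOF
}
```

For example, `wait_online_unit /usr/local/sbin/wait-for-addr 192.168.0.1 216.254.115.190` prints a unit for the LAN and WAN addresses mentioned in this thread; the helper path is made up for the example.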
It should be fun watching all of this implode from the sidelines, as all servers running DHCP and httpd get updated to RHEL 7. Some of them will be fine. Some of them will randomly fail to come up fully. For those that do manage to work initially, at some point a later systemd update, or a kernel update, will subtly change the order in which stuff gets forked off from systemd, and suddenly break them.
Lots of fun.
On 07/06/14 01:00, Sam Varshavchik wrote:
Neither did I, until either the last, or the next to last, systemd update.
As I mentioned. Fully updated. So I think this is the same for you.
[root@f20f ~]# rpm -q systemd
systemd-208-19.fc20.x86_64
[root@f20f ~]# uname -r
3.14.9-200.fc20.x86_64
I also run with NetworkManager-wait-online.service enabled. There was a specific reason I started running with that enabled....don't remember why. But, you may want to check that.
The server with dhcp, httpd, named, and privoxy does not have NetworkManager installed. Both the WAN and the LAN ports are configured as static IPs.
You may want to try NetworkManager and wait-online. WAN links can take time to become active.
I suppose the bottom line is I can't confirm your issue. I've not made any changes to the default systemd config or service files.
Ed Greshko writes:
You may want to try NetworkManager and wait-online. WAN links can take time to become active.
NetworkManager does nothing for me, except to suck in another 16 packages it depends on, that I don't need. The WAN link carries statically-routed IP traffic. It's up as soon as the network interface brings up the link.
I suppose the bottom line is I can't confirm your issue. I've not made any changes to the default systemd config or service files.
Neither did I, yet it broke a week or so ago. And I can now see exactly what's broken, and it was always a broken systemd configuration; except that solely by the roll of the dice, it worked until recently.
On 07/06/14 07:28, Sam Varshavchik wrote:
NetworkManager does nothing for me, except to suck in another 16 packages it depends on, that I don't need. The WAN link carries statically-routed IP traffic. It's up as soon as the network interface brings up the link.
Now that I re-read what you wrote, I think I misunderstood WAN to be a wireless connection, which would require some time to bring up the link with the AP. If you mean WAN in the sense of a Wide Area Network, I wonder if there is a need for PPP negotiation.
I suppose the bottom line is I can't confirm your issue. I've not made any changes to the default systemd config or service files.
Neither did I, yet it broke a week or so ago. And I can now see exactly what's broken, and it was always a broken systemd configuration; except that solely by the roll of the dice, it worked until recently.
Well, if you know exactly what is broken and how it is broken, then I suppose it is time for bugzilla.
Ed Greshko writes:
Now that I re-read what you wrote, I think I misunderstood WAN to be a wireless connection, which would require some time to bring up the link with the AP. If you mean WAN in the sense of a Wide Area Network, I wonder if there is a need for PPP negotiation.
I didn't say PPP either. My WAN is not some screwed up pseudo-Internet connection that uses ppp over some godforsaken transport.
It is really a connection directly to the intertubes. As in: it swallows IP packets. The ISP on the other side sends packets for my IP addresses to my link, and I get them.
/etc/sysconfig/ifcfg-wan0 is simply:
TYPE=Ethernet
BOOTPROTO=none
IPADDR=216.254.115.190
ONBOOT=yes
plus a few miscellaneous settings that are not relevant.
None of this needs NetworkManager. The link comes up immediately, as soon as the network port powers up.
Similarly, the LAN0 port also has a hardwired IP address on it, and dhcpd is supposed to sink its claws into it, and manage my LAN.
That's, pretty much, how servers talk to the intertubes from a data center.
And now, any plain, garden-variety server that's hooked up directly to the intertubes, and does NOT bind to a TCP or a UDP port using a wildcard IP, is going to roll the dice: heads, and systemd happens to be busy before it gets around to it, and the network ports manage to finish initializing before the server gets forked off; tails, and it's hopelessly fracked up, because systemd is going to fork it off before the network port has finished initializing.
I don't see anything in /lib/systemd/system that directly executes /etc/rc.d/init.d/network. So, it appears that this is systemd internal magic – it forks off /etc/rc.d/init.d/network, and marks network.target as being reached without waiting for this initscript to terminate. You know: to make boot faster. What a great idea.
So, systemd now becomes very busy forking off dhcp, named, httpd, innd, and whatever else declares After=network.target, before /etc/rc.d/init.d/network finishes sifting through /etc/sysconfig/network-scripts and finishes bringing up all the network interfaces.
Hilarity ensues.
On 07/06/14 08:32, Sam Varshavchik wrote:
I didn't say PPP either. My WAN is not some screwed up pseudo-Internet connection that uses ppp over some godforsaken transport.
Of course I didn't say you said PPP. Some areas of the world still use PPP and it works fine and dandy. I didn't want to assume anything since I don't know where you're located.
Hilarity ensues.
Any plans to bugzilla your issue? It should be fun to follow.
Ed Greshko writes:
Hilarity ensues.
Any plans to bugzilla your issue? It should be fun to follow.
Well, I thought about this. For about 30 seconds.
I'm going to say something controversial now. It'll likely send some flames my way; and my ethics will get questioned. Maybe, I suppose, someday I'll care about that, but not today.
This is the very last thing that I'd like to do. Rather, I want systemd to fail. To go down, hard, rather than get some help in band-aiding its fall-out.
I'd like to see Red Hat start getting tickets from angry, paying customers, who, after upgrading to RHEL 7, find half their servers coming up deaf and mute. And a different subset of servers having a problem with every boot cycle.
I'm tired of dealing, first with the PrivateTmp fiasco, and now this, and who knows what's next. Hopefully, at least someone might understand why someone else will NOT want to help fixing this mess, but, if not, that's ok too.
On 07/06/14 09:17, Sam Varshavchik wrote:
Well, I thought about this. For about 30 seconds.
I'm going to say something controversial now. It'll likely send some flames my way; and my ethics will get questioned. Maybe, I suppose, someday I'll care about that, but not today.
This is the very last thing that I'd like to do. Rather, I want systemd to fail. To go down, hard, rather than get some help in band-aiding its fall-out.
I'd like to see Red Hat start getting tickets from angry, paying customers, who, after upgrading to RHEL 7, find half their servers coming up deaf and mute. And a different subset of servers having a problem with every boot cycle.
I'm tired of dealing, first with the PrivateTmp fiasco, and now this, and who knows what's next. Hopefully, at least someone might understand why someone else will NOT want to help fixing this mess, but, if not, that's ok too.
OK. Up to you. I suppose this thread can be archived to "rant". :-) :-)
On 07/06/2014 03:17 AM, Sam Varshavchik wrote:
Well, I thought about this. For about 30 seconds.
I'm going to say something controversial now. It'll likely send some flames my way; and my ethics will get questioned. Maybe, I suppose, someday I'll care about that, but not today.
This is the very last thing that I'd like to do. Rather, I want systemd to fail. To go down, hard, rather than get some help in band-aiding its fall-out.
I'd like to see Red Hat start getting tickets from angry, paying customers, who, after upgrading to RHEL 7, find half their servers coming up deaf and mute. And a different subset of servers having a problem with every boot cycle.
I'm tired of dealing, first with the PrivateTmp fiasco, and now this, and who knows what's next. Hopefully, at least someone might understand why someone else will NOT want to help fixing this mess, but, if not, that's ok too.
It is not at all controversial; it is U.B.E.R. L.A.M.E. of you! Like a true opportunist, you expect someone else to solve your specific problem.
poma
Ed Greshko writes:
I suppose the bottom line is I can't confirm your issue. I've not made any changes to the default systemd config or service files.
I guess I get to call dibs here. This issue is the one that bedeviled me with postfix (and nsd and ejabberd), and the solution, using network-online.target, is the same.
The issue is real.
On 07/06/14 08:22, David Benfell wrote:
I guess I get to call dibs here. This issue is the one that bedeviled me with postfix (and nsd and ejabberd), and the solution, using network-online.target, is the same.
The issue is real.
Of course I didn't say it isn't real for others. Just saying I can't confirm it with my configuration and services I'm running....even if some of them are the same as others. Thus my leading with "FWIW".
So, of course, the best option to get this some attention would be to bugzilla it. Did you just fix your problem or did you also bugzilla it?
Ed Greshko writes:
Did you just fix your problem or did you also bugzilla it?
I was having trouble getting into bugzilla and set it aside. But I've revisited this: https://bugzilla.redhat.com/show_bug.cgi?id=1116539