Just had another boot that utterly failed because services that were supposed to start only after the network interfaces came up, didn't. They started too soon. Privoxy's logfile has the smoking gun:
2017-12-16 09:42:39.875 7f4acdaa6740 Fatal error: can't bind to 192.168.0.1:8000: Cannot assign requested address
This is despite the fact that NetworkManager-wait-online was enabled and active. This is what everyone kept telling me was the only thing that needed to be done. Well, it was enabled:
[root@shorty system]# systemctl status NetworkManager-wait-online
● NetworkManager-wait-online.service - Network Manager Wait Online
   Loaded: loaded (/usr/lib/systemd/system/NetworkManager-wait-online.service; e
   Active: active (exited) since Sat 2017-12-16 09:42:39 EST; 12min ago
     Docs: man:nm-online(1)
  Process: 977 ExecStart=/usr/bin/nm-online -s -q --timeout=30 (code=exited, sta
 Main PID: 977 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/NetworkManager-wait-online.service
Dec 16 09:42:32 shorty.email-scan.com systemd[1]: Starting Network Manager Wait
Dec 16 09:42:39 shorty.email-scan.com systemd[1]: Started Network Manager Wait O
It came up at 09:42:39. And, at 09:42:39 privoxy also came up. And privoxy still blew chunks because the primary network interface wasn't up yet.
Why is it so friggin difficult to get something this simple, this basic concept of starting things only after the network interfaces are up, working correctly, and reliably?
Oh yeah, I know. systemd.
On Sat, 2017-12-16 at 10:03 -0500, Sam Varshavchik wrote:
Just had another boot that utterly failed because services that were supposed to start only after the network interfaces came up, didn't. They started too soon. ...
Why is it so friggin difficult to get something this simple, this basic concept of starting things only after the network interfaces are up, working correctly, and reliably?
Oh yeah, I know. systemd.
Similar problem here. Latest updates mean that my NFS mounts in /etc/fstab fail on boot because the network is said to be running before DHCP has sorted itself out! This assumes I am interpreting journalctl output correctly.
Dec 16 13:26:33 hayling.jaa.org.uk NetworkManager[1148]: <info> [1513430793.6518] manager: startup complete
Dec 16 13:26:33 hayling.jaa.org.uk systemd[1]: Started Network Manager Wait Online.
Dec 16 13:26:33 hayling.jaa.org.uk audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-wait-online comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Dec 16 13:26:33 hayling.jaa.org.uk systemd[1]: Reached target Network is Online.  <<<<<<<<<<<<<<<<<<<<<
Dec 16 13:26:33 hayling.jaa.org.uk systemd[1]: Mounting /global...
Dec 16 13:26:33 hayling.jaa.org.uk systemd[1]: Starting Notify NFS peers of a restart...
Dec 16 13:26:33 hayling.jaa.org.uk systemd[1]: Mounting /home...
Dec 16 13:26:33 hayling.jaa.org.uk sm-notify[2847]: Version 2.2.1 starting
Dec 16 13:26:33 hayling.jaa.org.uk systemd[1]: Started Notify NFS peers of a restart.
Dec 16 13:26:33 hayling.jaa.org.uk audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=rpc-statd-notify comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Dec 16 13:26:33 hayling.jaa.org.uk kernel: FS-Cache: Loaded
Dec 16 13:26:33 hayling.jaa.org.uk kernel: FS-Cache: Netfs 'nfs' registered for caching
Dec 16 13:26:33 hayling.jaa.org.uk kernel: Key type dns_resolver registered
Dec 16 13:26:33 hayling.jaa.org.uk kernel: NFS: Registering the id_resolver key type
Dec 16 13:26:33 hayling.jaa.org.uk kernel: Key type id_resolver registered
Dec 16 13:26:33 hayling.jaa.org.uk kernel: Key type id_legacy registered
Dec 16 13:26:33 hayling.jaa.org.uk systemd[1]: global.mount: Mount process exited, code=exited status=32
Dec 16 13:26:33 hayling.jaa.org.uk systemd[1]: Failed to mount /global.
Dec 16 13:26:33 hayling.jaa.org.uk systemd[1]: Dependency failed for Remote File Systems.
Dec 16 13:26:33 hayling.jaa.org.uk systemd[1]: remote-fs.target: Job remote-fs.target/start failed with result 'dependency'.
Dec 16 13:26:33 hayling.jaa.org.uk systemd[1]: global.mount: Unit entered failed state.
Dec 16 13:26:33 hayling.jaa.org.uk systemd[1]: home.mount: Mount process exited, code=exited status=32
Dec 16 13:26:33 hayling.jaa.org.uk systemd[1]: Failed to mount /home.
Dec 16 13:26:33 hayling.jaa.org.uk systemd[1]: home.mount: Unit entered failed state.
1 second later??!!! ...
Dec 16 13:26:34 hayling.jaa.org.uk NetworkManager[1148]: <info> [1513430794.4301] device (enp0s31f6): state change: config -> ip-config (reason 'none', internal state 'managed')
Dec 16 13:26:34 hayling.jaa.org.uk NetworkManager[1148]: <info> [1513430794.4306] dhcp4 (enp0s31f6): activation: beginning transaction (timeout in 45 seconds)
Dec 16 13:26:34 hayling.jaa.org.uk NetworkManager[1148]: <info> [1513430794.4383] dhcp4 (enp0s31f6): dhclient started with pid 2886
Dec 16 13:26:34 hayling.jaa.org.uk dhclient[2886]: DHCPREQUEST on enp0s31f6 to 255.255.255.255 port 67 (xid=0x77026130)
Dec 16 13:26:34 hayling.jaa.org.uk dhclient[2886]: DHCPACK from 148.197.29.5 (xid=0x77026130)
Dec 16 13:26:34 hayling.jaa.org.uk NetworkManager[1148]: <info> [1513430794.4715] dhcp4 (enp0s31f6): address 148.197.29.202
Dec 16 13:26:34 hayling.jaa.org.uk NetworkManager[1148]: <info> [1513430794.4715] dhcp4 (enp0s31f6): plen 24 (255.255.255.0)
Dec 16 13:26:34 hayling.jaa.org.uk NetworkManager[1148]: <info> [1513430794.4715] dhcp4 (enp0s31f6): gateway 148.197.29.254
Dec 16 13:26:34 hayling.jaa.org.uk NetworkManager[1148]: <info> [1513430794.4715] dhcp4 (enp0s31f6): lease time 86400
Dec 16 13:26:34 hayling.jaa.org.uk NetworkManager[1148]: <info> [1513430794.4715] dhcp4 (enp0s31f6): nameserver '148.197.29.5'
Dec 16 13:26:34 hayling.jaa.org.uk NetworkManager[1148]: <info> [1513430794.4715] dhcp4 (enp0s31f6): nameserver '212.104.130.9'
Dec 16 13:26:34 hayling.jaa.org.uk NetworkManager[1148]: <info> [1513430794.4715] dhcp4 (enp0s31f6): domain name 'jaa.org.uk'
Dec 16 13:26:34 hayling.jaa.org.uk NetworkManager[1148]: <info> [1513430794.4715] dhcp4 (enp0s31f6): state changed unknown -> bound
Anyone else with the same problem?
John
On Sun, 17 Dec 2017 11:07:52 +0000 Dr J Austin wrote:
Anyone else with the same problem?
I gave up on getting NFS mounts to work a long time ago. I have a job I start with "at now" in rc.local that delays for a few seconds then does all the NFS mounts. I have other commands in there as well to delay the start of other services that need the network and often don't come up correctly in a normal boot.
P.S. The use of "at now" is required by yet more stupidity introduced by systemd which seems to believe it desperately needs to keep track of everything started by rc.local despite the fact that rc.local has always been a place to put ad-hoc random junk.
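For anyone wanting to copy that workaround, the shape of it is roughly the sketch below. The delay and the mount command are placeholders, not the actual values used; the point of "at now" is that atd runs the job outside of rc-local.service, so systemd neither tracks nor waits on it.

# roughly the shape of the rc.local hack described above; the 10-second
# delay and the mount command are placeholders
echo 'sleep 10; mount -a -t nfs,nfs4' | at now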
Dr J Austin writes:
Similar problem here. Latest updates mean that my NFS mounts in /etc/fstab fail on boot because the network is said to be running before DHCP has sorted itself out! This assumes I am interpreting journalctl output correctly.
This utter, miserable fail of accomplishing the most simple, basic, elementary tasks when it comes to booting a server – waiting until its network interfaces are set up before launching stuff that requires those network services – really ticked me off yesterday, to the point that I spent a little bit of my time trying to do something about it.
It'll be a while before I can verify that my attempted fix actually works, because in my case these embarrassing failures are very sporadic, and happen only once in a while. But in the meantime, heck, why not, I decided to toss the whole thing on github:
https://github.com/svarshavchik/unfrak-systemd-network
If this happens to fix someone else's boot issues, great. I think this will work with DHCP, provided that the DHCP-assigned IP addresses are static.
On Sat, Dec 16, 2017 at 10:03 AM, Sam Varshavchik mrsam@courier-mta.com wrote:
This is despite the fact that NetworkManager-wait-online was enabled and active. ...
  Process: 977 ExecStart=/usr/bin/nm-online -s -q --timeout=30 (code=exited, sta
...
It came up at 09:42:39. And, at 09:42:39 privoxy also came up. And privoxy still blew chunks because the primary network interface wasn't up yet.
I've just read "man nm-online" (possibly for the first time!) and "nm-online -s" doesn't seem to do what "NetworkManager-wait-online.service" is supposed to do, systemd-wise. It waits "for NetworkManager startup to complete, rather than waiting for network connectivity specifically." :(
Does privoxy start properly if you disable NM and its wait-online unit and use systemd-networkd and its wait-online unit?
Does privoxy start properly if you disable NM's wait-online unit and use a custom wait-online unit that waits until "/sys/class/net/<IF>/carrier" is "1"?
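For what it's worth, such a carrier-wait unit could look roughly like the sketch below. The unit name, the hard-coded interface, and the ordering directives are all my assumptions, not something from this thread; note also that carrier only proves link, not that an IP address has been assigned yet.

# /etc/systemd/system/wait-for-carrier.service (hypothetical)
[Unit]
Description=Wait for carrier on eno1
After=NetworkManager.service
Before=network-online.target

[Service]
Type=oneshot
# poll the sysfs carrier flag; reading it fails while the interface is
# administratively down, so errors are suppressed and the loop retries
ExecStart=/bin/sh -c 'until grep -qsx 1 /sys/class/net/eno1/carrier; do sleep 1; done'

[Install]
WantedBy=network-online.target

After "systemctl daemon-reload" and "systemctl enable wait-for-carrier.service", anything ordered after network-online.target would also wait for the carrier.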
On Sun, 2017-12-17 at 08:20 -0500, Tom Horsley wrote:
I gave up on getting NFS mounts to work a long time ago. I have a job I start with "at now" in rc.local that delays for a few seconds then does all the NFS mounts. ...
I gave up using ldap/autofs some time ago, and until the last update the _netdev option in fstab (or something else, maybe) had been working perfectly.
148.197.29.5:/home    /home    nfs4  defaults,_netdev  0 0
148.197.29.5:/global  /global  nfs4  defaults,_netdev  0 0
On 12/17/2017 07:18 AM, Tom H wrote:
I've just read "man nm-online" (possibly for the first time!) and "nm-online -s" doesn't seem to do what "NetworkManager-wait-online.service" is supposed to do, systemd-wise. It waits "for NetworkManager startup to complete, rather than waiting for network connectivity specifically.":(
Does privoxy start properly if you disable NM and its wait-online unit and use systemd-networkd and its wait-online unit?
I think a more interesting question might be: Does privoxy start properly if you modify that line to read "ExecStart=/usr/bin/nm-online -q --timeout=30"?
On Sun, Dec 17, 2017 at 11:23 AM, Gordon Messmer gordon.messmer@gmail.com wrote:
I think a more interesting question might be: Does privoxy start properly if you modify that line to read "ExecStart=/usr/bin/nm-online -q --timeout=30"?
Yes, it's the same thing as doing what I suggested next and you snipped out.
On Sun, Dec 17, 2017 at 8:41 AM, Dr J Austin ja@jaa.org.uk wrote:
I gave up using ldap/autofs some time ago and until the last update the _netdev option in fstab (or something else maybe) has been working perfectly.
148.197.29.5:/home    /home    nfs4  defaults,_netdev  0 0
148.197.29.5:/global  /global  nfs4  defaults,_netdev  0 0
a. defaults + _netdev is confusing; it should be defaults by itself OR options
b. instead of _netdev you probably want some combination of the following: noauto,x-systemd.automount
I always use that combination for /boot/efi, local data volumes in /srv, and for network shares.
Also, fstab should include a valid hostname rather than an IP. I think it's trying to mount when the network interface alone is up but other services that NFS depends on, including DHCP, are not; if you include the name here, all of those things will have to come up before the connection is even attempted.
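As an illustration, one of those entries rewritten along those lines might look like the following; the hostname is a made-up stand-in for 148.197.29.5, and the option set is only one plausible combination:

fileserver.jaa.org.uk:/home    /home    nfs4  noauto,x-systemd.automount  0 0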
My understanding of NetworkManager-wait-online is to make sure that network connections are disconnected before the network is taken down on a reboot/shutdown. I don't know what its role is during startup.
There's a lot more here: https://wiki.archlinux.org/index.php/NFS
On 12/17/2017 08:38 AM, Tom H wrote:
Yes, it's the same thing as doing what I suggested next and you snipped out.
Not exactly... I'd like to know if this commit is the one that broke the service:
https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/data/Netwo...
In order to determine that, someone who can reproduce the problem needs to revert that specific change on their system.
I can't reproduce the problem, so I can't test this. For me, NetworkManager-wait-online already works correctly.
I've asked you and others to provide more information about the network-online problem several times (for example https://www.spinics.net/linux/fedora/fedora-users/msg479604.html), but it seems like people would rather bitch about systemd than provide the information necessary to fix the problem. As far as that goes, this looks like a bug in NetworkManager, if anything, and not systemd.
On Sun, 17 Dec 2017 11:49:39 -0800 Gordon Messmer wrote:
Not exactly... I'd like to know if this commit is the one that broke the service:
https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/data/Netwo...
In order to determine that, someone who can reproduce the problem needs to revert that specific change on their system.
I did that: it did not solve the problem for me.
In my case some NFS mounts complain with either:
failed: No route to host
or
failed: Resource temporarily unavailable
but the mount ends up succeeding ~20s afterwards.
On another server the mounts failed forever.
Example: with that declaration in /etc/fstab:
Y:/data2 /data2 nfs rw,bg,hard,tcp,intr 0 0
No _netdev, but I don't think it is useful with NetworkManager.
## ~7 seconds, similar to the time spent to see the routes below
Dec 17 21:32:51 X systemd[1]: Starting Network Manager Wait Online...
Dec 17 21:32:59 X systemd[1]: Started Network Manager Wait Online.
Dec 17 21:32:59 X systemd[1]: Mounting /data2...
Dec 17 21:33:19 X mount[996]: mount to NFS server 'Y' failed: Resource temporarily unavailable, retrying
Dec 17 21:33:19 X systemd[1]: Mounted /data2.
with my "drop-in" routes-gw.conf those mounts are instantaneous:
Dec 17 17:32:09 X systemd[1]: Starting Network Manager Wait Online...
...
Dec 17 17:32:16 X bash[735]: Network Manager Wait Online routes took 7 seconds
Dec 17 17:32:45 X bash[735]: Network Manager Wait Online gateway took 29 seconds
Dec 17 17:32:45 X systemd[1]: Started Network Manager Wait Online.
Dec 17 17:32:45 X systemd[1]: Mounting /data2...
Dec 17 17:32:45 X systemd[1]: Mounted /data2.
I can give more details if you need.
Tom H writes:
Does privoxy start properly if you disable NM and its wait-online unit and use systemd-networkd and its wait-online unit?
Does privoxy start properly if you disable NM's wait-online unit and use a custom wait-online unit that waits until "/sys/class/net/<IF>/carrier" is "1"?
privoxy starts properly most of the time right now, already.
But I really don't understand why so much research is needed for this issue, by disabling random things, and then trying other random things. Either NetworkManager-wait-online actually waits until the network interfaces have their IP address set, or it doesn't. If it's supposed to do it, then it should be possible to isolate the issue without turning it off completely and switching to a completely different network configuration infrastructure. If it's not supposed to do it, then what exactly is it supposed to be doing, anyway?
On 17Dec2017 18:05, sam varshavchik mrsam@courier-mta.com wrote:
But I really don't understand why so much research is needed for this issue, by disabling random things, and then trying other random things. Either NetworkManager-wait-online actually waits until the network interfaces have their IP address set, or it doesn't. If it's supposed to do it, then it should be possible to isolate the issue without turning it off completely and switching to a completely different network configuration infrastructure. If it's not supposed to do it, then what exactly is it supposed to be doing, anyway?
Someone pointed out that it seems to wait for the network services to start, not for the IP assignments/allocation to complete.
As a scenario, consider dhclient. It typically tries to conduct an initial DHCP negotiation, but after a little while backgrounds itself (and continues to try).
What is your desired behaviour when DHCP service is not available? Have your system boot hang until it is? For a home LAN that might be tenable, _if_ you are always physically there to remedy things if necessary. (Of course, if my laptop needs to come up before I can easily remedy things I have a problem.)
But when DHCP comes from one's ISP (be that a real ISP or some LAN you don't personally manage)?
Just outlining why NetworkManager-wait-online may not be doing what people had hoped. Instead, it may be doing what is reasonable, and its completion doesn't imply what people are needing for their next stuff.
Myself, I start far more than I used to as services via my own account, and some of those services defer their startup until I have a default route, a fair proxy for having the network "up". They all monitor a flag and I have a little script to watch for a default route. Conversely, they also shutdown if that flag goes false.
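A check like that can be very small; below is a minimal sketch of the default-route test Cameron describes (his actual script is not shown in the thread, and the 60-second cap is an arbitrary choice of mine):

# wait up to 60 seconds for a default route to appear
n=0
until ip route | grep -q '^default '; do
    n=$((n+1))
    [ "$n" -ge 60 ] && exit 1
    sleep 1
done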
I'm not trying to defend systemd here, just saying that (a) it isn't necessarily unreasonable for NetworkManager-wait-online to not imply the condition people thought and that (b) I personally take more things on myself, the better to be independent of systemd or whatever other quaint fashions my distro may follow.
Cheers, Cameron Simpson cs@cskk.id.au (formerly cs@zip.com.au)
On Sun, Dec 17, 2017 at 11:44 AM, Chris Murphy lists@colorremedies.com wrote:
On Sun, Dec 17, 2017 at 8:41 AM, Dr J Austin ja@jaa.org.uk wrote:
I gave up using ldap/autofs some time ago and until the last update the _netdev option in fstab (or something else maybe) has been working perfectly.
148.197.29.5:/home    /home    nfs4  defaults,_netdev  0 0
148.197.29.5:/global  /global  nfs4  defaults,_netdev  0 0
a. defaults + _netdev is confusing; it should be defaults by itself OR options
Confusing, possibly. But correct, even though redundant. But I'd agree that it should only be used if no other options are set.
b. instead of _netdev you probably want some combination of the following: noauto,x-systemd.automount
I always use that combination for /boot/efi, local data volumes in /srv, and for network shares.
Also, fstab should include a valid hostname rather than IP, I think it's trying to mount when the network interface alone is up but there still are not other NFS dependent services including dhcp, so if you include the name here, all of those thing will have to come up before the connection is even attempted.
"noauto,x-systemd.automount" would work but is it what the OP wants? He could've used autofs if he wanted an automount and didn't know about systemd's integrated automount feature.
My understanding of NetworkManager-wait-online is to make sure that network connections are disconnected before the network is taken down on a reboot/shutdown. I don't know what its role is during startup.
There's a lot more here: https://wiki.archlinux.org/index.php/NFS
It's surprising that the Arch wiki doesn't mention startup.
In the same way that the wait-online unit ensures that the nfs shares are unmounted before the network is taken down, it ensures that the network is up before the nfs shares are mounted.
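For reference, the standard way for a unit to opt into this ordering (documented by systemd) is the pair of directives below in the unit file; the wait-online services only decide when network-online.target is considered reached:

[Unit]
Wants=network-online.target
After=network-online.target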
On Sun, Dec 17, 2017 at 2:49 PM, Gordon Messmer gordon.messmer@gmail.com wrote:
On 12/17/2017 08:38 AM, Tom H wrote:
Yes, it's the same thing as doing what I suggested next and you snipped out.
Not exactly... I'd like to know if this commit is the one that broke the service:
https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/data/Netwo...
Oops. Thanks for the correction!
I only read the "-s" section of the man page and didn't think of "nm-online -q" instead of "nm-online -s -q" :(
Cameron Simpson writes:
On 17Dec2017 18:05, sam varshavchik mrsam@courier-mta.com wrote:
But I really don't understand why so much research is needed for this issue, by disabling random things, and then trying other random things. ...
Someone pointed out that it seems to wait for the network services to start, not for the IP assignments/allocation to complete.
As a scenario, consider dhclient. It typically tries to conduct an initial DHCP negotiation, but after a little while backgrounds itself (and continues to try).
What is your desired behaviour when DHCP service is not available?
During the process of trying to obtain or reauthorize a DHCP lease, you have to reach a point where a conclusion is made: yes, I have successfully acquired an IP address; or: no, I have not. Now if it's "I have not", you may very well go into background and keep trying. But at this point a statement can be made: this particular network interface is initialized, or initialization has failed.
In the old days (you know, when systems managed to boot properly, how I miss them), at that point the boot continues and let the chips fall where they may.
It may certainly be possible that the current dhclient cannot be made to work this way. If so then, yes, this would be an issue. However, that's completely beside the point. Here we're not talking DHCP, we're talking about STATICALLY ASSIGNED IP addresses. Fixed IP addresses. It is not rocket science, and it is not unreasonable to expect that it should be easy to establish when the network ports have been turned on and assigned those IP addresses, so everything that uses those IP addresses can now be started.
Can someone tell me if what I'm asking for is really too much, and impossible these days: don't start things at boot that depend on the network interfaces with static IP addresses until those interfaces are actually up.
When we had initscripts, I forget which one it was, but there was one that read all the config files, and enabled those interfaces. And stuff that depended on the network being up ran after that. Simple. Easy. So, again: is it unreasonable to be able to start things that require the statically-assigned IP addresses after they actually are assigned to their network ports? Did something change, in the world we live in, where this is not possible any more? And what exactly did change, that made this a logically impossible, herculean task?
But when DHCP comes from one's ISP (be that a real ISP or some LAN you don't personally manage)?
Just outlining why NetworkManager-wait-online may not be doing what people had hoped.
Maybe, but, again it's completely irrelevant. We're not talking about DHCP. We're talking about fixed IP addresses. If NetworkManager – and if it is indeed NetworkManager that's totally fraking up here – cannot do such a simple, basic, elementary task as enabling ports with fixed IP addresses, at a well-defined point when the system boots, then, I guess we can reach the conclusion that NetworkManager is totally incapable of accomplishing basic networking administration tasks, and move on to consider ugly hacks like https://github.com/svarshavchik/unfrak-systemd-network to be the only way to actually reliably boot a server, in the brave new world of systemd and NetworkManager, and there wouldn't be any reason to waste any more breath on this nonsense.
Well, this is a fine way to start things on a Monday morning – this, and also reading another dumpster fire in my mailbox, a different four-year-old bug which is pretty much the same thing – crap running when it's not supposed to be running during boot – getting reopened.
Allegedly, on or about 18 December 2017, Cameron Simpson sent:
NetworkManager-wait-online may not be doing what people had hoped.
I would have thought something with "wait online" in its name would actually wait for it to be on-line. To me, "on-line" means connected and operational. Starting up but not actually on-line, isn't on-line.
Sam Varshavchik writes:
a well-defined point when the system boots, then, I guess we can reach the conclusion that NetworkManager is totally incapable of accomplishing basic networking administration tasks, and move on to consider ugly hacks like https://github.com/svarshavchik/unfrak-systemd-network to be the only way to
And to close the loop on this dumpster fire, the evidence is in: NetworkManager-wait-online does not really wait until anything is online. Got solid proof, right here. I just installed and tested the unfrak-systemd-network script on that server of mine where things often don't get started properly. This script quickly tries to bind to the server's known IP addresses, logs the results, and tries every second until it succeeds, then E-mails me the report.
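The actual script is in the repository linked earlier; the core idea reduces to roughly the sketch below. This is a paraphrase, not the real code: it checks for address presence with ip(8) once per second, where the real script attempts an actual bind().

#!/bin/sh
# poll until every expected address is present; 192.168.0.1 is a stand-in
# for "the server's known IP addresses"
for addr in 192.168.0.1; do
    until ip -o -4 addr show | grep -qF " $addr/"; do
        echo "$(date +%H:%M:%S) still waiting for $addr"
        sleep 1
    done
done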
Well, on this particular boot, which succeeded, NetworkManager-wait-online.service alleged "Started" at 08:35:33:
Dec 18 08:35:27 shorty.email-scan.com systemd[1]: Starting Network Manager Wait
Dec 18 08:35:33 shorty.email-scan.com systemd[1]: Started Network Manager Wait O
unfrak-systemd-network ran a second later:
Dec 18 08:35:33 shorty.email-scan.com systemd[1]: Starting Unfrak systemd networ
Dec 18 08:35:35 shorty.email-scan.com systemd[1]: Started Unfrak systemd network
And the emailed report from the script landed in my mailbox:
Subject: systemd network initialization unfrak report
Message-ID: courier.000000005A37C427.0000058D@www.courier-mta.com
Date: Mon, 18 Dec 2017 08:35:35 -0500
--------------------------------------------------------------------------------
Time      IP addresses
========  ==================
08:35:34
08:35:35  192.168.0.1
At 08:35:34 the server had no IP addresses, a full second after NetworkManager assured the rest of the system that it's "online". Finally, 192.168.0.1 came up a second later.
This service is keyed to run after NetworkManager-wait-online:
[Unit]
Description=Unfrak systemd network startup
After=NetworkManager-wait-online.service systemd-networkd-wait-online.service
Before=network-online.target
Now that unfrak-systemd-network injects itself into the dependency chain and delays it, it looks like all the problematic services get delayed until the IP addresses are really there:
Dec 18 08:35:35 shorty.email-scan.com systemd[1]: Starting Privoxy Web Proxy Wit
Dec 18 08:35:36 shorty.email-scan.com systemd[1]: Started Privoxy Web Proxy With
Dec 18 08:35:35 shorty.email-scan.com systemd[1]: Starting Courier SOCKS 5 proxy
Dec 18 08:35:35 shorty.email-scan.com systemd[1]: Started Courier SOCKS 5 proxy.
It's a sad, sad state of affairs when one needs to resort to these kinds of hacks just to get a server booted correctly.
On Mon, 18 Dec 2017 07:19:02 -0500 Sam Varshavchik mrsam@courier-mta.com wrote:
When we had initscripts, I forget which one it was, but there was one that read all the config files, and enabled those interfaces. And stuff that depended on the network being up ran after that. Simple. Easy. So, again: is it unreasonable to be able to start things that require the statically-assigned IP addresses after they actually are assigned to their network ports? Did something change, in the world we live in, where this is not possible any more? And what exactly did change, that made this a logically impossible, herculean task?
I think you are talking about the days before we had multi-threaded boot, and boot was deterministic. Boot went to multi-threaded, and so became non-deterministic, in order to shorten boot time. Maybe part of the solution is a kernel (or systemd) switch that says, "I don't care if boot takes 10 or 20 seconds (or a minute) longer, I want it to be deterministic." If that switch was set, then things would always run sequentially in a fixed order, unlike now.
On Mon, 2017-12-18 at 09:56 -0700, stan wrote:
I think you are talking about the days before we had multi-threaded boot, and boot was deterministic. Boot went to multi-threaded, and so became non-deterministic, in order to shorten boot time. ...
Many thanks for the suggestions & feedback
AFAIK I have not changed anything and I cannot get the machine to fail again with the problem of DHCP running after the nfs mount attempts.
Yesterday I could not get it to boot cleanly in about 10 attempts!
Maybe it was a network problem, maybe a hardware problem?
Maybe some form of race condition with NetworkManager-wait-online and friends
I will bring a second F27 machine up to the same "update" state and see what happens
On Mon, 18 Dec 2017 09:56:03 -0700 stan wrote:
Boot went to multi-threaded, and so became non-deterministic, in order to shorten boot time.
And that has saved oh so much time compared to the time every sysadmin has spent fighting with it :-).
On 12/17/2017 01:17 PM, Francis.Montagnac@inria.fr wrote:
On Sun, 17 Dec 2017 11:49:39 -0800 Gordon Messmer wrote:
In order to determine that, someone who can reproduce the problem needs to revert that specific change on their system.
With that declaration in /etc/fstab:
Y:/data2 /data2 nfs rw,bg,hard,tcp,intr 0 0
Dec 17 21:32:59 X systemd[1]: Mounting /data2...
Dec 17 21:33:19 X mount[996]: mount to NFS server 'Y' failed: Resource temporarily unavailable, retrying
Dec 17 21:33:19 X systemd[1]: Mounted /data2.
Can you reproduce that condition and then get the output of "systemctl status NetworkManager-wait-online"? That should confirm that the ExecStart process does not include the -s flag.
On 12/18/2017 05:52 AM, Sam Varshavchik wrote:
Time      IP addresses
========  ==================
08:35:34
08:35:35  192.168.0.1
At 08:35:34 the server had no IP addresses
Well, it probably had 127.0.0.1, which brings into question what the complete state of the network was.
Could you arrange to execute "ip addr show | logger" in your unfrak script? That way we get all of the interfaces and all of the addresses regardless of family.
Could you also see if removing the "-s" flag from /usr/lib/systemd/system/NetworkManager-wait-online.service changes the behavior of the system?
Gordon Messmer writes:
On 12/18/2017 05:52 AM, Sam Varshavchik wrote:
Time      IP addresses
========  ==================
08:35:34
08:35:35  192.168.0.1
At 08:35:34 the server had no IP addresses
Well, it probably had 127.0.0.1, which brings into question what the complete state of the network was.
I'm pretty sure it does. My script only checks the IP addresses it knows about. It doesn't check loopback.
Could you arrange to execute "ip addr show | logger" in your unfrak script? That way we get all of the interfaces and all of the addresses regardless of family.
Could you also see if removing the "-s" flag from /usr/lib/systemd/system/NetworkManager-wait-online.service changes the behavior of the system?
I'll do this at the first convenient opportunity.
On Mon, 2017-12-18 at 19:24 -0500, Sam Varshavchik wrote:
Gordon Messmer writes:
Could you also see if removing the "-s" flag from /usr/lib/systemd/system/NetworkManager-wait-online.service changes the behavior of the system?
I'll do this at the first convenient opportunity.
I have now tried to do some sensible testing with the -s option in /usr/lib/systemd/system/NetworkManager-wait-online.service
I have tested in groups of 10 reboots, in the order shown. A fail indicates that the NFS mount has failed:

with -s   2 fails in 10
no -s     0 fails in 10
with -s   3 fails in 10
no -s     0 fails in 10
no -s     0 fails in 10
with -s   4 fails in 10
I have not seen a single failure with the -s removed. Hence my immediate problem appears to be solved!
Many thanks
Gordon Messmer writes:
Could you arrange to execute "ip addr show | logger" in your unfrak script? That way we get all of the interfaces and all of the addresses regardless of family.
Could you also see if removing the "-s" flag from /usr/lib/systemd/system/NetworkManager-wait-online.service changes the behavior of the system?
Running just "ip addr show | logger" was not conclusive. Looks like the overhead of doing so delays things long enough so even the first time this actually runs all the network interfaces have their IP addresses already assigned.
Removing the -s option actually makes things worse. The script has to wait noticeably longer before all IP addresses are assigned:
Subject: systemd network initialization unfrak report
Time      IP addresses
========  ==================
06:52:56
06:52:57
06:52:58
06:52:59  192.168.0.1
Usually it's 1 or 2 seconds. Without the -s option it's 4-5 seconds.
This seems consistent with the description of what the -s option does, from the man page. The way I parse its man page entry is that the -s option actually waits for more things to happen, before it's done. So removing that option makes NetworkManager's definition of when things are online occur much earlier.
This is confirmed by running "ip addr show | logger" without the -s option. This produces some useful results. This time, the first run of "ip addr show" is early enough that the network is not fully initialized.
syslog shows two runs of "ip addr show", showing no IP addresses configured on one of the two network interfaces. The 2nd network interface already has its IP addresses assigned. This is followed by some messages from NetworkManager, then another run of "ip addr show", showing all network interfaces with assigned IP addresses.
Both eno1 and eno2 have statically assigned IP addresses in /etc/sysconfig/network-scripts.
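For readers who have not used those files: a static ifcfg file of the kind described here typically looks like the snippet below. The values are illustrative, pieced together from the addresses and connection name visible in the logs, not the actual configuration:

# /etc/sysconfig/network-scripts/ifcfg-eno1 (illustrative)
DEVICE=eno1
NAME=lan0
ONBOOT=yes
BOOTPROTO=none
IPADDR=192.168.0.1
PREFIX=24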
Dec 19 06:57:09 shorty systemd[1]: Starting Unfrak systemd network startup...
Dec 19 06:57:09 shorty systemd[1]: iscsi.service: Unit cannot be reloaded because it is inactive.
Dec 19 06:57:09 shorty root[1400]: 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
Dec 19 06:57:09 shorty root[1400]:     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Dec 19 06:57:09 shorty root[1400]:     inet 127.0.0.1/8 scope host lo
Dec 19 06:57:09 shorty root[1400]:        valid_lft forever preferred_lft forever
Dec 19 06:57:09 shorty root[1400]:     inet6 ::1/128 scope host
Dec 19 06:57:09 shorty root[1400]:        valid_lft forever preferred_lft forever
Dec 19 06:57:09 shorty root[1400]: 2: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
Dec 19 06:57:09 shorty root[1400]:     link/ether 0c:c4:7a:32:c1:82 brd ff:ff:ff:ff:ff:ff
Dec 19 06:57:09 shorty root[1400]: 3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
Dec 19 06:57:09 shorty root[1400]:     link/ether 0c:c4:7a:32:c1:83 brd ff:ff:ff:ff:ff:ff
Dec 19 06:57:09 shorty root[1400]:     inet 68.166.206.83/29 brd 68.166.206.87 scope global eno2
Dec 19 06:57:09 shorty root[1400]:        valid_lft forever preferred_lft forever
Dec 19 06:57:09 shorty root[1400]:     inet 68.166.206.82/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:09 shorty root[1400]:        valid_lft forever preferred_lft forever
Dec 19 06:57:09 shorty root[1400]:     inet 68.166.206.84/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:09 shorty root[1400]:        valid_lft forever preferred_lft forever
Dec 19 06:57:09 shorty root[1400]:     inet 68.166.206.85/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:09 shorty root[1400]:        valid_lft forever preferred_lft forever
Dec 19 06:57:09 shorty root[1400]:     inet 68.166.206.86/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:09 shorty root[1400]:        valid_lft forever preferred_lft forever
Dec 19 06:57:09 shorty root[1400]:     inet6 fe80::ec4:7aff:fe32:c183/64 scope link tentative
Dec 19 06:57:09 shorty root[1400]:        valid_lft forever preferred_lft forever
Dec 19 06:57:10 shorty root[1403]: 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
Dec 19 06:57:10 shorty root[1403]:     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Dec 19 06:57:10 shorty root[1403]:     inet 127.0.0.1/8 scope host lo
Dec 19 06:57:10 shorty root[1403]:        valid_lft forever preferred_lft forever
Dec 19 06:57:10 shorty root[1403]:     inet6 ::1/128 scope host
Dec 19 06:57:10 shorty root[1403]:        valid_lft forever preferred_lft forever
Dec 19 06:57:10 shorty root[1403]: 2: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
Dec 19 06:57:10 shorty root[1403]:     link/ether 0c:c4:7a:32:c1:82 brd ff:ff:ff:ff:ff:ff
Dec 19 06:57:10 shorty root[1403]: 3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
Dec 19 06:57:10 shorty root[1403]:     link/ether 0c:c4:7a:32:c1:83 brd ff:ff:ff:ff:ff:ff
Dec 19 06:57:10 shorty root[1403]:     inet 68.166.206.83/29 brd 68.166.206.87 scope global eno2
Dec 19 06:57:10 shorty root[1403]:        valid_lft forever preferred_lft forever
Dec 19 06:57:10 shorty root[1403]:     inet 68.166.206.82/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:10 shorty root[1403]:        valid_lft forever preferred_lft forever
Dec 19 06:57:10 shorty root[1403]:     inet 68.166.206.84/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:10 shorty root[1403]:        valid_lft forever preferred_lft forever
Dec 19 06:57:10 shorty root[1403]:     inet 68.166.206.85/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:10 shorty root[1403]:        valid_lft forever preferred_lft forever
Dec 19 06:57:10 shorty root[1403]:     inet 68.166.206.86/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:10 shorty root[1403]:        valid_lft forever preferred_lft forever
Dec 19 06:57:10 shorty root[1403]:     inet6 fe80::ec4:7aff:fe32:c183/64 scope link tentative
Dec 19 06:57:10 shorty root[1403]:        valid_lft forever preferred_lft forever
Dec 19 06:57:10 shorty systemd-networkd[955]: eno2: Gained IPv6LL
Dec 19 06:57:10 shorty NetworkManager[956]: <info> [1513684630.6860] manager: startup complete
Dec 19 06:57:11 shorty kernel: e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 19 06:57:11 shorty kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eno1: link becomes ready
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4385] device (eno1): link connected
Dec 19 06:57:11 shorty systemd-networkd[955]: eno1: Gained carrier
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4390] device (eno1): state change: unavailable -> disconnected (reason 'carrier-changed', internal state 'managed')
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4396] policy: auto-activating connection 'lan0'
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4405] device (eno1): Activation: starting connection 'lan0' (d1a1ee90-f006-43bb-9cbf-175ad32f6565)
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4406] device (eno1): state change: disconnected -> prepare (reason 'none', internal state 'managed')
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4410] device (eno1): state change: prepare -> config (reason 'none', internal state 'managed')
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4436] device (eno1): state change: config -> ip-config (reason 'none', internal state 'managed')
Dec 19 06:57:11 shorty named[1035]: listening on IPv4 interface eno1, 192.168.0.1#53
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4464] device (eno1): state change: ip-config -> ip-check (reason 'none', internal state 'managed')
Dec 19 06:57:11 shorty nm-dispatcher[990]: req:5 'pre-up' [eno1]: new request (1 scripts)
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4509] device (eno1): state change: ip-check -> secondaries (reason 'none', internal state 'managed')
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4511] device (eno1): state change: secondaries -> activated (reason 'none', internal state 'managed')
Dec 19 06:57:11 shorty root[1415]: 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
Dec 19 06:57:11 shorty root[1415]:     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Dec 19 06:57:11 shorty root[1415]:     inet 127.0.0.1/8 scope host lo
Dec 19 06:57:11 shorty root[1415]:        valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty root[1415]:     inet6 ::1/128 scope host
Dec 19 06:57:11 shorty root[1415]:        valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty root[1415]: 2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
Dec 19 06:57:11 shorty root[1415]:     link/ether 0c:c4:7a:32:c1:82 brd ff:ff:ff:ff:ff:ff
Dec 19 06:57:11 shorty root[1415]:     inet 192.168.0.1/24 brd 192.168.0.255 scope global eno1
Dec 19 06:57:11 shorty root[1415]:        valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty root[1415]:     inet6 fe80::ec4:7aff:fe32:c182/64 scope link tentative
Dec 19 06:57:11 shorty root[1415]:        valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty root[1415]: 3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
Dec 19 06:57:11 shorty root[1415]:     link/ether 0c:c4:7a:32:c1:83 brd ff:ff:ff:ff:ff:ff
Dec 19 06:57:11 shorty root[1415]:     inet 68.166.206.83/29 brd 68.166.206.87 scope global eno2
Dec 19 06:57:11 shorty root[1415]:        valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty root[1415]:     inet 68.166.206.82/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:11 shorty root[1415]:        valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.5589] device (eno1): Activation: successful, device activated.
Dec 19 06:57:11 shorty root[1415]:     inet 68.166.206.84/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:11 shorty root[1415]:        valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty root[1415]:     inet 68.166.206.85/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:11 shorty root[1415]:        valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty root[1415]:     inet 68.166.206.86/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:11 shorty root[1415]:        valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty root[1415]:     inet6 fe80::ec4:7aff:fe32:c183/64 scope link
Dec 19 06:57:11 shorty root[1415]:        valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty nm-dispatcher[990]: req:6 'up' [eno1]: new request (7 scripts)
Dec 19 06:57:11 shorty systemd[1]: Started Unfrak systemd network startup.
Note that "ip addr show" ran as part of the script that has a dependency on nm-line.
After reversing all changes, putting the -s option back in, and not running "ip addr", repeated boots show the status quo restored. The presence of the -s option reduces the additional time that the script needs to wait for all IP addresses to be assigned to 1 or 2 seconds. But it still has to wait, since nm-online -s -q returns too early.
On 12/19/17 19:17, Dr J Austin wrote:
I have now tried to do some sensible testing with the -s option in /usr/lib/systemd/system/NetworkManager-wait-online.service
FWIW, are you aware that you shouldn't make changes to /usr/lib/systemd/system/* files? These can be overwritten by updates. If you want to make changes you should create a file with the same name in /etc/systemd/system.
On Tue, 19 Dec 2017 20:16:51 +0800 Ed Greshko wrote:
FWIW, are you aware that you shouldn't make changes to /usr/lib/systemd/system/* files? These can be overwritten by updates. If you want to make changes you should create a file with the same name in /etc/systemd/system.
And when you do that, and they utterly redesign the original service file, you spend weeks trying to figure out why the service no longer works at all because you forgot you made the copy :-).
On Tue, 19 Dec 2017 07:25:46 -0500 Tom Horsley wrote:
On Tue, 19 Dec 2017 20:16:51 +0800, Ed Greshko wrote:
FWIW, are you aware that you shouldn't make changes to /usr/lib/systemd/system/* files? These can be overwritten by updates.
Yes yes, but we are debugging.
If you want to make changes you should create a file with the same name in /etc/systemd/system.
There is now (at least since F21) a better way: search for "drop-in" in the systemd.unit man page; see below.
And when you do that, and they utterly redesign the original service file, you spend weeks trying to figure out why the service no longer works at all because you forgot you made the copy :-).
Except that if you look at the output of "systemctl cat X.service", it shows clearly what you did. Example:
# /usr/lib/systemd/system/X.service
...

# /etc/systemd/system/X.service.d/Y.conf
Environment="OPTIONS=..."
This is far better and easier to maintain than patching /etc/sysconfig/X config files that are protected against updates by RPM.
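Applied to the unit this thread is about, dropping the -s flag through such a drop-in (rather than editing the file under /usr/lib) would look roughly like this; the file name no-s.conf is arbitrary, and the empty ExecStart= line is needed to clear the vendor-defined entry before redefining it:

mkdir -p /etc/systemd/system/NetworkManager-wait-online.service.d
cat > /etc/systemd/system/NetworkManager-wait-online.service.d/no-s.conf <<'EOF'
[Service]
# reset the vendor ExecStart, then redefine it without -s
ExecStart=
ExecStart=/usr/bin/nm-online -q --timeout=30
EOF
systemctl daemon-reload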
On Tue, 2017-12-19 at 07:16 -0500, Sam Varshavchik wrote:
Gordon Messmer writes:
On 12/18/2017 05:52 AM, Sam Varshavchik wrote:
Time IP addresses ======== ================== 08:35:34 08:35:35 192.168.0.1
At 08:35:34 the server had no IP addresses
Well, it probably had 127.0.0.1, which brings into question what the complete state of the network was.
Could you arrange to execute "ip addr show | logger" in your unfrak script? That way we get all of the interfaces and all of the addresses regardless of family.
Could you also see if removing the "-s" flag from /usr/lib/systemd/system/NetworkManager-wait-online.service changes the behavior of the system?
Running just "ip addr show | logger" was not conclusive. Looks like the overhead of doing so delays things long enough so even the first time this actually runs all the network interfaces have their IP addresses already assigned.
Removing the -s option from actually makes things worse. The script has to wait noticably longer before all IP addresses are assigned:
Subject: systemd network initialization unfrak report
Time IP addresses ======== ================== 06:52:56 06:52:57 06:52:58 06:52:59 192.168.0.1
Usually it's 1 or 2 seconds. Without the -s option it's 4-5 seconds.
This seems consistent with the description of what the -s option does, from the man page. The way I parse its man page entry is that the -s option actually waits for more things to happen, before it's done. So removing that option makes NetworkManager's definition of when things are online occur much earlier.
This is confirmed by running "ip addr show | logger" without -s option. This produces some useful results. This time, the first time "ip addr show" runs it's early enough so that the network is not fully initialized.
syslog shows two runs of "ip addr show", showing no IP addresses configured on one of the two network interfaces. The 2nd network interface already has its IP addresses assigned. This is followed by some messages from NetworkManager, then another run of "ip addr show", showing all network interfaces with assigned IP addresses.
Both eno1 and eno2 have statically assigned IP addresses in /etc/sysconfig/network-scripts.
Dec 19 06:57:09 shorty systemd[1]: Starting Unfrak systemd network startup...
Dec 19 06:57:09 shorty systemd[1]: iscsi.service: Unit cannot be reloaded because it is inactive.
Dec 19 06:57:09 shorty root[1400]: 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
Dec 19 06:57:09 shorty root[1400]: link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Dec 19 06:57:09 shorty root[1400]: inet 127.0.0.1/8 scope host lo
Dec 19 06:57:09 shorty root[1400]: valid_lft forever preferred_lft forever
Dec 19 06:57:09 shorty root[1400]: inet6 ::1/128 scope host
Dec 19 06:57:09 shorty root[1400]: valid_lft forever preferred_lft forever
Dec 19 06:57:09 shorty root[1400]: 2: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
Dec 19 06:57:09 shorty root[1400]: link/ether 0c:c4:7a:32:c1:82 brd ff:ff:ff:ff:ff:ff
Dec 19 06:57:09 shorty root[1400]: 3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
Dec 19 06:57:09 shorty root[1400]: link/ether 0c:c4:7a:32:c1:83 brd ff:ff:ff:ff:ff:ff
Dec 19 06:57:09 shorty root[1400]: inet 68.166.206.83/29 brd 68.166.206.87 scope global eno2
Dec 19 06:57:09 shorty root[1400]: valid_lft forever preferred_lft forever
Dec 19 06:57:09 shorty root[1400]: inet 68.166.206.82/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:09 shorty root[1400]: valid_lft forever preferred_lft forever
Dec 19 06:57:09 shorty root[1400]: inet 68.166.206.84/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:09 shorty root[1400]: valid_lft forever preferred_lft forever
Dec 19 06:57:09 shorty root[1400]: inet 68.166.206.85/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:09 shorty root[1400]: valid_lft forever preferred_lft forever
Dec 19 06:57:09 shorty root[1400]: inet 68.166.206.86/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:09 shorty root[1400]: valid_lft forever preferred_lft forever
Dec 19 06:57:09 shorty root[1400]: inet6 fe80::ec4:7aff:fe32:c183/64 scope link tentative
Dec 19 06:57:09 shorty root[1400]: valid_lft forever preferred_lft forever
Dec 19 06:57:10 shorty root[1403]: 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
Dec 19 06:57:10 shorty root[1403]: link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Dec 19 06:57:10 shorty root[1403]: inet 127.0.0.1/8 scope host lo
Dec 19 06:57:10 shorty root[1403]: valid_lft forever preferred_lft forever
Dec 19 06:57:10 shorty root[1403]: inet6 ::1/128 scope host
Dec 19 06:57:10 shorty root[1403]: valid_lft forever preferred_lft forever
Dec 19 06:57:10 shorty root[1403]: 2: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
Dec 19 06:57:10 shorty root[1403]: link/ether 0c:c4:7a:32:c1:82 brd ff:ff:ff:ff:ff:ff
Dec 19 06:57:10 shorty root[1403]: 3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
Dec 19 06:57:10 shorty root[1403]: link/ether 0c:c4:7a:32:c1:83 brd ff:ff:ff:ff:ff:ff
Dec 19 06:57:10 shorty root[1403]: inet 68.166.206.83/29 brd 68.166.206.87 scope global eno2
Dec 19 06:57:10 shorty root[1403]: valid_lft forever preferred_lft forever
Dec 19 06:57:10 shorty root[1403]: inet 68.166.206.82/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:10 shorty root[1403]: valid_lft forever preferred_lft forever
Dec 19 06:57:10 shorty root[1403]: inet 68.166.206.84/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:10 shorty root[1403]: valid_lft forever preferred_lft forever
Dec 19 06:57:10 shorty root[1403]: inet 68.166.206.85/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:10 shorty root[1403]: valid_lft forever preferred_lft forever
Dec 19 06:57:10 shorty root[1403]: inet 68.166.206.86/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:10 shorty root[1403]: valid_lft forever preferred_lft forever
Dec 19 06:57:10 shorty root[1403]: inet6 fe80::ec4:7aff:fe32:c183/64 scope link tentative
Dec 19 06:57:10 shorty root[1403]: valid_lft forever preferred_lft forever
Dec 19 06:57:10 shorty systemd-networkd[955]: eno2: Gained IPv6LL
Dec 19 06:57:10 shorty NetworkManager[956]: <info> [1513684630.6860] manager: startup complete
Dec 19 06:57:11 shorty kernel: e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 19 06:57:11 shorty kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eno1: link becomes ready
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4385] device (eno1): link connected
Dec 19 06:57:11 shorty systemd-networkd[955]: eno1: Gained carrier
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4390] device (eno1): state change: unavailable -> disconnected (reason 'carrier-changed', internal state 'managed')
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4396] policy: auto-activating connection 'lan0'
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4405] device (eno1): Activation: starting connection 'lan0' (d1a1ee90-f006-43bb-9cbf-175ad32f6565)
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4406] device (eno1): state change: disconnected -> prepare (reason 'none', internal state 'managed')
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4410] device (eno1): state change: prepare -> config (reason 'none', internal state 'managed')
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4436] device (eno1): state change: config -> ip-config (reason 'none', internal state 'managed')
Dec 19 06:57:11 shorty named[1035]: listening on IPv4 interface eno1, 192.168.0.1#53
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4464] device (eno1): state change: ip-config -> ip-check (reason 'none', internal state 'managed')
Dec 19 06:57:11 shorty nm-dispatcher[990]: req:5 'pre-up' [eno1]: new request (1 scripts)
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4509] device (eno1): state change: ip-check -> secondaries (reason 'none', internal state 'managed')
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4511] device (eno1): state change: secondaries -> activated (reason 'none', internal state 'managed')
Dec 19 06:57:11 shorty root[1415]: 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
Dec 19 06:57:11 shorty root[1415]: link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Dec 19 06:57:11 shorty root[1415]: inet 127.0.0.1/8 scope host lo
Dec 19 06:57:11 shorty root[1415]: valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty root[1415]: inet6 ::1/128 scope host
Dec 19 06:57:11 shorty root[1415]: valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty root[1415]: 2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
Dec 19 06:57:11 shorty root[1415]: link/ether 0c:c4:7a:32:c1:82 brd ff:ff:ff:ff:ff:ff
Dec 19 06:57:11 shorty root[1415]: inet 192.168.0.1/24 brd 192.168.0.255 scope global eno1
Dec 19 06:57:11 shorty root[1415]: valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty root[1415]: inet6 fe80::ec4:7aff:fe32:c182/64 scope link tentative
Dec 19 06:57:11 shorty root[1415]: valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty root[1415]: 3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
Dec 19 06:57:11 shorty root[1415]: link/ether 0c:c4:7a:32:c1:83 brd ff:ff:ff:ff:ff:ff
Dec 19 06:57:11 shorty root[1415]: inet 68.166.206.83/29 brd 68.166.206.87 scope global eno2
Dec 19 06:57:11 shorty root[1415]: valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty root[1415]: inet 68.166.206.82/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:11 shorty root[1415]: valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.5589] device (eno1): Activation: successful, device activated.
Dec 19 06:57:11 shorty root[1415]: inet 68.166.206.84/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:11 shorty root[1415]: valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty root[1415]: inet 68.166.206.85/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:11 shorty root[1415]: valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty root[1415]: inet 68.166.206.86/29 brd 68.166.206.87 scope global secondary eno2
Dec 19 06:57:11 shorty root[1415]: valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty root[1415]: inet6 fe80::ec4:7aff:fe32:c183/64 scope link
Dec 19 06:57:11 shorty root[1415]: valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty nm-dispatcher[990]: req:6 'up' [eno1]: new request (7 scripts)
Dec 19 06:57:11 shorty systemd[1]: Started Unfrak systemd network startup.
Note that "ip addr show" ran as part of the script that has a dependency on nm-online.
After reversing all changes, putting the -s option back in, and not running "ip addr", repeated boots show the status quo restored. The presence of the -s option reduces the additional time the script needs to wait for all IP addresses to be assigned to 1 or 2 seconds. But it still has to wait, since nm-online -s -q returns too early.
One question that has been bugging me for some time now: I wonder if nm-online checks for IP addressing only when the "require IPv4 addressing for this connection to complete" option in nm-connection-editor (which I guess results in IPV4_FAILURE_FATAL=yes being set in the config file for the interface) is set. Do you have IPV4_FAILURE_FATAL set in the /etc/sysconfig/network-scripts/ifcfg-xxx file? If I find some time, I will play around with it to see whether this option solves the problems with a switch port that takes time to go into forwarding state.... I may also have a look at the sources of nm-online to see if it even looks at this parameter.
BR, Louis
On Tue, 2017-12-19 at 07:16 -0500, Sam Varshavchik wrote:
Gordon Messmer writes:
On 12/18/2017 05:52 AM, Sam Varshavchik wrote:
Time     IP addresses
======== ==================
08:35:34
08:35:35 192.168.0.1
At 08:35:34 the server had no IP addresses
Well, it probably had 127.0.0.1, which brings into question what the complete state of the network was.
Could you arrange to execute "ip addr show | logger" in your unfrak script? That way we get all of the interfaces and all of the addresses regardless of family.
Could you also see if removing the "-s" flag from /usr/lib/systemd/system/NetworkManager-wait-online.service changes the behavior of the system?
Running just "ip addr show | logger" was not conclusive. It looks like the overhead of doing so delays things long enough that even the first time this actually runs, all the network interfaces already have their IP addresses assigned.
Removing the -s option from the service file actually makes things worse. The script has to wait noticeably longer before all IP addresses are assigned:
Subject: systemd network initialization unfrak report
Time     IP addresses
======== ==================
06:52:56
06:52:57
06:52:58
06:52:59 192.168.0.1
Usually it's 1 or 2 seconds. Without the -s option it's 4-5 seconds.
This seems consistent with the description of what the -s option does, from the man page. The way I parse its man page entry is that the -s option actually waits for more things to happen, before it's done. So removing that option makes NetworkManager's definition of when things are online occur much earlier.
This is confirmed by running "ip addr show | logger" without the -s option. This produces some useful results. This time, the first time "ip addr show" runs is early enough that the network is not fully initialized.
syslog shows two runs of "ip addr show" with no IP addresses configured on one of the two network interfaces; the second network interface already has its IP addresses assigned. This is followed by some messages from NetworkManager, then another run of "ip addr show" showing all network interfaces with assigned IP addresses.
systemd[1]: Started Unfrak systemd network startup.
Note that "ip addr show" ran as part of the script that has a dependency on nm-online.
After reversing all changes, putting the -s option back in, and not running "ip addr", repeated boots show the status quo restored. The presence of the -s option reduces the additional time the script needs to wait for all IP addresses to be assigned to 1 or 2 seconds. But it still has to wait, since nm-online -s -q returns too early.
I have read the nm-online man page about 10 times and I am still not clear what it is telling me.
If your interpretation is correct I do not understand how removing the -s option solves my NFS mount problem.
Aside: I have confirmed on a second machine that, with the -s option present, the NFS mount sometimes fails.
First 10 boots: no mount failures. Aha, I thought! Second 10 boots: 2 mount failures.
Louis Lagendijk writes:
One question that has been bugging me for some time now: I wonder if nm-online checks for IP addressing only when the "require IPv4 addressing for this connection to complete" option in nm-connection-editor (which I guess results in IPV4_FAILURE_FATAL=yes being set in the config file for the interface) is set. Do you have IPV4_FAILURE_FATAL set in the /etc/sysconfig/network-scripts/ifcfg-xxx file?
Both ifcfg files have
IPV4_FAILURE_FATAL=no
explicitly set.
Running diff on them shows pretty much the only differences you would expect: NAME, UUID, HWADDR, IPADDR*, PREFIX*, GATEWAY, and DNS* are different. In terms of everything else, the config is the same.
Dr J Austin writes:
I have read the nm-online man page about 10 times and I am still not clear what it is telling me.
The way I parse it, without -s it waits until at least one network connection is present. With the option, it should wait until all connections are up. The man page starts by saying
"When run, nm-online waits until NetworkManager reports an active connection, or specified timeout expires."
This seems fairly clear. Then, the -s option is described thusly:
"Wait for NetworkManager startup to complete, rather than waiting for network connectivity specifically. Startup is considered complete once NetworkManager has activated (or attempted to activate) every auto-activate connection which is available given the current network state."
The "every auto-activate connection" means to me: every network connection.
If your interpretation is correct I do not understand how removing the -s option solves my NFS mount problem.
I don't know. I can only report what I see.
On 12/19/2017 11:27 AM, Sam Varshavchik wrote:
"Wait for NetworkManager startup to complete, rather than waiting for network connectivity specifically. Startup is considered complete once NetworkManager has activated (or attempted to activate) every auto-activate connection which is available given the current network state."
The "every auto-activate connection" means to me: every network connection.
Unless, of course, you have a potential network connection that isn't started at boot. For most of us, it's a distinction without a difference, but there are cases where it may matter.
Sam,
Here's my understanding of the situation:
* a network interface can either be active/started -- i.e., it is in the "up" state, without necessarily having a network address -- or connected -- i.e., having an address; and
* a system can have multiple network interfaces.
I interpret "-s" to mean "all interfaces are active but do not necessarily have an address or a default route". This means that NM will return success once each interface is activated, but does not actually require that the system can reach the outside world.
Without "-s" it means "there is at least one interface with an address and default route".
The problem with the former is that it can return before your system can reach the outside world (e.g., interface is up but doesn't have a DHCP-assigned address). The problem with the latter is that if your system has multiple interfaces, as soon as *one* of them has an address and a route, NM says all is well and continues, *even if that interface can't reach the outside world*. I ran into the latter when my Ceton TV capture card -- which used a virtual network interface -- would come up and NM would say, "this 192.168.200.1 device has an address and a route, I'm done!" and continue along before my actual network interface got an address. This caused issues with various services, and -- if you check the mythtv-users archives -- you'll see that the systemd people's response was "working as intended, that's a bug in mythtv and the other pieces of software which don't adapt to network interfaces which appear and disappear randomly."
-justin
On Tue, Dec 19, 2017 at 2:27 PM, Sam Varshavchik mrsam@courier-mta.com wrote:
Dr J Austin writes:
I have read the nm-online man page about 10 times and I am still not clear what it is telling me.
The way I parse it, without -s it waits until at least one network connection is present. With the option, it should wait until all connections are up. The man page starts by saying
"When run, nm-online waits until NetworkManager reports an active connection, or specified timeout expires."
This seems fairly clear. Then, the -s option is described thusly:
"Wait for NetworkManager startup to complete, rather than waiting for network connectivity specifically. Startup is considered complete once NetworkManager has activated (or attempted to activate) every auto-activate connection which is available given the current network state."
The "every auto-activate connection" means to me: every network connection.
If your interpretation is correct I do not understand how removing the -s option solves my NFS mount problem.
I don't know. I can only report what I see.
On Tue, 2017-12-19 at 14:14 -0500, Sam Varshavchik wrote:
Louis Lagendijk writes:
One question that has been bugging me for some time now: I wonder if nm-online checks for IP addressing only when the "require IPv4 addressing for this connection to complete" option in nm-connection-editor (which I guess results in IPV4_FAILURE_FATAL=yes being set in the config file for the interface) is set. Do you have IPV4_FAILURE_FATAL set in the /etc/sysconfig/network-scripts/ifcfg-xxx file?
Both ifcfg files have
IPV4_FAILURE_FATAL=no
explicitly set.
Running diff on them shows pretty much the only differences you would expect: NAME, UUID, HWADDR, IPADDR*, PREFIX*, GATEWAY, and DNS* are different. In terms of everything else, the config is the same.
What happens if you change this to yes? Does nm-online then wait for the connection to be up?
BR, Louis
Joe Zeff writes:
On 12/19/2017 11:27 AM, Sam Varshavchik wrote:
"Wait for NetworkManager startup to complete, rather than waiting for network connectivity specifically. Startup is considered complete once NetworkManager has activated (or attempted to activate) every auto-activate connection which is available given the current network state."
The "every auto-activate connection" means to me: every network connection.
Unless, of course, you have a potential network connection that isn't started at boot. For most of us, it's a distinction without a difference, but there are cases where it may matter.
I just checked, and both of my ports have ONBOOT=yes. But given this, I agree: as documented, this would exclude all interfaces that do not have ONBOOT=yes set. Still, nm-online proclaims "Mission Accomplished!" before both of them are actually up.
Justin Moore writes:
I interpret "-s" to mean "all interfaces are active but do not necessarily have an address or a default route". This means that NM will return success
What does it mean for an interface that has a static IP address explicitly specified in its ifcfg file to be "active", but not have that IP address?
The problem with the former is that it can return before your system can reach the outside world (e.g., interface is up but doesn't have a DHCP-
The interface in question has a static IP address. DHCP is not in the picture.
But an argument can be made that even if a network interface is configured via DHCP, it's not "active" until it succeeds in acquiring an IP address. But that's beside the point, because these are simple, garden variety, static IP addresses. Can't be any simpler than that.
Furthermore, I distinctly recall incidents where something got fubared with my DHCP server, and I had to drink a cup of coffee before other servers finished booting.
It's been my experience that NetworkManager most definitely puts the brakes on booting the server if dhclient can't get an IP address.
So, given that it throws a tantrum if dhclient can't get an IP address, I'm quite puzzled that it also just blows past a network port with a static IP address, before it is fully configured with that IP address. This does not compute.
assigned address). The problem with the latter is that if your system has multiple interfaces, as soon as *one* of them has an address and a route, NM says all is well and continues, *even if that interface can't reach the
That's not how the nm-online man page describes the -s option. If that's not what the -s option does, then I have no idea what it's supposed to be doing.
along before my actual network interface got an address. This caused issues with various services, and -- if you check the mythtv-users archives -- you'll see that the systemd people's response was "working as intended, that's a bug in mythtv and the other pieces of software which don't adapt to network interfaces which appear and disappear randomly."
Well, I wouldn't really expect any less flippant or arrogant response there, this does not surprise me. But that's not important.
What is important is that it's an inescapable conclusion that the only workable solution for this mess is something on the order of my script that routes around the damage and beats systemd into submission. It's horribly ugly, and I'm embarrassed that I actually had to write such a clunker, but it seems to actually make network-online.target do what it's supposed to be doing in the first place, which is apparently too complicated for either systemd or NetworkManager to do correctly on their own.
So be it. Life's too short.
Louis Lagendijk writes:
Running diff on them shows pretty much the only differences you would expect: NAME, UUID, HWADDR, IPADDR*, PREFIX*, GATEWAY, and DNS* are different. In terms of everything else, the config is the same.
What happens if you change this to yes? Does nm-online then wait for the connection to be up?
Well, I could, perhaps, test that.
However, /usr/share/doc/initscripts/sysconfig.txt claims that the IPV4_FAILURE_FATAL setting is used only with BOOTPROTO=dhcp, and these interfaces all have static IP addresses, so I wouldn't expect any difference to result from this.
Still, it's possible that NM hijacked something that's documented to be used only with DHCP and recycled it for something completely unrelated. It's possible, I guess. But I've already screwed around with this for much, much longer than I really wanted. There's definitely a breakdown somewhere, with either systemd or NetworkManager, when it comes to delivering what network-online.target promised to deliver. If someone who's more knowledgeable about these packages can figure out the fail, and a tentative fix is proposed, I'll be interested in poking at it with a stick, trying to test it, and seeing if it works.
But it looks to me like this simply cannot be made to work in the current state of the world, and changing random options in random configuration files will not be very productive. For now, the most productive path forward is to use a hacky workaround script that makes it possible to actually start services that require a working IP address only after an IP address is, well, actually working.
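(Sam never posts the script itself, but the general shape of what's being described is not hard to sketch. A minimal, hypothetical drop-in of that kind: the file name, the 60-second cap, and the interface/address are assumptions; the address and interface are simply the ones from the logs above. Because NetworkManager-wait-online.service is a oneshot unit, its job does not complete, and nothing ordered after network-online.target starts, until ExecStartPost= also finishes.)

# /etc/systemd/system/NetworkManager-wait-online.service.d/wait-for-ip.conf
# Hypothetical sketch: after nm-online returns, keep polling until the
# static address actually appears on eno1, for at most 60 seconds.
[Service]
ExecStartPost=/bin/sh -c 'i=0; until ip -4 addr show dev eno1 | grep -q "inet 192.168.0.1/"; do i=$((i+1)); [ "$i" -gt 60 ] && exit 1; sleep 1; done'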
Thanks, Sam, that looks like very useful information. The logs you posted indicate that one interface, eno1, had no link when "ip addr show" ran, after NetworkManager reported itself online. This seems consistent with nm-online's man page, which indicates that startup is complete when all connections are available "given the current network state."
The old "network" service would simply set the interface state to "up" regardless of whether or not there was a link, and further it had a LINKDELAY setting to ensure that the system would pause some fixed time (the admin's best guess, I suppose) before it continued.
NetworkManager, on the other hand, reacts to carrier changes and until two weeks ago, it had a hard-coded global timeout of 5 seconds for link availability on startup. A recent change has, as best I can tell, added a flag to address this specific problem:
https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=2becc0...
See the list of bugs in the commit message...
That's not entirely helpful, because that commit doesn't seem to have made it into Fedora's package yet. It should be in 1.10.2, whenever that makes its way through. I don't see such an update pending in bodhi, so filing an RFE in bugzilla might be helpful...
On 12/19/2017 04:16 AM, Sam Varshavchik wrote:
Dec 19 06:57:09 shorty root[1400]: 2: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
Dec 19 06:57:09 shorty root[1400]: link/ether 0c:c4:7a:32:c1:82 brd ff:ff:ff:ff:ff:ff
...
Dec 19 06:57:10 shorty root[1403]: 2: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
Dec 19 06:57:10 shorty root[1403]: link/ether 0c:c4:7a:32:c1:82 brd ff:ff:ff:ff:ff:ff
...
Dec 19 06:57:10 shorty NetworkManager[956]: <info> [1513684630.6860] manager: startup complete
Dec 19 06:57:11 shorty kernel: e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 19 06:57:11 shorty kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eno1: link becomes ready
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4385] device (eno1): link connected
Dec 19 06:57:11 shorty systemd-networkd[955]: eno1: Gained carrier
Dec 19 06:57:11 shorty NetworkManager[956]: <info> [1513684631.4390] device (eno1): state change: unavailable -> disconnected (reason 'carrier-changed', internal state 'managed')
...
Dec 19 06:57:11 shorty root[1415]: 2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
Dec 19 06:57:11 shorty root[1415]: link/ether 0c:c4:7a:32:c1:82 brd ff:ff:ff:ff:ff:ff
Dec 19 06:57:11 shorty root[1415]: inet 192.168.0.1/24 brd 192.168.0.255 scope global eno1
Dec 19 06:57:11 shorty root[1415]: valid_lft forever preferred_lft forever
Dec 19 06:57:11 shorty root[1415]: inet6 fe80::ec4:7aff:fe32:c182/64 scope link tentative
Dec 19 06:57:11 shorty root[1415]: valid_lft forever preferred_lft forever
Gordon Messmer writes:
Thanks, Sam, that looks like very useful information. The logs you posted indicate that one interface, eno1, had no link when "ip addr show" ran, after NetworkManager reported itself online. This seems consistent with nm-online's man page, which indicates that startup is complete when all connections are available "given the current network state."
The old "network" service would simply set the interface state to "up" regardless of whether or not there was a link, and further it had a LINKDELAY setting to ensure that the system would pause some fixed time (the admin's best guess, I suppose) before it continued.
I follow this, mostly, but...
The big picture is that many services expect to be able to bind to some preconfigured IP address. If this was just, say, privoxy, you could call it an outlier. But it's not just privoxy. Also openssh, and in fact openssh was so badly affected that it doesn't even bother having a dependency on network-online.target; it just hooks up to network.target, and the service file has a hardcoded retry interval of 40 seconds to try to restart the service.
Pretty sure that innd will also barf, although I'm not running it right now.
It is also quite common to preconfigure well-known services to listen on specific IP addresses only, for security reasons, or for other policy reasons. HTTP (apache), SMTP (sendmail, postfix, etc…), IMAP. It is all quite common, and reasonable, to configure them to accept incoming connections on specific network ports only. Privoxy is a special case. You have to make it listen only on internal IP addresses; otherwise it's a gaping security hole.
The bottom line is that it is not unreasonable to preconfigure services to bind to specific, known, IP addresses; and furthermore to be able to reliably start them at system boot when those fixed, static, IP addresses are available. Things worked like that for a very, very long time.
That's the big picture. And it looks like it's completely impossible to do that, in stock Fedora. Which is a shame. Whatever the actual reasons for this may be, I think they're purely academic. It should be possible to do this without pulling one's hair out, and without resorting to various workarounds.
On 12/19/2017 04:46 PM, Sam Varshavchik wrote:
That's the big picture. And it looks like it's completely impossible to do that, in stock Fedora.
Right now, yes. And that's completely and entirely down to NetworkManager bringing interfaces up in an event-driven fashion, when link is detected. Nothing at all to do with systemd. It seems to hit relatively few people, because it appears to only be a problem when the NIC and the switch don't establish an active link for much longer than 5 seconds. It's kind of unreasonable for the link to take that long, IMHO. But, reasonable or not, it happens with some managed switches. I think something similar is the root of Francis' NFS mounting problems. In his case, the link comes up and addresses are configured, but the switch won't actually forward packets for a while, probably due to Spanning Tree support.
All this stuff is a PITA, but that's life with managed switches. A lot of software is written and tested with network equipment where timing issues and events are less of an issue. It took way too long for NetworkManager to address the long link delay that appears to be causing your issue, but the code's in place. We just need to push the Fedora maintainer to update the package so that you and others can actually verify the fix.
I've filed an RFE:
Gordon Messmer writes:
On 12/19/2017 04:46 PM, Sam Varshavchik wrote:
That's the big picture. And it looks like it's completely impossible to do that, in stock Fedora.
Right now, yes. And that's completely and entirely down to NetworkManager bringing interfaces up in an event-driven fashion, when link is detected. Nothing at all to do with systemd. It seems to hit relatively few people, because it appears to only be a problem when the NIC and the switch don't establish an active link for much longer than 5 seconds. It's kind of unreasonable for the link to take that long, IMHO. But, reasonable or not, it happens with some managed switches. I think something similar is the
No managed switches here. Just some dumb switch with a bunch of stuff plugged into it.
On 20 Dec 2017 04:54, "Gordon Messmer" gordon.messmer@gmail.com wrote:
On 12/19/2017 04:46 PM, Sam Varshavchik wrote:
That's the big picture. And it looks like it's completely impossible to do that, in stock Fedora.
Right now, yes. And that's completely and entirely down to NetworkManager bringing interfaces up in an event-driven fashion, when link is detected. Nothing at all to do with systemd.
You're right this has nothing to do with systemd ... and it's honestly a difficult problem to solve within NM without risking a system that fails to boot at all without a network detected online.
It is worth looking into how we might improve the nm-online behaviour though.
In the meanwhile, when there are services that require binding to a specific address, and it's possible that address hasn't yet arrived on the system, there is a better way to handle it, one which is well tested as it's frequently used with high-availability software such as keepalived ...
There is an option when creating a socket, called FREEBIND (IP_FREEBIND), which allows binding to an address not present on the system.
This is required to be set during the actual binding of the socket by the application.
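To make "set during the actual binding" concrete, here is a minimal sketch in C. The 192.168.0.1:8000 endpoint is just the example from earlier in the thread, and the error handling is trimmed; this is illustrative, not anyone's actual server code.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int one = 1;
    /* Must happen before bind(): lets bind() succeed even if no
       interface has 192.168.0.1 assigned yet. Linux-specific. */
    if (setsockopt(fd, IPPROTO_IP, IP_FREEBIND, &one, sizeof(one)) < 0)
        perror("setsockopt(IP_FREEBIND)");

    struct sockaddr_in sin;
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(8000);
    inet_pton(AF_INET, "192.168.0.1", &sin.sin_addr);

    if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
        perror("bind");
        return 1;
    }
    puts("bound 192.168.0.1:8000 with IP_FREEBIND");
    close(fd);
    return 0;
}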
For applications using a systemd socket this is as simple as setting FreeBind=true in the .socket file.
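For instance, a hypothetical .socket fragment of that kind (privoxy does not actually ship socket activation, so the scenario is illustrative only; FreeBind= is the documented systemd.socket(5) directive):

# hypothetical something.socket
[Socket]
ListenStream=192.168.0.1:8000
FreeBind=true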
For other applications there may be a configuration option or command argument to enable it. Consider filing upstream bugs where you want this to be possible with an application. For example, haproxy has this as optional behaviour, IIRC.
For applications where this is not an option, there is a sledgehammer approach at the kernel level to enable this behaviour on all socket binds, via the sysctl
net.ipv4.ip_nonlocal_bind
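A persistent version of that sledgehammer might look like the following (the file name is made up; the sysctl itself is real):

# /etc/sysctl.d/90-nonlocal-bind.conf -- hypothetical file name
# Allow every process to bind() to addresses not present on the system.
net.ipv4.ip_nonlocal_bind = 1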
Also consider whether you *really need* that specific IP bind, as binding to 0.0.0.0 or :: and using firewall rules to allow or prevent access per interface will never face this problem.
On Tue, 2017-12-19 at 17:37 -0500, Sam Varshavchik wrote:
Justin Moore writes:
I interpret "-s" to mean "all interfaces are active but do not necessarily have an address or a default route". This means that NM will return success
What does it mean for an interface that has a static IP address explicitly specified in its ifcfg file to be "active", but not have that IP address?
The problem with the former is that it can return before your system can reach the outside world (e.g., interface is up but doesn't have a DHCP-
The interface in question has a static IP address. DHCP is not in the picture.
But an argument can be made that even if a network interface is configured via DHCP, it's not "active" until it succeeds in acquiring an IP address. But that's beside the point, because these are simple, garden variety, static IP addresses. Can't be any simpler than that.
Furthermore, I distinctly recall incidents where something got fubared with my DHCP server, and I had to drink a cup of coffee before other servers finished booting.
It's been my experience that NetworkManager most definitely puts the brakes on booting the server if dhclient can't get an IP address.
So, given that it throws a tantrum if dhclient can't get an IP address, I'm quite puzzled that it also just blows past a network port with a static IP address, before it is fully configured with that IP address. This does not compute.
assigned address). The problem with the latter is that if your system has multiple interfaces, as soon as *one* of them has an address and a route, NM says all is well and continues, *even if that interface can't reach the
That's not how the nm-online man page describes the -s option. If that's not what the -s option does, then I have no idea what it's supposed to be doing.
along before my actual network interface got an address. This caused issues with various services, and -- if you check the mythtv-users archives -- you'll see that the systemd people's response was "working as intended, that's a bug in mythtv and the other pieces of software which don't adapt to network interfaces which appear and disappear randomly."
Well, I wouldn't really expect any less flippant or arrogant response there, this does not surprise me. But that's not important.
What is important is that it's an inescapable conclusion that the only workable solution for this mess is something on the order of my script that routes around the damage and beats systemd into submission. It's horribly ugly, and I'm embarrassed that I actually had to write such a clunker, but it seems to actually make network-online.target do what it's supposed to be doing in the first place, which is apparently too complicated for either systemd or NetworkManager to do correctly on their own.
So be it. Life's too short.
I think I have tracked down my problem with the help of nm-wait-online-routes-gw.conf !
I have no explanation as to why this has started happening after recent updates.
I had to turn the .conf into a script to get it to work - finger trouble I expect.

ja@hayling NetworkManager-wait-online.service.d 4$ cat script.conf
[Unit]
[Service]
ExecStart=/usr/bin/ja_nm-online.sh
[Install]

ja@hayling NetworkManager-wait-online.service.d 6$ cat /usr/bin/ja_nm-online.sh
#!/bin/bash
LF=$'\n'
ROUTES=" \
default \
148.197.29.0/24 \
"
GW=148.197.29.5    # Not a Gateway but my server
I believe my NFS "failures" are caused by the VMware network interfaces "sometimes" coming up before the "real" network interfaces. (VMware Workstation 14 uses its own DHCP server, not my DHCP server.) nm-online -s takes the vmware interfaces as a "working network". nm-online (no -s) does not.
This "boot" would probably have caused an NFS mount failure: just two interfaces up, before DHCP to my server has run.
Dec 20 13:35:50 hayling.jaa.org.uk NetworkManager[1035]: <info> [1513776950.6441] manager: startup complete
Dec 20 13:35:50 hayling.jaa.org.uk ja_nm-online.sh[1344]: 172.16.91.0/24 dev vmnet8 proto kernel scope link src 172.16.91.1
Dec 20 13:35:50 hayling.jaa.org.uk ja_nm-online.sh[1344]: 192.168.111.0/24 dev vmnet1 proto kernel scope link src 192.168.111.1
Dec 20 13:35:51 hayling.jaa.org.uk kernel: e1000e: enp0s31f6 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 20 13:35:51 hayling.jaa.org.uk kernel: IPv6: ADDRCONF(NETDEV_CHANGE): enp0s31f6: link becomes ready
...
Later all interfaces are up - after DHCP to my server has run
Dec 20 13:35:51 hayling.jaa.org.uk ja_nm-online.sh[1344]: default via 148.197.29.254 dev enp0s31f6 proto static metric 100
Dec 20 13:35:51 hayling.jaa.org.uk ja_nm-online.sh[1344]: 148.197.29.0/24 dev enp0s31f6 proto kernel scope link src 148.197.29.202 metric 100
Dec 20 13:35:51 hayling.jaa.org.uk ja_nm-online.sh[1344]: 172.16.91.0/24 dev vmnet8 proto kernel scope link src 172.16.91.1
Dec 20 13:35:51 hayling.jaa.org.uk ja_nm-online.sh[1344]: 192.168.111.0/24 dev vmnet1 proto kernel scope link src 192.168.111.1
Dec 20 13:35:51 hayling.jaa.org.uk ja_nm-online.sh[1344]: Network Manager Wait Online routes took 1 seconds
Dec 20 13:35:51 hayling.jaa.org.uk ja_nm-online.sh[1344]: Network Manager Wait Online gateway took 0 seconds
Dec 20 13:35:51 hayling.jaa.org.uk systemd[1]: Started Network Manager Wait Online.
Dec 20 13:35:51 hayling.jaa.org.uk audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-wait-online comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Dec 20 13:35:51 hayling.jaa.org.uk systemd[1]: Reached target Network is Online.
Dec 20 13:35:51 hayling.jaa.org.uk systemd[1]: Mounting /global...
On 12/20/2017 03:44 AM, Sam Varshavchik wrote:
No managed switches here. Just some dumb switch with a bunch of stuff plugged into it.
Surprising, but the outcome is the same. You have a switch and a NIC which, in concert, take a long time to negotiate a link, and that's a problem for NetworkManager.
Did you add yourself to the RFE I filed? You should be notified when an updated NM is available for testing.
Gordon Messmer writes:
On 12/20/2017 03:44 AM, Sam Varshavchik wrote:
No managed switches here. Just some dumb switch with a bunch of stuff plugged into it.
Surprising, but the outcome is the same. You have a switch and a NIC which, in concert, take a long time to negotiate a link, and that's a problem for NetworkManager.
I'm not sure I see why this has to be a factor.
I just did a little experiment. I took a different server with an unused port. I assigned an IP address to it, in its ifcfg file, then ifup-ed it. The result:
3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
    link/ether 00:30:48:fc:83:fb brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.1/24 brd 192.168.2.255 scope global eth1
       valid_lft forever preferred_lft forever
I then tested this using my script. I can bind it, without any issues. strace:
bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.2.1")}, 16) = 0
close(3)
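For reference, the sort of ifcfg file this experiment implies would look something like the following. This is a reconstruction from the values in the output above, not Sam's actual file; the IPV4_FAILURE_FATAL line echoes what he reported setting elsewhere in the thread.

# /etc/sysconfig/network-scripts/ifcfg-eth1 (reconstruction)
DEVICE=eth1
ONBOOT=yes
BOOTPROTO=none
IPADDR=192.168.2.1
PREFIX=24
IPV4_FAILURE_FATAL=no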
Now, perhaps if NetworkManager wants to wait a while until there's a carrier on that port, wonderful.
But why is there a problem with simply reading a configuration file that says: "assign this static IP address to the port", and then assigning this static IP address to the port? That's all anyone wants to do here. Nothing more. Everything will then boot normally.
There's such a basic, fundamental disconnect here, somewhere, that I'm not sure I can even describe it correctly. But to me, all of this seems to be a no-brainer. There's an IP address in the ifcfg file. nm-online simply needs to wait until this IP address is assigned to the network port. The End.
Hi.
On Thu, 21 Dec 2017 07:17:38 -0500 Sam Varshavchik wrote:
I just did a little experiment. I took a different server with an unused port. I assigned an IP address to it, in its ifcfg file, then ifup-ed it. The result:
3: eth1: <snip>
I then tested this using my script. I can bind it, without any issues.
Right, but in the case where you want to do an NFS mount, you had better wait for the switch and the NIC to finish negotiating (otherwise you may hit mount's timeout).
This should perhaps be seen as two different problems.
A way to minimize the risk of a failed NFS mount is to use one of the various existing automount mechanisms, but that may not be sufficient if another service (e.g., a mysql service having its database under NFS) requires the NFS mount.
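For example, the automount route would be an fstab entry along these lines (server name and option values hypothetical; x-systemd.automount and x-systemd.mount-timeout= are documented in systemd.mount(5), and defer the actual mount until first access instead of racing link negotiation at boot):

# /etc/fstab -- hypothetical entry; the mount happens on first access
server:/global  /global  nfs  _netdev,x-systemd.automount,x-systemd.mount-timeout=60  0 0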
Francis.Montagnac@inria.fr writes:
Hi.
On Thu, 21 Dec 2017 07:17:38 -0500 Sam Varshavchik wrote:
I just did a little experiment. I took a different server with an unused port. I assigned an IP address to it, in its ifcfg file, then ifup-ed it. The result:
3: eth1: <snip>
I then tested this using my script. I can bind it, without any issues.
Right, but in the case where you want to do an NFS mount, you had better wait for the switch and the NIC to finish negotiating (otherwise you may hit mount's timeout).
Which is fine. But nm-online claims that the port is up even before the IP address is assigned.
After the IP address is assigned, if nm-online wants to wait a while to see if the carrier comes up, nothing wrong with that. But nm-online seems to see that this port has a static IP address, and appears to decide that nothing needs to be done about it. Let's proceed and boot the rest of the system, which expects the IP addresses to be already configured. Who cares about the fact that the IP addresses that are basically hard-coded for the server are not even up yet? We say we're online; we must mean that we're online.
This should perhaps be seen as two different problems.
A way to minimize the risk of a failed NFS mount is to use one of the various existing automount mechanisms, but that may not be sufficient if another service (e.g., a mysql service having its database under NFS) requires the NFS mount.
Yes, that's a separate issue. The basic issue here: this port has fixed, static IP addresses set for it. That means that, at the very bare minimum, the IP addresses must be bound to the port before it is considered active. I have no idea what definition of "active" includes "the port doesn't yet have the fixed IP addresses it should have".
This has always been the case, for as long as I can remember. And even if the port is physically connected to a switch, and a switch is up, that switch may not be connected to anything else. So, the presence of a carrier, on the port, is utterly irrelevant when it comes to ports with static IP addresses.
Yes, when you have DHCP, you want to sense the connection, as an indication to try to acquire an IP address. But if the port has static IP addresses, take those bleeping IP addresses, add them to the port, and the port is up. As far as you're concerned, the port is up. That's it. The presence or the absence of a sensed connection means utterly NOTHING. Whether there's a physical connection there, or not, that port will always have those IP addresses, and absolutely NOTHING, whatsoever, will change that. That's how it always worked.
On 12/21/2017 04:17 AM, Sam Varshavchik wrote:
I'm not sure I see why this has to be a factor.
Because NetworkManager is an event-driven system. If an interface loses carrier for a defined period of time (5 seconds prior to the commit I've referenced), NM will remove the IP configuration from that interface. When an interface changes state and gets a link, NM will add the IP configuration to that interface. There isn't an option in NM to assign configurations to interfaces regardless of their state.
The fundamental problem is that NM had a hard-coded delay for link state changes. Prior to NM, some systems still had boot problems due to long link negotiation periods, as in Francis' case with NFS mount failures. The old "network" service had provisions for this. You could set a custom delay if you had an interface that took a long time to come up. NM didn't have this, which contributes both to your problem and to problems like Francis'.
But why is there a problem with simply reading a configuration file that says: "assign this static IP address to the port", and then assigning this static IP address to the port? That's all anyone wants to do here. Nothing more. Everything will then boot normally.
You can always suggest to the NM team that event-driven configuration is inappropriate in some instances, and that interfaces should be configured regardless of state.
There's such a basic, fundamental disconnect here, somewhere, that I'm not sure I can even describe it correctly. But to me, all of this seems to be a no-brainer. There's an IP address in the ifcfg file. nm-online simply needs to wait until this IP address is assigned to the network port. The End.
I understand what you're saying. It's just that static configuration isn't what NM does. The old network service does, and I sincerely hope no one ever suggests deprecating the network service unless NM offers that mode of operation, which it does not, today.
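For completeness: the commit referenced earlier in the thread appears to be what introduced a configurable carrier wait. On NM versions new enough to carry it, NetworkManager.conf(5) documents a carrier-wait-timeout key (in milliseconds) in [device] sections; verify the exact key and minimum version against your installed man page before relying on it. A hedged sketch:

# /etc/NetworkManager/conf.d/90-carrier-wait.conf -- hypothetical file name;
# assumes an NM new enough to document carrier-wait-timeout.
# Wait up to 30 s for carrier instead of the old hard-coded 5 s.
[device]
match-device=interface-name:eno1
carrier-wait-timeout=30000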
On Thu, 21 Dec 2017 09:28:27 -0800 Gordon Messmer wrote:
It's just that static configuration isn't what NM does.
Which always struck me as very odd considering that redhat sponsors most of this stuff and mostly makes money off servers which almost always have static configurations that never change once they are up and running.
You'd think they'd have wanted it to work well in a server environment as well as a laptop that flits from one starbucks to another.
On 12/21/2017 10:05 AM, Tom Horsley wrote:
It's just that static configuration isn't what NM does.
Which always struck me as very odd considering that redhat sponsors most of this stuff and mostly makes money off servers
Mostly, yes. They do have some desktop people, and sponsor GNOME development, which isn't server-oriented.
It's not that odd.
GNU/Linux needed NetworkManager. Static configuration is inadequate for most classes of devices, where an event-driven system is a better fit for the needs of those devices. If NM ever gets an entirely static mode for an interface, it might be suited to replace the "network" service. Until then, the network service is still available for systems where its mode of operation is a better fit than NM.
Gordon Messmer writes:
I understand what you're saying. It's just that static configuration isn't what NM does. The old network service does, and I sincerely hope no one ever suggests deprecating the network service unless NM offers that mode of operation, which it does not, today.
I'm trying to recall the history of introduction of NM. I'm pretty sure that, at some point, a new Fedora release made NM the default network configuration manager. I certainly do not recall explicitly enabling network manager for my existing servers.
And, in fact, NM will definitely own a particular interface unless one explicitly shoves in NM_CONTROLLED=NO, doesn't it? So, in fact, we have NM getting attached, by default, to all network interfaces unless one takes explicit steps to exclude it.
Maybe, perhaps, it would've made sense to do so only when NM was actually able to support all the functionality it was taking over? How's that for a crazy idea? Does anyone think I'm being completely off-base and unreasonable, expecting things to continue to work, as is, by default?
I don't mind if the network service got replaced by NM, or by some other infrastructure. Whatever it is, I'll learn to use it. I just don't want my stuff to break.
I don't want to have to figure out exactly what is now broken, and in what precise way, and then figure out how to hack my way around it.
And I am not singling out just NetworkManager. It's merely a symptom of the latest malaise that's infecting too many people – "ooh! shiny ball!" – getting fixated on the new, shiny, spinning ball, ripping out reliable, working functionality, and replacing it with a bunch of bells and whistles that don't quite replace what was there before.
On Sat, 23 Dec 2017 10:50:44 -0500 Sam Varshavchik wrote:
Maybe, perhaps, it would've made sense to do so only when NM was actually able to support all the functionality it was taking over? How's that for a crazy idea? Does anyone think I'm being completely off-base and unreasonable, expecting things to continue to work, as is, by default?
Welcome to the world of fedora! Fedora is supposed to be cutting edge, but it is too bad that often turns out to be broken edge instead :-).
It did seem to take about 10 years or so for all the features of the old network service to at least supposedly be supported in nm. I'd get a new fedora and say "Ah! Maybe nm works in this release", try it for a while, then go back to network (it is a blessing that they didn't eradicate network, as happens all too often with new shiny features).
Tom Horsley writes:
On Sat, 23 Dec 2017 10:50:44 -0500 Sam Varshavchik wrote:
Maybe, perhaps, it would've made sense to do so only when NM was actually able to support all the functionality it was taking over? How's that for a crazy idea? Does anyone think I'm being completely off-base and unreasonable, expecting things to continue to work, as is, by default?
Welcome to the world of fedora! Fedora is supposed to be cutting edge, but it is too bad that often turns out to be broken edge instead :-).
To me "cutting edge" means "new stuff that may or may not work". It doesn't mean "stuff that randomly breaks existing stuff that works". I think there's a lot of ground in between. It's not a nuanced distinction.
Let's draw an analogy with something else. Accelerated graphics. Folks who were around that era remember quite well how the things progressed, with the X RENDER extension, and such.
It was an entirely new X infrastructure. There were plenty of growing pains. Certain hardware worked. Then it broke, then it worked again. This went on for a while.
But throughout the evolution of the accelerated graphics infrastructure, traditional unaccelerated X worked flawlessly. You knew you could always, at least, turn off acceleration and have a working, functional desktop. You would no longer get to play with all the new toys, accelerated graphics and video, but you could, at least, always go back to the way things were before. It never broke. And it still works just fine. Existing stuff just did not break.
It did seem to take about 10 years or so for all the features of the old network service to at least supposedly be supported in nm. I'd get a new fedora and say "Ah! Maybe nm works in this release", try it for a while, then go back to network (it is a blessing that they didn't eradicate network, as happens all too often with new shiny features).
And see, that's the other thing. How about just making "all the features of the old network service" continue to putter along, in some fashion, while NM works on building its own new shiny ball? If NM is going to be the default, either make it a default for new installs and not upgrades; or make it fully support the existing functionality. That's all.
I originally had a lot of other things that I wrote down, but I just deleted it. It's just screaming into the air. Instead, I'll just sign off by presenting a small puzzle to solve.
Let's take two packages in Fedora: ncurses and pcre. Tons of stuff depends on them. dnf won't let me remove pcre, because dnf itself depends on it. And I was shocked to see that dnf won't let me remove ncurses either because sudo depends on it. Who knew. Despite the dominance of GUI desktops, you still have fundamental dependencies on plain old curses. I have no idea how much stuff really depends on these two. Must be a lot. Using rpm just to test the first level of dependencies, I count 37 for pcre. Just 3 for curses, but that's just first order dependencies.
I never, ever, read the same kind of bitching, the same kind of complaints, about ncurses and pcre breaking existing functionality. It occurred to me this week that there is a very obvious and fundamental reason why that is so, and it has absolutely nothing to do with any technical factor, or complexity. It's simpler, and more fundamental, than that. And that's my puzzle.
I'll just leave a small clue. Both of them have one general attribute in common, that's not present in NM, and pretty much every other package that's constantly bitched about. Figuring out what it is, will explain a lot. I'll just add a postscriptum that this is not an exclusive-or condition. Plenty of stuff has a solid reputation for stability, without having this particular aspect to it; but stuff that constantly triggers a non-stop source of complaints – with few exceptions you can always say one thing about it, as a whole.
I would have thought something with "wait online" in its name would actually wait for it to be on-line. To me, "on-line" means connected and operational. Starting up but not actually on-line, isn't on-line.
I find it interesting that freedesktop.org suggests that the entire purpose of NetworkManager-wait-online is to make sure that remote drives mount AFTER an IP address has been configured:
https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/
That's exactly the use case that's broken here.
By the way, my own experiments (on a lab of twenty-five FC26 workstations with dynamic IP addresses) seem to confirm that removing -s fixes the issue. Increasing the timeout from 30 to 45 was enough to keep NetworkManager-wait-online itself from timing out due to switch glitchiness.
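In drop-in form, the change being described would look like this (the stock unit runs nm-online with -s and a 30-second timeout, per the service file discussed earlier in the thread; the empty ExecStart= line is required to clear the original before replacing it):

# /etc/systemd/system/NetworkManager-wait-online.service.d/override.conf
# Remove -s and raise the timeout from 30 to 45 seconds.
[Service]
ExecStart=
ExecStart=/usr/bin/nm-online -q --timeout=45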
I agree that the nm-online manpage is confusing. My experiments seem to indicate that perhaps "active" means the network interface is "up" in the sense that the interface is listening for data-link layer (i.e. Ethernet) frames. It doesn't seem like it means the interface actually has an IP address configured.
Robert Marmorstein
On 12/23/2017 11:54 AM, Marmorstein, Robert wrote:
I agree that the nm-online manpage is confusing. My experiments seem to indicate that perhaps "active" means the network interface is "up" in the sense that the interface is listening for data-link layer (i.e. Ethernet) frames. It doesn't seem like it means the interface actually has an IP address configured.
File a bug report. Either it should wait for an IP address, or the documentation is wrong. One way or another, it's a bug.
Marmorstein, Robert writes:
NetworkManager-wait-online is to make sure that remote drives mount AFTER an IP address has been configured:
https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/
That's exactly the use case that's broken here.
By the way, my own experiments (on a lab of twenty-five FC26 workstations with dynamic IP addresses) seem to confirm that removing -s fixes the issue. Increasing the timeout from 30 to 45 was enough to keep NetworkManager-wait-online itself from timing out due to switch glitchiness.
So, getting rid of -s fixes network connections that use dhcp, perhaps in combination with an increased timeout. But getting rid of -s definitely breaks static IP connections even worse than they are already broken. With the default configuration, with the -s option, the IP addresses are not assigned until about 1-2 seconds after nm-online returns, based on the data I collected over the last week; and without the -s option nm-online returns 5-6 seconds before my script succeeds in binding the static IP address. With the -s option you have a fighting chance of booting a functional server with static IP addresses. Without the -s option you're pretty much assured of a farked boot, with all services that bind to specific IP addresses failing to start (except for openssh, whose service file has an explicit retry interval).
Just to add to this clusterfracas: systemd itself has a timeout for starting jobs. Looks like if you try to increase the timeout to more than 90 seconds, it won't work. This is buried in the man pages. systemd.service(5) documents that the default value for TimeoutStartSec comes from DefaultTimeoutStartSec "from the manager configuration file", which means /etc/systemd/system.conf. I don't see it set anywhere, which must mean, according to systemd-system.conf(5), that it takes its hardcoded default of 90 seconds.
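If one did want to raise that limit, the knob lives in the manager configuration file (or in a per-unit TimeoutStartSec= drop-in). A sketch, with the 180s value chosen arbitrarily:

# /etc/systemd/system.conf
# systemd-system.conf(5): DefaultTimeoutStartSec defaults to 90s.
[Manager]
DefaultTimeoutStartSec=180s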
So, looks like we're going to have to throw both NetworkManager and systemd into a cage ring. Fight to the death. Whoever comes out, gets to set the startup timeout.
On 12/21/2017 06:28 PM, Gordon Messmer wrote:
Because NetworkManager is an event-driven system. If an interface loses carrier for a defined period of time (5 seconds prior to the commit I've referenced), NM will remove the IP configuration from that interface.
And this is one of the most stupid things we've ever imported from Windows.
Roberto Ragusa writes:
On 12/21/2017 06:28 PM, Gordon Messmer wrote:
Because NetworkManager is an event-driven system. If an interface loses
carrier for a defined period of time (5 seconds prior to the commit I've referenced), NM will remove the IP configuration from that interface.
And this is one of the most stupid things we've ever imported from Windows.
… for interfaces with static IP addresses.
This is quite reasonable, and most likely unavoidable, for DHCP-based network ports. But for ports with explicitly defined static IP addresses, having them configured based on the presence of a carrier makes no logical sense whatsoever, and it ends up accomplishing absolutely nothing useful except breaking servers. As we've seen happen, time and time again.
I defy anyone to identify a tangible benefit that comes from removing a static IP address from a port when it loses carrier, and installing one only once a carrier is present. I tried to play devil's advocate, but couldn't come up with a single damn reason.
Once upon a time, Sam Varshavchik mrsam@courier-mta.com said:
I defy anyone to identify a tangible benefit that comes from removing a static IP address from a port when it loses carrier, and installing one only once a carrier is present.
It is useful for systems with multiple interfaces, for example a desktop with wired and wifi, and different preference default routes out both (so if the wired goes down, traffic can still go out over the wifi). Anything acting like a router also needs this behavior, typically in conjunction with dynamic routing protocols.
On 12/24/2017 10:00 PM, Chris Adams wrote:
Once upon a time, Sam Varshavchik mrsam@courier-mta.com said:
I defy anyone to identify a tangible benefit that comes from removing a static IP address from a port when it loses carrier, and installing one only once a carrier is present.
It is useful for systems with multiple interfaces, for example a desktop with wired and wifi, and different preference default routes out both (so if the wired goes down, traffic can still go out over the wifi). Anything acting like a router also needs this behavior, typically in conjunction with dynamic routing protocols.
Routing fail over is not a good reason to totally unconfigure an interface, especially on machines where there is nowhere else to send packets. You should just change the default gw, maybe.
Carrier is a local thing, routing ability is a global thing, and trying to mix the two is wrong. Should my ethernet be unconfigured when the DSL copper cable in my router is unplugged? So, why should it be unconfigured when the ethernet cable between my DSL modem and me is?
On 01/01/2018 03:09 AM, Roberto Ragusa wrote:
On 12/24/2017 10:00 PM, Chris Adams wrote:
Once upon a time, Sam Varshavchik mrsam@courier-mta.com said:
I defy anyone to identify a tangible benefit that comes from removing a static IP address from a port when it loses carrier, and installing one only once a carrier is present.
It is useful for systems with multiple interfaces, for example a desktop with wired and wifi, and different preference default routes out both (so if the wired goes down, traffic can still go out over the wifi). Anything acting like a router also needs this behavior, typically in conjuction with dynamic routing protocols.
Routing failover is not a good reason to totally unconfigure an interface, especially on machines where there is nowhere else to send packets. You should just change the default gateway, maybe.
As a network admin, I can see no reason to remove a fixed IP address from a NIC based on whether or not there's a carrier present. Even in the case of DHCP, unless the address lease expires between a disconnect and reconnect or there's pressure on the DHCP pool, most DHCP servers will try to give a client NIC the same IP address it had when it disconnected. Thus, even in those cases, the address "sticks".
Route failover is a common thing to do. We have multiple uplinks on our routers: a primary and at least one backup, each on a different ISP. If the primary goes down, the router fails over to (one of) the secondary uplink(s) based on cost factors. However, the primary is NOT deconfigured and its IP address remains assigned. If the primary uplink comes back up, the routes fail back to it. This is standard practice. There's no need to remove its fixed IP address while it's down (and doing so would, in fact, be a "bad idea").
Now, if we want one of the uplinks to remain unavailable, it is marked "administratively down" (in Cisco IOS-speak). For Linux, this would essentially be "ip link set dev <devname> down". This takes the link down, but doesn't change the IP address for it.
To remove the address, you'd "ip addr del <addr> dev <devname>". Utterly unnecessary (and sort of stupid) if you do the "ip link" command.
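Side by side, with a placeholder device and address, the two operations being distinguished here:

ip addr add 192.168.0.1/24 dev eth0
ip link set dev eth0 down               # administratively down; IPv4 address stays configured
ip addr show dev eth0                   # still lists 192.168.0.1/24, state DOWN
ip link set dev eth0 up                 # back up; nothing to re-configure
ip addr del 192.168.0.1/24 dev eth0     # this is what actually removes the address

(Caveat: IPv6 addresses, unlike IPv4 ones, are flushed on link-down by default, so the parallel isn't perfect there.)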
That's my stubborn, curmudgeon-y opinion.
----------------------------------------------------------------------
- Rick Stevens, Systems Engineer, AllDigital   ricks@alldigital.com  -
- AIM/Skype: therps2        ICQ: 226437340         Yahoo: origrps2   -
-                                                                    -
-   Never put off 'til tomorrow what you can forget altogether!      -
----------------------------------------------------------------------
Once upon a time, Rick Stevens ricks@alldigital.com said:
Now, if we want one of the uplinks to remain unavailable, it is marked "administratively down" (in Cisco IOS-speak). For Linux, this would essentially be "ip link set dev <devname> down". This takes the link down, but doesn't change the IP address for it.
The difference is that Cisco can leave an interface configured but inactive, while Linux doesn't really support that. On a Cisco, the IP will still show up configured on the interface, but things like "show ip route <IP>" will not show it (it is not actually active when the interface is down).
If you configure a Cisco interface for 10.0.0.1/24 and the link drops, the route for 10.0.0.0/24 is removed from the routing table. The interface shows down, but it is still watching for link to come back. When that happens, 10.0.0.0/24 is re-added to the routing table.
On Linux, with static config, the route would still point out the down interface. If you shut the link down, the route will be removed, but then the interface will not come back automatically when the link returns. You have to have some type of user-space active network management to get basic router-like behavior (functionally, that's how Cisco actually works; it's just all under the hood).
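What that user-space glue has to do can be sketched crudely in shell. This is an illustration only — it polls /sys where real daemons listen for netlink events, and the device and prefix are placeholders:

#!/bin/sh
# re-add a connected route while eth0 has carrier; drop it when carrier is lost
DEV=eth0
NET=10.0.0.0/24
while sleep 1; do
    if [ "$(cat /sys/class/net/$DEV/carrier 2>/dev/null)" = "1" ]; then
        ip route replace $NET dev $DEV
    else
        ip route del $NET dev $DEV 2>/dev/null
    fi
done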
On 01/03/2018 04:53 PM, Rick Stevens wrote:
As a network admin, I can see no reason to remove a fixed IP address from a NIC based on whether or not there's a carrier present. Even in the case of DHCP, unless the address lease expires between a disconnect and reconnect or there's pressure on the DHCP pool, most DHCP servers will try to give a client NIC the same IP address it had when it disconnected. Thus, even in those cases, the address "sticks".
Even if the IP address stays the same, other parameters can change. I recently had an issue with my cable modem needing to be re-provisioned on Comcast's network. My system did not recognize the change in DNS servers when the modem came back up (the change from the "Walled Garden" DNS servers to the standard ones). It took me a while to figure out what was wrong and do a manual "ifdown" and "ifup" to get things right. With NetworkManager, it would have been automatic.