Hi,
We are occasionally seeing issues with the teaming daemon not starting after a reboot on CentOS 7 VMs. Here is an example from last night (from /var/log/messages.minor):
Oct 6 23:36:21 ****** ovs-ctl[623]: Starting ovsdb-server [ OK ]
Oct 6 23:36:22 ****** ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait -- init -- set Open_vSwitch . db-version=7.6.2
Oct 6 23:36:22 ****** ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set Open_vSwitch . ovs-version=2.3.1 "external-ids:system-id="47ff9309-5609-47e0-819c-b9055b25edbb"" "system-type="CentOS"" "system-version="7.1.1503-Core""
Oct 6 23:36:22 ****** ovs-ctl[623]: Configuring Open vSwitch system IDs [ OK ]
Oct 6 23:36:22 ****** network[733]: Bringing up loopback interface: [ OK ]
Oct 6 23:36:22 ****** kernel: [ 6.158533] gre: GRE over IPv4 demultiplexor driver
Oct 6 23:36:22 ****** systemd[1]: Starting system-teamd.slice.
Oct 6 23:36:22 ****** systemd[1]: Created slice system-teamd.slice.
Oct 6 23:36:22 ****** systemd[1]: Starting Team Daemon for device bond0...
Oct 6 23:36:22 ****** kernel: [ 6.199635] openvswitch: Open vSwitch switching datapath
Oct 6 23:36:22 ****** ovs-ctl[623]: Inserting openvswitch module [ OK ]
Oct 6 23:36:22 ****** kernel: [ 6.338577] device ovs-system entered promiscuous mode
Oct 6 23:36:22 ****** kernel: [ 6.340086] openvswitch: netlink: Unknown key attribute (type=62, max=21).
Oct 6 23:36:22 ****** kernel: [ 6.385293] device br-ex entered promiscuous mode
Oct 6 23:36:22 ****** kernel: [ 6.426511] device br-int entered promiscuous mode
Oct 6 23:36:22 ****** teamd[857]: Failed to get interface information list.
Oct 6 23:36:22 ****** teamd[857]: Failed to init interface information list.
Oct 6 23:36:22 ****** teamd[857]: Team init failed.
Oct 6 23:36:22 ****** teamd[857]: teamd_init() failed.
Oct 6 23:36:22 ****** teamd[857]: Failed: Invalid argument
Oct 6 23:36:22 ****** systemd[1]: teamd@bond0.service: main process exited, code=exited, status=1/FAILURE
Oct 6 23:36:22 ****** network[733]: Bringing up interface bond0: Job for teamd@bond0.service failed. See 'systemctl status teamd@bond0.service' and 'journalctl -xn' for details.
Oct 6 23:36:22 ****** kernel: [ 6.433515] device br-tun entered promiscuous mode
Oct 6 23:36:22 ****** systemd[1]: Unit teamd@bond0.service entered failed state.
Oct 6 23:36:22 ****** ovs-ctl[623]: Starting ovs-vswitchd [ OK ]
Oct 6 23:36:22 ****** network[733]: [FAILED]
Oct 6 23:36:22 ****** ovs-ctl[623]: Enabling remote OVSDB managers [ OK ]
Is this a known issue?
Chris
Wed, Oct 07, 2015 at 10:15:35AM CEST, ctcard@hotmail.com wrote:
Hi,
We are occasionally seeing issues with the teaming daemon not starting after a reboot on CentOS 7 VMs. Here is an example from last night (from /var/log/messages.minor):
 <snip>
Is this a known issue?
Not to me.
Need more info, debug messages.
Wed, Oct 07, 2015 at 10:15:35AM CEST, ctcard@hotmail.com wrote:
Hi,
We are occasionally seeing issues with the teaming daemon not starting after a reboot on CentOS 7 VMs. Here is an example from last night (from /var/log/messages.minor):
<snip> Is this a known issue?
Not to me.
Need more info, debug messages.
Thanks, I'll try and reproduce with teamd running with debug output
Chris
Wed, Oct 07, 2015 at 10:15:35AM CEST, ctcard@hotmail.com wrote:
Hi,
We are occasionally seeing issues with the teaming daemon not starting after a reboot on CentOS 7 VMs. Here is an example from last night (from /var/log/messages.minor):
<snip> Is this a known issue?
Not to me.
Need more info, debug messages.
Thanks, I'll try and reproduce with teamd running with debug output
I've tried to reproduce with teamd running with -g, but so far it hasn't happened again. My guess is that it is a timing issue/race condition, so making teamd run with debug output may even slow it down enough to stop it happening. Can you give me any idea what teamd might be racing against?
Chris
_______________________________________________
libteam mailing list
libteam@lists.fedorahosted.org
https://lists.fedorahosted.org/mailman/listinfo/libteam
Thu, Oct 08, 2015 at 09:35:57AM CEST, ctcard@hotmail.com wrote:
Wed, Oct 07, 2015 at 10:15:35AM CEST, ctcard@hotmail.com wrote:
Hi,
We are occasionally seeing issues with the teaming daemon not starting after a reboot on CentOS 7 VMs. Here is an example from last night (from /var/log/messages.minor):
<snip> Is this a known issue?
Not to me.
Need more info, debug messages.
Thanks, I'll try and reproduce with teamd running with debug output
I've tried to reproduce with teamd running with -g, but so far it hasn't happened again. My guess is that it is a timing issue/race condition, so making teamd run with debug output may even slow it down enough to stop it happening. Can you give me any idea what teamd might be racing against?
No clue, this is quite odd. Try to compile teamd yourself and find out where exactly the error happens.
Wed, Oct 07, 2015 at 10:15:35AM CEST, ctcard@hotmail.com wrote:
Hi,
We are occasionally seeing issues with the teaming daemon not starting after a reboot on CentOS 7 VMs. Here is an example from last night (from /var/log/messages.minor):
<snip> Is this a known issue?
Not to me.
Need more info, debug messages.
Thanks, I'll try and reproduce with teamd running with debug output
I've tried to reproduce with teamd running with -g, but so far it hasn't happened again. My guess is that it is a timing issue/race condition, so making teamd run with debug output may even slow it down enough to stop it happening. Can you give me any idea what teamd might be racing against?
No clue, this is quite odd. Try to compile teamd yourself and find out where exactly the error happens.
I just had this occur again, this time with teamd running with debug output, but there was no useful extra information that I could see. I did look at the source code for teamd, and the error occurs when this function fails:
int get_ifinfo_list(struct team_handle *th)
{
	struct nl_cb *cb;
	struct nl_cb *orig_cb;
	struct rtgenmsg rt_hdr = {
		.rtgen_family = AF_UNSPEC,
	};
	int ret;

	ret = nl_send_simple(th->nl_cli.sock, RTM_GETLINK, NLM_F_DUMP,
			     &rt_hdr, sizeof(rt_hdr));
	if (ret < 0)
		return -nl2syserr(ret);
	orig_cb = nl_socket_get_cb(th->nl_cli.sock);
	cb = nl_cb_clone(orig_cb);
	nl_cb_put(orig_cb);
	if (!cb)
		return -ENOMEM;

	nl_cb_set(cb, NL_CB_VALID, NL_CB_CUSTOM, valid_handler, th);

	ret = nl_recvmsgs(th->nl_cli.sock, cb);
	nl_cb_put(cb);
	if (ret < 0)
		return -nl2syserr(ret);
	return check_call_change_handlers(th, TEAM_IFINFO_CHANGE);
}
but unfortunately I can't see that the error code is ever printed out anywhere, to narrow down the point of failure. I guess I'll have to follow your suggestion, and compile a version that prints out more info on failure.
Chris
Fri, Oct 09, 2015 at 04:02:48PM CEST, ctcard@hotmail.com wrote:
Wed, Oct 07, 2015 at 10:15:35AM CEST, ctcard@hotmail.com wrote:
Hi,
We are occasionally seeing issues with the teaming daemon not starting after a reboot on CentOS 7 VMs. Here is an example from last night (from /var/log/messages.minor):
<snip> Is this a known issue?
Not to me.
Need more info, debug messages.
Thanks, I'll try and reproduce with teamd running with debug output
I've tried to reproduce with teamd running with -g, but so far it hasn't happened again. My guess is that it is a timing issue/race condition, so making teamd run with debug output may even slow it down enough to stop it happening. Can you give me any idea what teamd might be racing against?
No clue, this is quite odd. Try to compile teamd yourself and find out where exactly the error happens.
I just had this occur again, this time with teamd running with debug output, but there was no useful extra information that I could see. I did look at the source code for teamd, and the error occurs when this function fails:
int get_ifinfo_list(struct team_handle *th)
<snip>
but unfortunately I can't see that the error code is ever printed out anywhere, to narrow down the point of failure. I guess I'll have to follow your suggestion, and compile a version that prints out more info on failure.
Sure, that's why I asked :)
Fri, Oct 09, 2015 at 04:02:48PM CEST, ctcard@hotmail.com wrote:
Wed, Oct 07, 2015 at 10:15:35AM CEST, ctcard@hotmail.com wrote:
>Hi,
>
>We are occasionally seeing issues with the teaming daemon not starting after a reboot on CentOS 7 VMs. Here is an example from last night (from /var/log/messages.minor):
>
> <snip>
>Is this a known issue?
Not to me.
Need more info, debug messages.
Thanks, I'll try and reproduce with teamd running with debug output
I've tried to reproduce with teamd running with -g, but so far it hasn't happened again. My guess is that it is a timing issue/race condition, so making teamd run with debug output may even slow it down enough to stop it happening. Can you give me any idea what teamd might be racing against?
No clue, this is quite odd. Try to compile teamd yourself and find out where exactly the error happens.
I just had this occur again, this time with teamd running with debug output, but there was no useful extra information that I could see. I did look at the source code for teamd, and the error occurs when this function fails:
int get_ifinfo_list(struct team_handle *th) <snip>
but unfortunately I can't see that the error code is ever printed out anywhere, to narrow down the point of failure. I guess I'll have to follow your suggestion, and compile a version that prints out more info on failure.
Sure, that's why I asked :)
I finally had it fail again with my extra logging, and the error I see is this:
get_ifinfo_list: nl_recvmsgs failed: ret = -33
This is my patched code:

int get_ifinfo_list(struct team_handle *th)
{
	struct nl_cb *cb;
	struct nl_cb *orig_cb;
	struct rtgenmsg rt_hdr = {
		.rtgen_family = AF_UNSPEC,
	};
	int ret;

	ret = nl_send_simple(th->nl_cli.sock, RTM_GETLINK, NLM_F_DUMP,
			     &rt_hdr, sizeof(rt_hdr));
	if (ret < 0) {
		err(th, "get_ifinfo_list: nl_send_simple failed: ret = %d", ret);
		return -nl2syserr(ret);
	}
	orig_cb = nl_socket_get_cb(th->nl_cli.sock);
	cb = nl_cb_clone(orig_cb);
	nl_cb_put(orig_cb);
	if (!cb) {
		err(th, "get_ifinfo_list: nl_cb_clone failed");
		return -ENOMEM;
	}

	nl_cb_set(cb, NL_CB_VALID, NL_CB_CUSTOM, valid_handler, th);

	ret = nl_recvmsgs(th->nl_cli.sock, cb);
	nl_cb_put(cb);
	if (ret < 0) {
		err(th, "get_ifinfo_list: nl_recvmsgs failed: ret = %d", ret);
		return -nl2syserr(ret);
	}
	ret = check_call_change_handlers(th, TEAM_IFINFO_CHANGE);
	if (ret < 0) {
		err(th, "get_ifinfo_list: check_call_change_handlers failed: ret = %d", ret);
	}
	return ret;
}
and 33 appears to be the error code NLE_DUMP_INTR from /usr/include/libnl3/netlink/errno.h.
Does that make it any clearer what might be happening?
Chris
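[Editor's note: to make the "ret = -33" above easier to decode: libnl has its own error namespace, separate from errno, and nl_geterror() maps those codes to strings. The sketch below is a self-contained stand-in, not a libnl API; the numeric values are copied from libnl3 3.2.x <netlink/errno.h>, and nle_name() is a hypothetical helper for illustration.]

```c
#include <assert.h>
#include <string.h>

/* A few libnl3 error codes, copied from <netlink/errno.h> (3.2.x). */
enum {
	NLE_SUCCESS   = 0,
	NLE_FAILURE   = 1,
	NLE_DUMP_INTR = 33,	/* "Dump inconsistency detected, interrupted" */
};

/* Hypothetical helper mimicking what nl_geterror() does for real:
 * libnl calls return negative codes, so normalize the sign first. */
static const char *nle_name(int err)
{
	switch (err < 0 ? -err : err) {
	case NLE_SUCCESS:   return "NLE_SUCCESS";
	case NLE_FAILURE:   return "NLE_FAILURE";
	case NLE_DUMP_INTR: return "NLE_DUMP_INTR";
	default:            return "NLE_UNKNOWN";
	}
}
```

With this, the logged ret = -33 decodes directly to NLE_DUMP_INTR, matching what Chris found in the header.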
Mon, Oct 19, 2015 at 09:27:25AM CEST, ctcard@hotmail.com wrote:
Fri, Oct 09, 2015 at 04:02:48PM CEST, ctcard@hotmail.com wrote:
> Wed, Oct 07, 2015 at 10:15:35AM CEST, ctcard@hotmail.com wrote:
>>Hi,
>>
>>We are occasionally seeing issues with the teaming daemon not starting after a reboot on CentOS 7 VMs. Here is an example from last night (from /var/log/messages.minor):
>>
>> <snip>
>>Is this a known issue?
>
> Not to me.
>
> Need more info, debug messages.
Thanks, I'll try and reproduce with teamd running with debug output
I've tried to reproduce with teamd running with -g, but so far it hasn't happened again. My guess is that it is a timing issue/race condition, so making teamd run with debug output may even slow it down enough to stop it happening. Can you give me any idea what teamd might be racing against?
No clue, this is quite odd. Try to compile teamd yourself and find out where exactly the error happens.
I just had this occur again, this time with teamd running with debug output, but there was no useful extra information that I could see. I did look at the source code for teamd, and the error occurs when this function fails:
int get_ifinfo_list(struct team_handle *th) <snip>
but unfortunately I can't see that the error code is ever printed out anywhere, to narrow down the point of failure. I guess I'll have to follow your suggestion, and compile a version that prints out more info on failure.
Sure, that's why I asked :)
I finally had it fail again with my extra logging, and the error I see is this:
get_ifinfo_list: nl_recvmsgs failed: ret = -33
This is my patched code:

int get_ifinfo_list(struct team_handle *th)
<snip>
and 33 appears to be the error code NLE_DUMP_INTR from /usr/include/libnl3/netlink/errno.h.
Does that make it any clearer what might be happening?
That looks like it might be a bug inside libnl. What version do you use?
>>>Hi,
>>>
>>>We are occasionally seeing issues with the teaming daemon not starting after a reboot on CentOS 7 VMs. Here is an example from last night (from /var/log/messages.minor):
>>>
>>> <snip>
>>>Is this a known issue?
>>
>> Not to me.
>>
>> Need more info, debug messages.
> Thanks, I'll try and reproduce with teamd running with debug output
I've tried to reproduce with teamd running with -g, but so far it hasn't happened again. My guess is that it is a timing issue/race condition, so making teamd run with debug output may even slow it down enough to stop it happening. Can you give me any idea what teamd might be racing against?
No clue, this is quite odd. Try to compile teamd yourself and find out where exactly the error happens.
I just had this occur again, this time with teamd running with debug output, but there was no useful extra information that I could see. I did look at the source code for teamd, and the error occurs when this function fails:
int get_ifinfo_list(struct team_handle *th)
<snip>
but unfortunately I can't see that the error code is ever printed out anywhere, to narrow down the point of failure. I guess I'll have to follow your suggestion, and compile a version that prints out more info on failure.
Sure, that's why I asked :)
I finally had it fail again with my extra logging, and the error I see is this:
get_ifinfo_list: nl_recvmsgs failed: ret = -33
This is my patched code:

int get_ifinfo_list(struct team_handle *th)
<snip>
and 33 appears to be the error code NLE_DUMP_INTR from /usr/include/libnl3/netlink/errno.h.
Does that make it any clearer what might be happening?
That looks like it might be a bug inside libnl. What version do you use?
we are using libnl3-3.2.21-8.el7.x86_64.
Looking at the source code of libnl, I can see this in recvmsgs():
	if (hdr->nlmsg_flags & NLM_F_DUMP_INTR) {
		if (cb->cb_set[NL_CB_DUMP_INTR])
			NL_CB_CALL(cb, NL_CB_DUMP_INTR, msg);
		else {
			/*
			 * We have to continue reading to clear
			 * all messages until a NLMSG_DONE is
			 * received and report the inconsistency.
			 */
			interrupted = 1;
		}
	}
	...
	if (interrupted)
		err = -NLE_DUMP_INTR;

	if (!err)
		err = nrecv;

	return err;
}

so should libteam be handling NLE_DUMP_INTR?
Chris
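[Editor's note: the interrupted-dump logic quoted above can be modeled in miniature. The kernel bumps a generation counter whenever the link list changes; if it changes while an RTM_GETLINK dump is in flight, the remaining messages carry NLM_F_DUMP_INTR and libnl drains them to NLMSG_DONE before reporting -NLE_DUMP_INTR. The sketch below is a self-contained toy, all names (toy_dump, toy_recvmsgs) hypothetical, that mirrors the interrupted = 1 / return -NLE_DUMP_INTR path.]

```c
#include <assert.h>

#define NLE_DUMP_INTR 33	/* value from libnl3 <netlink/errno.h> */

/* Toy model of the recvmsgs() logic quoted above. */
struct toy_dump {
	int gen_at_start;	/* "kernel" generation when the dump began */
	int cur_gen;		/* "kernel" generation now */
	int msgs_left;		/* messages still queued for this dump */
};

static int toy_recvmsgs(struct toy_dump *d)
{
	int interrupted = 0, nrecv = 0;

	while (d->msgs_left-- > 0) {
		/* NLM_F_DUMP_INTR equivalent: the list changed under us.
		 * Keep reading until "NLMSG_DONE", then report it. */
		if (d->cur_gen != d->gen_at_start)
			interrupted = 1;
		nrecv++;
	}
	return interrupted ? -NLE_DUMP_INTR : nrecv;
}
```

A consistent dump returns the message count; a dump that raced with a link change returns -33, which is exactly the failure mode teamd hits when it starts while OVS is still creating br-ex/br-int/br-tun.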
Mon, Oct 19, 2015 at 09:44:12AM CEST, ctcard@hotmail.com wrote:
>>>>Hi,
>>>>
>>>>We are occasionally seeing issues with the teaming daemon not starting after a reboot on CentOS 7 VMs. Here is an example from last night (from /var/log/messages.minor):
>>>>
>>>> <snip>
>>>>Is this a known issue?
>>>
>>> Not to me.
>>>
>>> Need more info, debug messages.
>> Thanks, I'll try and reproduce with teamd running with debug output
>I've tried to reproduce with teamd running with -g, but so far it hasn't happened again.
>My guess is that it is a timing issue/race condition, so making teamd run with debug output may even slow it down enough to stop it happening.
>Can you give me any idea what teamd might be racing against?
No clue, this is quite odd. Try to compile teamd yourself and find out where exactly the error happens.
I just had this occur again, this time with teamd running with debug output, but there was no useful extra information that I could see. I did look at the source code for teamd, and the error occurs when this function fails:
int get_ifinfo_list(struct team_handle *th)
<snip>
but unfortunately I can't see that the error code is ever printed out anywhere, to narrow down the point of failure. I guess I'll have to follow your suggestion, and compile a version that prints out more info on failure.
Sure, that's why I asked :)
I finally had it fail again with my extra logging, and the error I see is this:
get_ifinfo_list: nl_recvmsgs failed: ret = -33
This is my patched code:

int get_ifinfo_list(struct team_handle *th)
<snip>
and 33 appears to be the error code NLE_DUMP_INTR from /usr/include/libnl3/netlink/errno.h.
Does that make it any clearer what might be happening?
That looks like it might be a bug inside libnl. What version do you use?
we are using libnl3-3.2.21-8.el7.x86_64.
Looking at the source code of libnl, I can see this in recvmsgs():
<snip>
so should libteam be handling NLE_DUMP_INTR?
Yes, that looks like a way to go. Do you want to cook-up some patch to fix this?
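[Editor's note: the fix being discussed is to retry the dump when it is interrupted. Below is a minimal self-contained sketch of that approach under stated assumptions, not libteam's actual patch: nl_dump_once() is a hypothetical stand-in for the nl_send_simple()/nl_recvmsgs() sequence in get_ifinfo_list(), and the first call "fails" to mimic a dump racing with a link change.]

```c
#include <assert.h>

#define NLE_DUMP_INTR 33	/* value from libnl3 <netlink/errno.h> */
#define MAX_DUMP_RETRIES 5	/* arbitrary bound so a busy box can't loop forever */

static int fail_first = 1;	/* simulates one interrupted dump */

/* Stand-in for the RTM_GETLINK dump in get_ifinfo_list(). */
static int nl_dump_once(void)
{
	if (fail_first) {
		fail_first = 0;
		return -NLE_DUMP_INTR;	/* dump raced with a link change */
	}
	return 0;			/* dump completed consistently */
}

/* Retry the whole dump while it keeps getting interrupted. */
static int get_ifinfo_list_retrying(void)
{
	int ret, tries = 0;

	do {
		ret = nl_dump_once();
	} while (ret == -NLE_DUMP_INTR && ++tries < MAX_DUMP_RETRIES);
	return ret;
}
```

The key design point is that NLE_DUMP_INTR does not mean the request failed, only that the snapshot was inconsistent, so re-issuing the dump from scratch is the correct recovery rather than propagating an error up to teamd_init().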
On Wed, Oct 07, 2015 at 08:15:35AM +0000, Chris Card wrote:
Hi,
We are occasionally seeing issues with the teaming daemon not starting after a reboot on CentOS 7 VMs. Here is an example from last night (from /var/log/messages.minor):
 <snip>
Oct 6 23:36:22 ****** teamd[857]: Failed to get interface information list.
Oct 6 23:36:22 ****** teamd[857]: Failed to init interface information list.
Oct 6 23:36:22 ****** teamd[857]: Team init failed.
Oct 6 23:36:22 ****** teamd[857]: teamd_init() failed.
Oct 6 23:36:22 ****** teamd[857]: Failed: Invalid argument
Oct 6 23:36:22 ****** systemd[1]: teamd@bond0.service: main process exited, code=exited, status=1/FAILURE
Oct 6 23:36:22 ****** network[733]: Bringing up interface bond0: Job for teamd@bond0.service failed. See 'systemctl status teamd@bond0.service' and 'journalctl -xn' for details.
systemd is telling you the command to get more info. I suspect that bond0 is using ports that might not be available at the time teamd is initializing.
fbl
 <snip>