Fri, Oct 09, 2015 at 04:02:48PM CEST, ctcard@hotmail.com wrote:
Wed, Oct 07, 2015 at 10:15:35AM CEST, ctcard@hotmail.com wrote:
>Hi,
>
>We are occasionally seeing issues with the teaming daemon not starting after a reboot on CentOS 7 VMs. Here is an example from last night (from /var/log/messages.minor):
>
> <snip>
>Is this a known issue?
Not to me.
Need more info, debug messages.
Thanks, I'll try and reproduce with teamd running with debug output
I've tried to reproduce with teamd running with -g, but so far it hasn't happened again. My guess is that it is a timing issue/race condition, so running teamd with debug output may even slow it down enough to stop it happening. Can you give me any idea what teamd might be racing against?
No clue, this is quite odd. Try to compile teamd yourself and find out where exactly the error happens.
I just had this occur again, this time with teamd running with debug output, but there was no useful extra information that I could see. I did look at the source code for teamd, and the error occurs when this function fails:
int get_ifinfo_list(struct team_handle *th) <snip>
but unfortunately I can't see that the error code is ever printed out anywhere, to narrow down the point of failure. I guess I'll have to follow your suggestion, and compile a version that prints out more info on failure.
Sure, that's why I asked :)
I finally had it fail again with my extra logging, and the error I see is this:
get_ifinfo_list: nl_recvmsgs failed: ret = -33
This is my patched code:

int get_ifinfo_list(struct team_handle *th)
{
	struct nl_cb *cb;
	struct nl_cb *orig_cb;
	struct rtgenmsg rt_hdr = {
		.rtgen_family = AF_UNSPEC,
	};
	int ret;

	ret = nl_send_simple(th->nl_cli.sock, RTM_GETLINK, NLM_F_DUMP,
			     &rt_hdr, sizeof(rt_hdr));
	if (ret < 0) {
		err(th, "get_ifinfo_list: nl_send_simple failed: ret = %d", ret);
		return -nl2syserr(ret);
	}
	orig_cb = nl_socket_get_cb(th->nl_cli.sock);
	cb = nl_cb_clone(orig_cb);
	nl_cb_put(orig_cb);
	if (!cb) {
		err(th, "get_ifinfo_list: nl_cb_clone failed");
		return -ENOMEM;
	}

	nl_cb_set(cb, NL_CB_VALID, NL_CB_CUSTOM, valid_handler, th);

	ret = nl_recvmsgs(th->nl_cli.sock, cb);
	nl_cb_put(cb);
	if (ret < 0) {
		err(th, "get_ifinfo_list: nl_recvmsgs failed: ret = %d", ret);
		return -nl2syserr(ret);
	}
	ret = check_call_change_handlers(th, TEAM_IFINFO_CHANGE);
	if (ret < 0) {
		err(th, "get_ifinfo_list: check_call_change_handers failed: ret = %d", ret);
	}
	return ret;
}
33 appears to be the error code NLE_DUMP_INTR from /usr/include/libnl3/netlink/errno.h. Does that make it any clearer what might be happening?
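
If NLE_DUMP_INTR here means that the RTM_GETLINK dump was interrupted because the interface list changed while it was being read (which would fit the race-condition guess, but is an assumption, not something confirmed in this thread), then one possible workaround is to retry the whole dump a bounded number of times. Below is a minimal sketch of that retry pattern only. It does not use libnl: SKETCH_NLE_DUMP_INTR, dump_fn, retry_interrupted_dump and the fake_dump stand-in are all hypothetical names made up for illustration, and the real attempt would have to redo both the nl_send_simple and the nl_recvmsgs steps.

```c
#include <stdio.h>

/* Assumption: mirrors the value of NLE_DUMP_INTR (33) seen in the log
 * line "nl_recvmsgs failed: ret = -33". Defined locally so the sketch
 * builds without libnl headers. */
#define SKETCH_NLE_DUMP_INTR 33

/* Hypothetical callback: performs one complete dump attempt (send the
 * request and receive the replies). Returns 0 on success or a negative
 * libnl-style error code on failure. */
typedef int (*dump_fn)(void *ctx);

/* Call attempt() until it returns something other than
 * -SKETCH_NLE_DUMP_INTR, or until max_tries attempts have been made.
 * Returns the result of the last attempt. */
static int retry_interrupted_dump(dump_fn attempt, void *ctx, int max_tries)
{
	int ret = -SKETCH_NLE_DUMP_INTR;
	int i;

	for (i = 0; i < max_tries; i++) {
		ret = attempt(ctx);
		if (ret != -SKETCH_NLE_DUMP_INTR)
			break;
		fprintf(stderr, "dump interrupted, attempt %d of %d\n",
			i + 1, max_tries);
	}
	return ret;
}

/* Stand-in for the real dump, for demonstration only: reports an
 * interrupted dump failures_left times, then succeeds. */
struct fake_dump {
	int failures_left;
	int calls;
};

static int fake_dump_attempt(void *ctx)
{
	struct fake_dump *fd = ctx;

	fd->calls++;
	if (fd->failures_left > 0) {
		fd->failures_left--;
		return -SKETCH_NLE_DUMP_INTR;
	}
	return 0;
}
```

With a stand-in that is interrupted twice, retry_interrupted_dump(fake_dump_attempt, &state, 5) returns 0 after three calls; if the interruptions outlast max_tries, the last -33 is passed back so the caller can still log it and fail as before.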
Chris