Mon, Oct 19, 2015 at 09:27:25AM CEST, ctcard(a)hotmail.com wrote:
> Fri, Oct 09, 2015 at 04:02:48PM CEST, ctcard(a)hotmail.com wrote:
>>
>>
>>>>>> Wed, Oct 07, 2015 at 10:15:35AM CEST, ctcard(a)hotmail.com wrote:
>>>>>>>Hi,
>>>>>>>
>>>>>>>We are occasionally seeing issues with the teaming daemon not starting
>>>>>>>after a reboot on CentOS 7 VMs. Here is an example from last night
>>>>>>>(from /var/log/messages.minor):
>>>>>>>
>>>>>>> <snip>
>>>>>>>Is this a known issue?
>>>>>>
>>>>>> Not to me.
>>>>>>
>>>>>> Need more info, debug messages.
>>>>> Thanks, I'll try and reproduce with teamd running with debug output
>>>>I've tried to reproduce with teamd running with -g, but so far it hasn't
>>>>happened again.
>>>>My guess is that it is a timing issue/race condition, so making teamd run
>>>>with debug output may even slow it down enough to stop it happening.
>>>>Can you give me any idea what teamd might be racing against?
>>>
>>> No clue, this is quite odd. Try to compile teamd yourself and find out
>>> where exactly the error happens.
>>I just had this occur again, this time with teamd running with debug output,
>>but there was no useful extra information that I could see.
>>I did look at the source code for teamd, and the error occurs when this
>>function fails:
>>
>>int get_ifinfo_list(struct team_handle *th)
>> <snip>
>>
>>but unfortunately I can't see that the error code is ever printed out
>>anywhere, to narrow down the point of failure.
>>I guess I'll have to follow your suggestion, and compile a version that
>>prints out more info on failure.
>
> Sure, that's why I asked :)
I finally had it fail again with my extra logging, and the error I see is this:
get_ifinfo_list: nl_recvmsgs failed: ret = -33
This is my patched code:
int get_ifinfo_list(struct team_handle *th)
{
	struct nl_cb *cb;
	struct nl_cb *orig_cb;
	struct rtgenmsg rt_hdr = {
		.rtgen_family = AF_UNSPEC,
	};
	int ret;

	/* Ask the kernel for a dump of all links. */
	ret = nl_send_simple(th->nl_cli.sock, RTM_GETLINK, NLM_F_DUMP,
			     &rt_hdr, sizeof(rt_hdr));
	if (ret < 0)
	{
		err(th, "get_ifinfo_list: nl_send_simple failed: ret = %d", ret);
		return -nl2syserr(ret);
	}

	/* Clone the socket's callback set, then drop the reference taken
	 * by nl_socket_get_cb(). */
	orig_cb = nl_socket_get_cb(th->nl_cli.sock);
	cb = nl_cb_clone(orig_cb);
	nl_cb_put(orig_cb);
	if (!cb)
	{
		err(th, "get_ifinfo_list: nl_cb_clone failed");
		return -ENOMEM;
	}
	nl_cb_set(cb, NL_CB_VALID, NL_CB_CUSTOM, valid_handler, th);

	/* Receive the dump replies; this is the call that fails with -33. */
	ret = nl_recvmsgs(th->nl_cli.sock, cb);
	nl_cb_put(cb);
	if (ret < 0)
	{
		err(th, "get_ifinfo_list: nl_recvmsgs failed: ret = %d", ret);
		return -nl2syserr(ret);
	}

	ret = check_call_change_handlers(th, TEAM_IFINFO_CHANGE);
	if (ret < 0)
	{
		err(th, "get_ifinfo_list: check_call_change_handlers failed: ret = %d",
		    ret);
	}
	return ret;
}
and 33 appears to be the error code NLE_DUMP_INTR from
/usr/include/libnl3/netlink/errno.h
Does that make it any clearer what might be happening?
That looks like it might be a bug inside libnl. What version do you use?
Chris
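
For what it's worth, NLE_DUMP_INTR is, as far as I understand it, what libnl
returns when the kernel flags a dump as interrupted because the data changed
while it was being dumped. That would fit a race with interfaces appearing
during boot. One possible workaround, not something confirmed anywhere in this
thread, is to repeat the dump a few times when that code comes back. Below is
a minimal, library-free sketch of that idea. The names dump_with_retry,
dump_fn, fake_dump and SKETCH_NLE_DUMP_INTR are hypothetical and made up for
illustration; the value 33 is taken from the errno.h lookup above; the fake
dump stands in for the nl_send_simple/nl_recvmsgs pair.

```c
#include <stddef.h>

/* Value of NLE_DUMP_INTR as reported in this thread (libnl3 netlink/errno.h).
 * Defined locally (assumption) so the sketch builds without libnl headers. */
#define SKETCH_NLE_DUMP_INTR 33

/* Hypothetical callback type: performs one complete dump request and
 * returns 0 on success or a negative libnl-style error code. */
typedef int (*dump_fn)(void *priv);

/* Hypothetical helper: repeat the dump while it is reported as interrupted,
 * for at most max_tries attempts. Any other result is returned at once. */
static int dump_with_retry(dump_fn fn, void *priv, int max_tries)
{
	int ret = -SKETCH_NLE_DUMP_INTR;
	int i;

	for (i = 0; i < max_tries; i++) {
		ret = fn(priv);
		if (ret != -SKETCH_NLE_DUMP_INTR)
			break;
	}
	return ret;
}

/* Fake dump, for illustration only: reports an interrupted dump a given
 * number of times, then succeeds. */
struct fake_dump {
	int failures_left;
	int calls;
};

static int fake_dump_once(void *priv)
{
	struct fake_dump *fd = priv;

	fd->calls++;
	if (fd->failures_left > 0) {
		fd->failures_left--;
		return -SKETCH_NLE_DUMP_INTR;
	}
	return 0;
}
```

In get_ifinfo_list the callback would wrap both the send and the receive,
since an interrupted dump has to be requested again from scratch. Any partial
state collected by valid_handler during the failed attempt would presumably
need to be discarded first; I have not checked how libteam handles that.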