Mon, Oct 19, 2015 at 09:44:12AM CEST, ctcard(a)hotmail.com wrote:
>>>>>>>>>Hi,
>>>>>>>>>
>>>>>>>>>We are occasionally seeing issues with the teaming daemon not
>>>>>>>>>starting after a reboot on CentOS 7 VMs. Here is an example from
>>>>>>>>>last night (from /var/log/messages.minor):
>>>>>>>>>
>>>>>>>>> <snip>
>>>>>>>>>Is this a known issue?
>>>>>>>>
>>>>>>>> Not to me.
>>>>>>>>
>>>>>>>> Need more info, debug messages.
>>>>>>> Thanks, I'll try to reproduce with teamd running with debug
>>>>>>> output.
>>>>>>I've tried to reproduce with teamd running with -g, but so far it
>>>>>>hasn't happened again.
>>>>>>My guess is that it is a timing issue/race condition, so running
>>>>>>teamd with debug output may even slow it down enough to stop it
>>>>>>happening.
>>>>>>Can you give me any idea what teamd might be racing against?
>>>>>
>>>>> No clue, this is quite odd. Try to compile teamd yourself and find
>>>>> out where exactly the error happens.
>>>>I just had this occur again, this time with teamd running with debug
>>>>output, but there was no useful extra information that I could see.
>>>>I did look at the source code for teamd, and the error occurs when
>>>>this function fails:
>>>>
>>>>int get_ifinfo_list(struct team_handle *th)
>>>> <snip>
>>>>
>>>>but unfortunately I can't see that the error code is ever printed out
>>>>anywhere, to narrow down the point of failure.
>>>>I guess I'll have to follow your suggestion and compile a version
>>>>that prints out more info on failure.
>>>
>>> Sure, that's why I asked :)
>>I finally had it fail again with my extra logging, and the error I see is this:
>>
>>get_ifinfo_list: nl_recvmsgs failed: ret = -33
>>
>>This is my patched code:
>>int get_ifinfo_list(struct team_handle *th)
>>{
>>    struct nl_cb *cb;
>>    struct nl_cb *orig_cb;
>>    struct rtgenmsg rt_hdr = {
>>        .rtgen_family = AF_UNSPEC,
>>    };
>>    int ret;
>>
>>    ret = nl_send_simple(th->nl_cli.sock, RTM_GETLINK, NLM_F_DUMP,
>>                         &rt_hdr, sizeof(rt_hdr));
>>    if (ret < 0)
>>    {
>>        err(th, "get_ifinfo_list: nl_send_simple failed: ret = %d", ret);
>>        return -nl2syserr(ret);
>>    }
>>    orig_cb = nl_socket_get_cb(th->nl_cli.sock);
>>    cb = nl_cb_clone(orig_cb);
>>    nl_cb_put(orig_cb);
>>    if (!cb)
>>    {
>>        err(th, "get_ifinfo_list: nl_cb_clone failed");
>>        return -ENOMEM;
>>    }
>>
>>    nl_cb_set(cb, NL_CB_VALID, NL_CB_CUSTOM, valid_handler, th);
>>
>>    ret = nl_recvmsgs(th->nl_cli.sock, cb);
>>    nl_cb_put(cb);
>>    if (ret < 0)
>>    {
>>        err(th, "get_ifinfo_list: nl_recvmsgs failed: ret = %d", ret);
>>        return -nl2syserr(ret);
>>    }
>>    ret = check_call_change_handlers(th, TEAM_IFINFO_CHANGE);
>>    if (ret < 0)
>>    {
>>        err(th, "get_ifinfo_list: check_call_change_handlers failed: ret = %d",
>>            ret);
>>    }
>>    return ret;
>>}
>>
>>and 33 appears to be the error code NLE_DUMP_INTR from
>>/usr/include/libnl3/netlink/errno.h
>>Does that make any clearer what might be happening?
>
> That looks like it might be a bug inside libnl. What version do you use?
>
We are using libnl3-3.2.21-8.el7.x86_64.
Looking at the libnl source code, I can see this in recvmsgs:
        if (hdr->nlmsg_flags & NLM_F_DUMP_INTR) {
            if (cb->cb_set[NL_CB_DUMP_INTR])
                NL_CB_CALL(cb, NL_CB_DUMP_INTR, msg);
            else {
                /*
                 * We have to continue reading to clear
                 * all messages until a NLMSG_DONE is
                 * received and report the inconsistency.
                 */
                interrupted = 1;
            }
        }
...
    if (interrupted)
        err = -NLE_DUMP_INTR;
    if (!err)
        err = nrecv;
    return err;
}
So should libteam be handling NLE_DUMP_INTR?
Yes, that looks like a way to go. Do you want to cook up a patch to
fix this?
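
For illustration only, here is a minimal, self-contained sketch of one way
a caller could handle NLE_DUMP_INTR: treat it as "the dump was interrupted,
start it again" and retry a bounded number of times. This is not the actual
libteam patch. The names SKETCH_NLE_DUMP_INTR, dump_once_fn,
dump_with_retry and fake_dump are hypothetical and exist only in this
sketch; the value 33 is taken from the log output quoted in this thread.
The callback stands in for one complete "nl_send_simple + nl_recvmsgs"
pass, so the sketch compiles without libnl.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: mirrors NLE_DUMP_INTR (33), as reported earlier in the thread. */
#define SKETCH_NLE_DUMP_INTR 33

/*
 * Hypothetical callback type: performs one full dump request and returns
 * 0 on success or a negative libnl-style error code on failure.
 */
typedef int (*dump_once_fn)(void *priv);

/*
 * Run the dump, repeating it while it reports an interrupted dump.
 * Gives up after max_tries attempts and returns the last result.
 * Any other error is returned immediately without retrying.
 */
static int dump_with_retry(dump_once_fn dump_once, void *priv, int max_tries)
{
    int ret = -SKETCH_NLE_DUMP_INTR;
    int i;

    assert(dump_once != NULL && max_tries > 0);

    for (i = 0; i < max_tries; i++) {
        ret = dump_once(priv);
        if (ret != -SKETCH_NLE_DUMP_INTR)
            break;
    }
    return ret;
}

/*
 * Test double: reports an interrupted dump for the first *priv calls,
 * then succeeds. Simulates interfaces changing during early boot.
 */
static int fake_dump(void *priv)
{
    int *remaining = priv;

    if (*remaining > 0) {
        (*remaining)--;
        return -SKETCH_NLE_DUMP_INTR;
    }
    return 0;
}
```

With a counter of 2, dump_with_retry(fake_dump, &counter, 5) succeeds on
the third attempt. In get_ifinfo_list the callback body would presumably
be the existing send/receive sequence, and the retry limit is a design
choice: a bound avoids spinning forever if the link table keeps changing.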