Hi.
A little more info about that fix (or rather, the suggestion for one). Here's a simple example that reproduces the problem (note: it needs SMP).
Manually create the team device, add a bunch of VLANs and bring it up (it must be up first in this example), then start teamd with --take-over:
root@debian:~# ip link add name team0 type team
root@debian:~# for i in `seq 100 150` ; do ip link add link team0 name team0.$i type vlan id $i ; done
root@debian:~# ip link set team0 up
root@debian:~# cat teamd.conf
{
  "device": "team0",
  "runner": { "name": "activebackup" },
  "ports": {
    "eth1": {},
    "eth2": {}
  }
}
root@debian:~# teamd -o -N -f teamd.conf
This program is not intended to be run as root.
1.26 successfully started.
eth1: ethtool-link went up.
eth1: Can't get port priority. Using default.
Changed active port to "eth1".
eth2: ethtool-link went up.
It doesn't complain that anything is wrong. But when checking state:
root@debian:~# teamdctl team0 state
setup:
  runner: activebackup
ports:
  eth1
    link watches:
      link summary: up
      instance[link_watch_0]:
        name: ethtool
        link: up
        down count: 0
Failed to parse JSON port dump.
command call failed (Invalid argument)
Doesn't look healthy...
root@debian:~# teamdctl team0 state dump
{
    "ports": {
        "eth1": {
            "ifinfo": {
                "dev_addr": "08:00:27:81:95:cf",
                "dev_addr_len": 6,
                "ifindex": 3,
                "ifname": "eth1"
            },
            "link": {
                "duplex": "full",
                "speed": 1000,
                "up": true
            },
            "link_watches": {
                "list": {
                    "link_watch_0": {
                        "delay_down": 0,
                        "delay_up": 0,
                        "down_count": 0,
                        "name": "ethtool",
                        "up": true
                    }
                },
            },
        "eth2": {
            "link_watches": {
                "list": {
                    "link_watch_0": {
                        "delay_down": 0,
                        "delay_up": 0,
                        "down_count": 0,
                        "name": "ethtool",
                        "up": true
                    }
                }
            }
        }
    },
    "runner": {
        "active_port": "eth1"
    },
...
Port eth2 is missing all of its ifinfo, and it is not properly linked to the aggregate. The reason can be seen by running teamd under strace:
recvmsg(8, 0x7ffcb3805ca0, 0) = -1 ENOBUFS (No buffer space available)
That's the socket that is subscribed to RTNLGRP_LINK notifications. When teamd started, all the VLANs on top of the team device got carrier up, and the kernel sent events faster than teamd could read them. In this example, the events that were lost were the ones related to port eth2.
That socket uses the libnl default buffer size of 32k. These netlink messages are quite large (over 1k each), so the buffer fills up easily. The kernel neither knows nor cares whether those broadcast messages were actually delivered. This cannot really be fixed by simply increasing the buffer size, as there is no magical size X that is enough for all use cases. Also, the system-wide rmem_max limit is quite low (usually 100-200k), while this can easily require several megabytes of buffer when there are hundreds of VLANs.
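To put rough numbers on the buffer problem, here is a minimal C sketch. It is not taken from teamd or libnl, and the helper names (max_queued_msgs, request_rcvbuf) are made up for illustration. The per-message cost is an assumption: the kernel charges more than the payload size for each queued message, so the real capacity is lower than this upper bound.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

/* Upper bound on how many messages fit in a receive buffer, assuming
 * every message costs per_msg_bytes (hypothetical, payload only). */
size_t max_queued_msgs(size_t buf_bytes, size_t per_msg_bytes)
{
    assert(per_msg_bytes > 0);
    return buf_bytes / per_msg_bytes;
}

/* Ask for a receive buffer of want_bytes on fd and return the size the
 * kernel reports afterwards, or -1 on error. On Linux the reported value
 * is double the accepted request, and an unprivileged request is capped
 * by net.core.rmem_max. */
long request_rcvbuf(int fd, int want_bytes)
{
    int got = 0;
    socklen_t len = sizeof(got);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &want_bytes, sizeof(want_bytes)) < 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len) < 0)
        return -1;
    return got;
}
```

With a 32768-byte buffer and 1024-byte messages, max_queued_msgs() gives 32, which is already fewer than the 51 VLANs created in the example above, before counting the events for the ports themselves.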
I don't think there's any elegant way to recover from this situation. All ifinfo is invalid at this point and must be refreshed. Refreshing just the team device and its ports is not enough, since the library side might not have them linked because of the missed events, and the library doesn't know about the teamd configuration.
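As a rough illustration of the "refresh everything" idea, here is a sketch using a raw rtnetlink socket in C. It is not how teamd or libteam is implemented, and all function names are hypothetical. It treats ENOBUFS on the notification socket as "the cache is stale" and asks the kernel for a full RTM_GETLINK dump. Using the same socket for the dump is an assumption made only to keep the example short.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

enum rx_action { RX_RETRY, RX_RESYNC, RX_FATAL };

/* Decide what to do with an errno returned by recv() on the socket. */
enum rx_action classify_rx_errno(int err)
{
    switch (err) {
    case EINTR:
    case EAGAIN:
        return RX_RETRY;   /* nothing was lost, try again later */
    case ENOBUFS:
        return RX_RESYNC;  /* kernel dropped notifications */
    default:
        return RX_FATAL;
    }
}

/* Open an rtnetlink socket subscribed to link notifications. */
int open_link_monitor(void)
{
    struct sockaddr_nl sa;
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

    if (fd < 0)
        return -1;
    memset(&sa, 0, sizeof(sa));
    sa.nl_family = AF_NETLINK;
    sa.nl_groups = RTMGRP_LINK;
    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Request a dump of every link, not only the team device and its ports. */
int request_link_dump(int fd, unsigned int seq)
{
    struct {
        struct nlmsghdr nlh;
        struct ifinfomsg ifi;
    } req;

    assert(fd >= 0);
    memset(&req, 0, sizeof(req));
    req.nlh.nlmsg_len = NLMSG_LENGTH(sizeof(req.ifi));
    req.nlh.nlmsg_type = RTM_GETLINK;
    req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
    req.nlh.nlmsg_seq = seq;
    req.ifi.ifi_family = AF_UNSPEC;
    return send(fd, &req, req.nlh.nlmsg_len, 0) < 0 ? -1 : 0;
}

/* Read one datagram without blocking. On overrun, set *resync and ask
 * for a full dump; the caller should then discard its cached ifinfo.
 * Returns bytes read, 0 if there was nothing to process, -1 on error. */
ssize_t read_link_events(int fd, char *buf, size_t len,
                         unsigned int *seq, int *resync)
{
    ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);

    if (n >= 0)
        return n;
    switch (classify_rx_errno(errno)) {
    case RX_RETRY:
        return 0;
    case RX_RESYNC:
        *resync = 1;
        return request_link_dump(fd, ++*seq) < 0 ? -1 : 0;
    default:
        return -1;
    }
}
```

A caller would open the socket with open_link_monitor(), call read_link_events() from its event loop, and rebuild its whole view of the links from the dump replies whenever resync is set.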
-antti