Wed, Sep 02, 2020 at 12:06:15PM CEST, lucien.xin(a)gmail.com wrote:
>On Fri, Nov 29, 2019 at 12:35 AM Jiri Pirko <jiri(a)resnulli.us> wrote:
>
> Wed, Nov 13, 2019 at 02:26:47PM CET, petrm(a)mellanox.com wrote:
> >On systems where carrier is gained very quickly, there is a race between
> >teamd and the kernel that sometimes leads to all team slaves being stuck in
> >enabled=false state.
> >
> >When a port is enslaved to a team device, the kernel sends a netlink
> >message marking the port as enabled. teamd's lb_event_watch_port_added()
> >calls team_set_port_enabled(false), because link is down at that point. The
> >kernel responds with a message marking the port as disabled. At this point,
> >there are two outstanding messages: the initial one marking port as
> >enabled, and the second one marking it as disabled. teamd has not processed
> >either of these.
> >
> >Next teamd gets the netlink message that sets enabled=true, and updates its
> >internal cache accordingly. If at this point ethtool link-watch wakes up,
> >teamd considers (in teamd_port_check_enable()) enabling the port. After
> >consulting the cache, it concludes the port is already up, and neglects to
> >do so. Only then does teamd get the netlink message informing it of setting
> >enabled=false.
> >
> >The problem is that the teamd cache is not synchronous with respect to the
> >kernel state. If the carrier takes a while to come up (as is normally the
> >case), this is not a problem, because teamd catches up quickly enough. But
> >this may not always be the case, and particularly on a simulated system,
> >the carrier is gained almost immediately.
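The message ordering described above can be sketched as a short simulation (a hypothetical illustration, not teamd code: the two booleans stand in for the enabled=true/enabled=false netlink messages, and `carrier_up_immediately` selects whether the ethtool link-watch fires before or after teamd's cache has caught up):

```python
from collections import deque

def simulate(carrier_up_immediately):
    # Two outstanding netlink messages: enabled=true (from enslaving),
    # then enabled=false (kernel's response to team_set_port_enabled(false)).
    msgs = deque([True, False])
    cache = None  # teamd's cached enabled state for the port

    if carrier_up_immediately:
        cache = msgs.popleft()       # teamd caches the stale enabled=true
        # link-watch wakes up now; the cache says the port is already
        # enabled, so teamd_port_check_enable() does nothing
        if not cache:
            msgs.append(True)
        while msgs:
            cache = msgs.popleft()   # enabled=false lands last: port stuck
    else:
        while msgs:
            cache = msgs.popleft()   # cache settles to enabled=false first
        # link-watch fires once the cache is in sync, so teamd enables the port
        if not cache:
            msgs.append(True)
            cache = msgs.popleft()
    return cache

print(simulate(False))   # slow carrier: port ends up enabled
print(simulate(True))    # fast carrier: port stuck disabled
```

In the fast-carrier ordering the disable message is processed last, so the port stays enabled=false even though the link is up, which matches the stuck state described above.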
>Hi, Petr
>
>We've reverted this patch due to a regression. Now I'm thinking of fixing
>it in another way, if the issue still exists.
>Do you have a reproducer for this race issue?
Hi. Any progress with this?
>
>Thanks.