On Tue, 15 Jan 2019 at 05:47, Jiri Pirko <jiri@resnulli.us> wrote:
>The team datapath runs lockless, and libteam can enable/disble port
>one the fly when traffic happens. For exmaple,  I have 4 port and I
>want, port0->slave0,
>port1->slave1, port2->slave2 and port3->slave3, when slave1 device is link
>down state,  it becomes port0->slave0, port1->slave2, port2->slave3,
>port3->slave0,
>using the ID%num as the hash function.
>
>But it seems if datapath runs lockless, it could see the wrong ifindex during
>configuration as stated in the comment because the team drvier needs to
>reconstruct the active ports list. So my question is why this is not a problem
>when datapathsees a wrong ifindex, wouldn't that break the defined team
>policy ?  Thank you very much for your help.

Yes, it would. For a packet from time to time. Is it a problem?

Hi Jiri,

Recently one of our customer found packets are sent before a port was enabled with lacp. Do you think if there is any relation with the lockles datapath (although I doubt)?

Here is the debug message, you can see eth0 up at 10:17:49, enabled at 10:17:51

Jan  8 10:16:54 localhost teamd: eth1: lacp info state: 0x3D.
Jan  8 10:17:20 localhost teamd: eth0: lacp info state: 0x3D.
Jan  8 10:17:24 localhost teamd: eth1: lacp info state: 0x3D.
Jan  8 10:17:27 localhost teamd: eth0: Setting periodic timer to "fast".
Jan  8 10:17:27 localhost teamd: eth0: Disabling port
Jan  8 10:17:27 localhost teamd: eth0: Changed port state: "current" -> "expired"
Jan  8 10:17:27 localhost teamd: eth0: lacp info state: 0x8D.
Jan  8 10:17:27 localhost teamd: eth0: lacp info state: 0x8D.
Jan  8 10:17:27 localhost teamd: <port_list>
Jan  8 10:17:27 localhost teamd: 3: eth1: up 10000Mbit FD
Jan  8 10:17:27 localhost teamd: *2: eth0: down 0Mbit HD
Jan  8 10:17:27 localhost teamd: </port_list>
Jan  8 10:17:27 localhost teamd: eth0: Setting periodic timer to "slow".
Jan  8 10:17:27 localhost teamd: eth0: Changed port state: "expired" -> "disabled"
Jan  8 10:17:27 localhost teamd: eth0: Unselecting LACP port
Jan  8 10:17:27 localhost teamd: eth0: LACP port unselected from aggregator 3
Jan  8 10:17:27 localhost teamd: eth0: lacp info state: 0x05.
Jan  8 10:17:27 localhost teamd: eth0: ethtool-link went down.
Jan  8 10:17:49 localhost teamd: <port_list>
Jan  8 10:17:49 localhost teamd: 3: eth1: up 10000Mbit FD
Jan  8 10:17:49 localhost teamd: *2: eth0: up 10000Mbit FD
Jan  8 10:17:49 localhost teamd: </port_list>
Jan  8 10:17:49 localhost teamd: eth0: Setting periodic timer to "fast".
Jan  8 10:17:49 localhost teamd: eth0: Changed port state: "disabled" -> "expired"
Jan  8 10:17:49 localhost teamd: eth0: lacp info state: 0x85.
Jan  8 10:17:49 localhost teamd: eth0: lacp info state: 0x85.
Jan  8 10:17:50 localhost teamd: eth0: lacp info state: 0x85.
Jan  8 10:17:51 localhost teamd: eth0: lacp info state: 0x85.
Jan  8 10:17:51 localhost teamd: eth0: Changed port state: "expired" -> "current"
Jan  8 10:17:51 localhost teamd: eth0: Selecting LACP port
Jan  8 10:17:51 localhost teamd: eth0: LACP port selected into aggregator 3
Jan  8 10:17:51 localhost teamd: eth0: Enabling port
Jan  8 10:17:51 localhost teamd: eth0: lacp info state: 0x3D.
Jan  8 10:17:52 localhost teamd: eth0: ethtool-link went up.
Jan  8 10:17:52 localhost teamd: eth0: lacp info state: 0x3D.
Jan  8 10:17:52 localhost teamd: eth0: Setting periodic timer to "slow".
Jan  8 10:17:52 localhost teamd: eth0: lacp info state: 0x3D.
Jan  8 10:17:54 localhost teamd: eth1: lacp info state: 0x3D.
Jan  8 10:18:22 localhost teamd: eth0: lacp info state: 0x3D.

While from pcap file, eth0 started to send echo reply immediately after link up, before enabled (10:17:50 - 10:17:51 in pcap file), which is weird.

Note: the libteam that customer used has applied my lacp patch(posted to libteam list), which will enable port until partner enter Sync state.

10:17:25.221928 172.25.200.130 → 172.25.200.133 ICMP 102 Echo (ping) request  id=0x79c4, seq=11/2816, ttl=64
10:17:25.221944 172.25.200.133 → 172.25.200.130 ICMP 98 Echo (ping) reply    id=0x79c4, seq=11/2816, ttl=64 (request in 27)
10:17:26.222944 172.25.200.130 → 172.25.200.133 ICMP 102 Echo (ping) request  id=0x79c4, seq=12/3072, ttl=64
10:17:26.222961 172.25.200.133 → 172.25.200.130 ICMP 98 Echo (ping) reply    id=0x79c4, seq=12/3072, ttl=64 (request in 29)
10:17:27.143002 50:2f:a8:1b:4c:7c → 01:80:c2:00:00:02 LACP 128 v1 ACTOR 00:23:04:ee:be:68 P: 16644 K: 32773 *****GSA PARTNER cc:16:7e:91:0b:6c P: 2 K: 0 **DCSG*A
10:17:27.143783 cc:16:7e:91:0b:6c → 01:80:c2:00:00:02 LACP 124 v1 ACTOR cc:16:7e:91:0b:6c P: 2 K: 0 E***SG*A PARTNER 00:23:04:ee:be:68 P: 16644 K: 32773 *****GSA
10:17:27.226957 172.25.200.133 → 172.25.200.130 ICMP 98 Echo (ping) reply    id=0x79c4, seq=13/3328, ttl=64
10:17:27.262630 172.25.200.133 → 172.25.200.178 ICMP 42 Echo (ping) reply    id=0x3cb0, seq=0/0, ttl=64
10:17:49.766487 cc:16:7e:91:0b:6c → 01:80:c2:00:00:02 LACP 124 v1 ACTOR cc:16:7e:91:0b:6c P: 2 K: 0 E****G*A PARTNER 00:00:00:00:00:00 P: 0 K: 0 ******S*
10:17:49.766828 50:2f:a8:1b:4c:7c → 01:80:c2:00:00:02 LACP 128 v1 ACTOR 00:23:04:ee:be:68 P: 16644 K: 32773 *****GSA PARTNER cc:16:7e:91:0b:6c P: 2 K: 0 E****G*A
10:17:50.263981 172.25.200.133 → 172.25.200.130 ICMP 98 Echo (ping) reply    id=0x79c4, seq=36/9216, ttl=64
10:17:50.766452 cc:16:7e:91:0b:6c → 01:80:c2:00:00:02 LACP 124 v1 ACTOR cc:16:7e:91:0b:6c P: 2 K: 0 E****G*A PARTNER 00:23:04:ee:be:68 P: 16644 K: 32773 *****GSA
10:17:51.265952 172.25.200.133 → 172.25.200.130 ICMP 98 Echo (ping) reply    id=0x79c4, seq=37/9472, ttl=64
10:17:51.766699 cc:16:7e:91:0b:6c → 01:80:c2:00:00:02 LACP 124 v1 ACTOR cc:16:7e:91:0b:6c P: 2 K: 0 E****G*A PARTNER 00:23:04:ee:be:68 P: 16644 K: 32773 *****GSA
10:17:51.767298 50:2f:a8:1b:4c:7c → 01:80:c2:00:00:02 LACP 128 v1 ACTOR 00:23:04:ee:be:68 P: 16644 K: 32773 ****SGSA PARTNER cc:16:7e:91:0b:6c P: 2 K: 0 E****G*A
10:17:52.265954 172.25.200.133 → 172.25.200.130 ICMP 98 Echo (ping) reply    id=0x79c4, seq=38/9728, ttl=64
10:17:52.766459 cc:16:7e:91:0b:6c → 01:80:c2:00:00:02 LACP 124 v1 ACTOR cc:16:7e:91:0b:6c P: 2 K: 0 **DCSG*A PARTNER 00:23:04:ee:be:68 P: 16644 K: 32773 ****SGSA
10:17:52.814804 50:2f:a8:1b:4c:7c → 01:80:c2:00:00:02 LACP 128 v1 ACTOR 00:23:04:ee:be:68 P: 16644 K: 32773 ***CSGSA PARTNER cc:16:7e:91:0b:6c P: 2 K: 0 **DCSG*A
10:17:52.822757 50:2f:a8:1b:4c:7c → 01:80:c2:00:00:02 LACP 128 v1 ACTOR 00:23:04:ee:be:68 P: 16644 K: 32773 **DCSG*A PARTNER cc:16:7e:91:0b:6c P: 2 K: 0 **DCSG*A
10:17:52.822814 cc:16:7e:91:0b:6c → 01:80:c2:00:00:02 LACP 124 v1 ACTOR cc:16:7e:91:0b:6c P: 2 K: 0 **DCSG*A PARTNER 00:23:04:ee:be:68 P: 16644 K: 32773 **DCSG*A
10:17:53.265957 172.25.200.133 → 172.25.200.130 ICMP 98 Echo (ping) reply    id=0x79c4, seq=39/9984, ttl=64
10:17:54.265952 172.25.200.130 → 172.25.200.133 ICMP 102 Echo (ping) request  id=0x79c4, seq=40/10240, ttl=64
10:17:54.265968 172.25.200.133 → 172.25.200.130 ICMP 98 Echo (ping) reply    id=0x79c4, seq=40/10240, ttl=64 (request in 48)
10:17:55.270934 172.25.200.130 → 172.25.200.133 ICMP 102 Echo (ping) request  id=0x79c4, seq=41/10496, ttl=64
10:17:55.270950 172.25.200.133 → 172.25.200.130 ICMP 98 Echo (ping) reply    id=0x79c4, seq=41/10496, ttl=64 (request in 50)

The only possibility we come up with is that it may caused by the team_queue_override_transmit(), So I gave customer a patch 

--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -811,8 +811,11 @@ static bool team_queue_override_transmit(struct team *team, struct sk_buff *skb)
                return false;
        qom_list = __team_get_qom_list(team, skb->queue_mapping);
        list_for_each_entry_rcu(port, qom_list, qom_list) {
-               if (!team_dev_queue_xmit(team, port, skb))
+               if (team_port_enabled(port) &&
+                   !team_dev_queue_xmit(team, port, skb))
                        return true;
        }
        return false;
 }

But I don't have much confidence. Do you have any other idea?

Thanks
Hangbin