On Tue, 15 Jan 2019 at 05:47, Jiri Pirko <jiri(a)resnulli.us> wrote:
>The team datapath runs lockless, and libteam can enable/disble
port
>one the fly when traffic happens. For exmaple, I have 4 port and I
>want, port0->slave0,
>port1->slave1, port2->slave2 and port3->slave3, when slave1 device is link
>down state, it becomes port0->slave0, port1->slave2, port2->slave3,
>port3->slave0,
>using the ID%num as the hash function.
>
>But it seems if datapath runs lockless, it could see the wrong ifindex
during
>configuration as stated in the comment because the team drvier needs to
>reconstruct the active ports list. So my question is why this is not a
problem
>when datapathsees a wrong ifindex, wouldn't that break the defined team
>policy ? Thank you very much for your help.
Yes, it would. For a packet from time to time. Is it a problem?
Hi Jiri,
Recently one of our customer found packets are sent before a port was
enabled with lacp. Do you think if there is any relation with the lockles
datapath (although I doubt)?
Here is the debug message, you can see eth0 up at 10:17:49, enabled at
10:17:51
Jan 8 10:16:54 localhost teamd: eth1: lacp info state: 0x3D.
Jan 8 10:17:20 localhost teamd: eth0: lacp info state: 0x3D.
Jan 8 10:17:24 localhost teamd: eth1: lacp info state: 0x3D.
Jan 8 10:17:27 localhost teamd: eth0: Setting periodic timer to "fast".
Jan 8 10:17:27 localhost teamd: eth0: Disabling port
Jan 8 10:17:27 localhost teamd: eth0: Changed port state: "current" ->
"expired"
Jan 8 10:17:27 localhost teamd: eth0: lacp info state: 0x8D.
Jan 8 10:17:27 localhost teamd: eth0: lacp info state: 0x8D.
Jan 8 10:17:27 localhost teamd: <port_list>
Jan 8 10:17:27 localhost teamd: 3: eth1: up 10000Mbit FD
Jan 8 10:17:27 localhost teamd: *2: eth0: down 0Mbit HD
Jan 8 10:17:27 localhost teamd: </port_list>
Jan 8 10:17:27 localhost teamd: eth0: Setting periodic timer to "slow".
Jan 8 10:17:27 localhost teamd: eth0: Changed port state: "expired" ->
"disabled"
Jan 8 10:17:27 localhost teamd: eth0: Unselecting LACP port
Jan 8 10:17:27 localhost teamd: eth0: LACP port unselected from aggregator
3
Jan 8 10:17:27 localhost teamd: eth0: lacp info state: 0x05.
Jan 8 10:17:27 localhost teamd: eth0: ethtool-link went down.
Jan 8 10:17:49 localhost teamd: <port_list>
Jan 8 10:17:49 localhost teamd: 3: eth1: up 10000Mbit FD
Jan 8 10:17:49 localhost teamd: *2: eth0: up 10000Mbit FD
Jan 8 10:17:49 localhost teamd: </port_list>
Jan 8 10:17:49 localhost teamd: eth0: Setting periodic timer to "fast".
Jan 8 10:17:49 localhost teamd: eth0: Changed port state: "disabled" ->
"expired"
Jan 8 10:17:49 localhost teamd: eth0: lacp info state: 0x85.
Jan 8 10:17:49 localhost teamd: eth0: lacp info state: 0x85.
Jan 8 10:17:50 localhost teamd: eth0: lacp info state: 0x85.
Jan 8 10:17:51 localhost teamd: eth0: lacp info state: 0x85.
Jan 8 10:17:51 localhost teamd: eth0: Changed port state: "expired" ->
"current"
Jan 8 10:17:51 localhost teamd: eth0: Selecting LACP port
Jan 8 10:17:51 localhost teamd: eth0: LACP port selected into aggregator 3
Jan 8 10:17:51 localhost teamd: eth0: Enabling port
Jan 8 10:17:51 localhost teamd: eth0: lacp info state: 0x3D.
Jan 8 10:17:52 localhost teamd: eth0: ethtool-link went up.
Jan 8 10:17:52 localhost teamd: eth0: lacp info state: 0x3D.
Jan 8 10:17:52 localhost teamd: eth0: Setting periodic timer to "slow".
Jan 8 10:17:52 localhost teamd: eth0: lacp info state: 0x3D.
Jan 8 10:17:54 localhost teamd: eth1: lacp info state: 0x3D.
Jan 8 10:18:22 localhost teamd: eth0: lacp info state: 0x3D.
While from pcap file, eth0 started to send echo reply immediately after
link up, before enabled (10:17:50 - 10:17:51 in pcap file), which is weird.
Note: the libteam that customer used has applied my lacp patch(posted to
libteam list), which will enable port until partner enter Sync state.
10:17:25.221928 172.25.200.130 → 172.25.200.133 ICMP 102 Echo (ping)
request id=0x79c4, seq=11/2816, ttl=64
10:17:25.221944 172.25.200.133 → 172.25.200.130 ICMP 98 Echo (ping) reply
id=0x79c4, seq=11/2816, ttl=64 (request in 27)
10:17:26.222944 172.25.200.130 → 172.25.200.133 ICMP 102 Echo (ping)
request id=0x79c4, seq=12/3072, ttl=64
10:17:26.222961 172.25.200.133 → 172.25.200.130 ICMP 98 Echo (ping) reply
id=0x79c4, seq=12/3072, ttl=64 (request in 29)
10:17:27.143002 50:2f:a8:1b:4c:7c → 01:80:c2:00:00:02 LACP 128 v1 ACTOR
00:23:04:ee:be:68 P: 16644 K: 32773 *****GSA PARTNER cc:16:7e:91:0b:6c P: 2
K: 0 **DCSG*A
10:17:27.143783 cc:16:7e:91:0b:6c → 01:80:c2:00:00:02 LACP 124 v1 ACTOR
cc:16:7e:91:0b:6c P: 2 K: 0 E***SG*A PARTNER 00:23:04:ee:be:68 P: 16644 K:
32773 *****GSA
10:17:27.226957 172.25.200.133 → 172.25.200.130 ICMP 98 Echo (ping) reply
id=0x79c4, seq=13/3328, ttl=64
10:17:27.262630 172.25.200.133 → 172.25.200.178 ICMP 42 Echo (ping) reply
id=0x3cb0, seq=0/0, ttl=64
10:17:49.766487 cc:16:7e:91:0b:6c → 01:80:c2:00:00:02 LACP 124 v1 ACTOR
cc:16:7e:91:0b:6c P: 2 K: 0 E****G*A PARTNER 00:00:00:00:00:00 P: 0 K: 0
******S*
10:17:49.766828 50:2f:a8:1b:4c:7c → 01:80:c2:00:00:02 LACP 128 v1 ACTOR
00:23:04:ee:be:68 P: 16644 K: 32773 *****GSA PARTNER cc:16:7e:91:0b:6c P: 2
K: 0 E****G*A
10:17:50.263981 172.25.200.133 → 172.25.200.130 ICMP 98 Echo (ping) reply
id=0x79c4, seq=36/9216, ttl=64
10:17:50.766452 cc:16:7e:91:0b:6c → 01:80:c2:00:00:02 LACP 124 v1 ACTOR
cc:16:7e:91:0b:6c P: 2 K: 0 E****G*A PARTNER 00:23:04:ee:be:68 P: 16644 K:
32773 *****GSA
10:17:51.265952 172.25.200.133 → 172.25.200.130 ICMP 98 Echo (ping) reply
id=0x79c4, seq=37/9472, ttl=64
10:17:51.766699 cc:16:7e:91:0b:6c → 01:80:c2:00:00:02 LACP 124 v1 ACTOR
cc:16:7e:91:0b:6c P: 2 K: 0 E****G*A PARTNER 00:23:04:ee:be:68 P: 16644 K:
32773 *****GSA
10:17:51.767298 50:2f:a8:1b:4c:7c → 01:80:c2:00:00:02 LACP 128 v1 ACTOR
00:23:04:ee:be:68 P: 16644 K: 32773 ****SGSA PARTNER cc:16:7e:91:0b:6c P: 2
K: 0 E****G*A
10:17:52.265954 172.25.200.133 → 172.25.200.130 ICMP 98 Echo (ping) reply
id=0x79c4, seq=38/9728, ttl=64
10:17:52.766459 cc:16:7e:91:0b:6c → 01:80:c2:00:00:02 LACP 124 v1 ACTOR
cc:16:7e:91:0b:6c P: 2 K: 0 **DCSG*A PARTNER 00:23:04:ee:be:68 P: 16644 K:
32773 ****SGSA
10:17:52.814804 50:2f:a8:1b:4c:7c → 01:80:c2:00:00:02 LACP 128 v1 ACTOR
00:23:04:ee:be:68 P: 16644 K: 32773 ***CSGSA PARTNER cc:16:7e:91:0b:6c P: 2
K: 0 **DCSG*A
10:17:52.822757 50:2f:a8:1b:4c:7c → 01:80:c2:00:00:02 LACP 128 v1 ACTOR
00:23:04:ee:be:68 P: 16644 K: 32773 **DCSG*A PARTNER cc:16:7e:91:0b:6c P: 2
K: 0 **DCSG*A
10:17:52.822814 cc:16:7e:91:0b:6c → 01:80:c2:00:00:02 LACP 124 v1 ACTOR
cc:16:7e:91:0b:6c P: 2 K: 0 **DCSG*A PARTNER 00:23:04:ee:be:68 P: 16644 K:
32773 **DCSG*A
10:17:53.265957 172.25.200.133 → 172.25.200.130 ICMP 98 Echo (ping) reply
id=0x79c4, seq=39/9984, ttl=64
10:17:54.265952 172.25.200.130 → 172.25.200.133 ICMP 102 Echo (ping)
request id=0x79c4, seq=40/10240, ttl=64
10:17:54.265968 172.25.200.133 → 172.25.200.130 ICMP 98 Echo (ping) reply
id=0x79c4, seq=40/10240, ttl=64 (request in 48)
10:17:55.270934 172.25.200.130 → 172.25.200.133 ICMP 102 Echo (ping)
request id=0x79c4, seq=41/10496, ttl=64
10:17:55.270950 172.25.200.133 → 172.25.200.130 ICMP 98 Echo (ping) reply
id=0x79c4, seq=41/10496, ttl=64 (request in 50)
The only possibility we come up with is that it may caused by
the team_queue_override_transmit(), So I gave customer a patch
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -811,8 +811,11 @@ static bool team_queue_override_transmit(struct team
*team, struct sk_buff *skb)
return false;
qom_list = __team_get_qom_list(team, skb->queue_mapping);
list_for_each_entry_rcu(port, qom_list, qom_list) {
- if (!team_dev_queue_xmit(team, port, skb))
+ if (team_port_enabled(port) &&
+ !team_dev_queue_xmit(team, port, skb))
return true;
}
return false;
}
But I don't have much confidence. Do you have any other idea?
Thanks
Hangbin