libteam January 2015

libteam@lists.fedorahosted.org

2 participants
3 discussions

Strange problems with CentOS 7 and roundrobin runner

by Ingo Brand

Hello,

I am currently setting up a highly available NFS server cluster with 2 nodes using pacemaker + corosync + drbd + flashcache. The two nodes (media1 and media2) have four 1G copper NICs each. I formed 2 teams:

team0: LACP with em1 and em2, connected to a switch for NFS services
team1: roundrobin with em3 and em4 and MTU 9000, directly connected with copper patch cables for drbd sync

During my testing everything was working as expected. Then I started to transfer about 20TB from an old server to the new cluster. The copy process was fine for about 32 hours. Then corosync and drbd started to report errors on the roundrobin team1 interface (see logs at the end of this email). The drbd sync and corosync flapped between states for several hours on team1.

I checked the error counters with ifconfig and found a huge number of TX dropped packets on team1 of the drbd slave node media2. No other error counters were rising, only those of team1 on media2. To make this a bit clearer:

- team1 on media2 (the drbd slave) showed rising TX dropped packets.
- em3 and em4 on media2 (the physical interfaces of team1) did not show any rising error counters.
- team1 on media1 (the drbd master) did not show any rising error counters.
- em3 and em4 on media1 (the physical interfaces of team1) did not show any rising error counters.

I then thought maybe one of the two patch cables between the two machines was not 100% ok. So I removed em4 from team1 on both machines:

"teamdctl team1 port remove em4"

The error counters stopped rising. I thought: "Yeah! That cable must be broken!" But just to make sure I had really found the problem, I re-enabled em4 on both nodes:

"teamdctl team1 port add em4"

Ok, now both team1 interfaces had 2 links up again and the error counter started to rise again. I then stopped em3 on both nodes to force all traffic through em4:

"teamdctl team1 port remove em3"

And guess what? The error counters on team1 stopped rising!
So in short: If I remove either one of the two physical interfaces from team1, everything works without any errors. As soon as I enable both physical interfaces, the error counters start to rise.

After these tests I rebooted the slave node (media2) and added both ports to team1 again (Have you tried turning it off and on again...). After doing this, I now see TX dropped packets on the team1 interface of media1, which has not been rebooted yet.

Why do I only see these errors on the team1 interface and not on em3 and em4?

Currently the copy process of 20TB is still running. I think that if I reboot media1 everything will work as expected again for some time, because it worked during my initial tests. But I do think that I hit a bug in the teaming driver and that the rising error counters will come back.

Could anybody help?

Kind regards
Ingo

This is the network config used for team1:

media1:

cat /etc/sysconfig/network-scripts/ifcfg-team1_slave_0
# Generated by parse-kickstart
NAME=team1 slave 0
TEAM_MASTER=team1
DEVICETYPE=TeamPort
DEVICE=em3
ONBOOT=yes
UUID=b7026c5b-e9cd-457c-93d2-a2799361ed90

cat /etc/sysconfig/network-scripts/ifcfg-team1_slave_1
# Generated by parse-kickstart
NAME=team1 slave 1
TEAM_MASTER=team1
DEVICETYPE=TeamPort
DEVICE=em4
ONBOOT=yes
UUID=fbe5bc56-95f5-4f9e-a5ea-ab6b6f5e5b50

cat /etc/sysconfig/network-scripts/ifcfg-team1
# Generated by parse-kickstart
UUID=4e0090cd-32d3-4c5b-be23-9f53083da4dd
NAME="Team connection team1"
TEAM_CONFIG="{\"runner\": {\"name\": \"roundrobin\"}}"
GATEWAY=
IPV6_AUTOCONF=yes
BOOTPROTO=none
DEVICE=team1
MTU=9000
TYPE=Team
ONBOOT=yes
IPV6INIT=yes
DEVICETYPE=Team
IPADDR0=192.168.101.31
PREFIX0=24
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6_DEFROUTE=yes
IPV6_PEERDNS=yes
IPV6_PEERROUTES=yes
IPV6_FAILURE_FATAL=no

media2:

cat /etc/sysconfig/network-scripts/ifcfg-team1_slave_0
# Generated by parse-kickstart
NAME=team1 slave 0
TEAM_MASTER=team1
DEVICETYPE=TeamPort
DEVICE=em3
ONBOOT=yes
UUID=bfd8e77e-7eb8-47ea-ab7e-95286e073202

cat /etc/sysconfig/network-scripts/ifcfg-team1_slave_1
# Generated by parse-kickstart
NAME=team1 slave 1
TEAM_MASTER=team1
DEVICETYPE=TeamPort
DEVICE=em4
ONBOOT=yes
UUID=93c5dfd8-3b81-4d1e-8ba9-c46f8daceb1a

cat /etc/sysconfig/network-scripts/ifcfg-team1
# Generated by parse-kickstart
UUID=79c843db-996c-41e4-9ee5-2a8f3da244ed
NAME="Team connection team1"
TEAM_CONFIG="{\"runner\": {\"name\": \"roundrobin\"}}"
GATEWAY=
IPV6_AUTOCONF=yes
BOOTPROTO=none
DEVICE=team1
MTU=9000
TYPE=Team
ONBOOT=yes
IPV6INIT=yes
DEVICETYPE=Team
IPADDR0=192.168.101.32
PREFIX0=24
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6_DEFROUTE=yes
IPV6_PEERDNS=yes
IPV6_PEERROUTES=yes
IPV6_FAILURE_FATAL=no

Here are some system logs from both machines:

============================================================
media1:
============================================================
Nov 6 02:31:27 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412041 iface 192.168.100.31 to [1 of 10]
Nov 6 02:31:29 media1 corosync[3657]: [TOTEM ] ring 0 active with no faults
Nov 6 02:31:29 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412047 iface 192.168.100.31 to [1 of 10]
Nov 6 02:31:31 media1 corosync[3657]: [TOTEM ] ring 0 active with no faults
Nov 6 02:32:51 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412315 iface 192.168.101.31 to [1 of 10]
Nov 6 02:32:53 media1 corosync[3657]: [TOTEM ] ring 1 active with no faults
Nov 6 02:32:53 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412319 iface 192.168.101.31 to [1 of 10]
Nov 6 02:32:54 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412321 iface 192.168.101.31 to [2 of 10]
Nov 6 02:32:55 media1 corosync[3657]: [TOTEM ] Decrementing problem counter for iface 192.168.101.31 to [1 of 10]
Nov 6 02:32:56 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412325 iface 192.168.101.31 to [2 of 10]
Nov 6 02:32:57 media1 corosync[3657]: [TOTEM ] Decrementing problem counter for iface 192.168.101.31 to [1 of 10]
Nov 6 02:32:58 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412327 iface 192.168.101.31 to [2 of 10]
Nov 6 02:32:59 media1 kernel: drbd r0: PingAck did not arrive in time.
Nov 6 02:32:59 media1 kernel: drbd r0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Nov 6 02:32:59 media1 kernel: block drbd0: new current UUID 0C0308FA4F94F811:59E43E5FD7B26A4F:1963884285B385E6:1962884285B385E6
Nov 6 02:32:59 media1 kernel: drbd r0: asender terminated
Nov 6 02:32:59 media1 kernel: drbd r0: Terminating drbd_a_r0
Nov 6 02:32:59 media1 kernel: drbd r0: Connection closed
Nov 6 02:32:59 media1 kernel: drbd r0: conn( NetworkFailure -> Unconnected )
Nov 6 02:32:59 media1 kernel: drbd r0: receiver terminated
Nov 6 02:32:59 media1 kernel: drbd r0: Restarting receiver thread
Nov 6 02:32:59 media1 kernel: drbd r0: receiver (re)started
Nov 6 02:32:59 media1 kernel: drbd r0: conn( Unconnected -> WFConnection )
Nov 6 02:32:59 media1 corosync[3657]: [TOTEM ] Decrementing problem counter for iface 192.168.101.31 to [1 of 10]
Nov 6 02:32:59 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412329 iface 192.168.101.31 to [2 of 10]
Nov 6 02:33:01 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412333 iface 192.168.101.31 to [3 of 10]
Nov 6 02:33:01 media1 corosync[3657]: [TOTEM ] Decrementing problem counter for iface 192.168.101.31 to [2 of 10]
Nov 6 02:33:01 media1 crmd[4960]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Nov 6 02:33:01 media1 pengine[4959]: notice: unpack_config: On loss of CCM Quorum: Ignore
Nov 6 02:33:01 media1 pengine[4959]: warning: unpack_rsc_op: Processing failed op monitor for res_nfsserver_mediafiles on media1: unknown error (1)
Nov 6 02:33:01 media1 pengine[4959]: warning: unpack_rsc_op: Processing failed op monitor for res_nfsserver_mediafiles on media2: unknown error (1)
Nov 6 02:33:01 media1 pengine[4959]: notice: process_pe_message: Calculated Transition 144: /var/lib/pacemaker/pengine/pe-input-227.bz2
Nov 6 02:33:01 media1 crmd[4960]: notice: run_graph: Transition 144 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-227.bz2): Complete
Nov 6 02:33:01 media1 crmd[4960]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Nov 6 02:33:01 media1 kernel: drbd r0: Handshake successful: Agreed network protocol version 101
Nov 6 02:33:01 media1 kernel: drbd r0: Agreed to support TRIM on protocol level
Nov 6 02:33:01 media1 kernel: drbd r0: Peer authenticated using 20 bytes HMAC
Nov 6 02:33:01 media1 kernel: drbd r0: conn( WFConnection -> WFReportParams )
Nov 6 02:33:01 media1 kernel: drbd r0: Starting asender thread (from drbd_r_r0 [5999])
Nov 6 02:33:02 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412337 iface 192.168.101.31 to [3 of 10]
Nov 6 02:33:02 media1 kernel: block drbd0: drbd_sync_handshake:
Nov 6 02:33:02 media1 kernel: block drbd0: self 0C0308FA4F94F811:59E43E5FD7B26A4F:1963884285B385E6:1962884285B385E6 bits:192101 flags:0
Nov 6 02:33:02 media1 kernel: block drbd0: peer 59E43E5FD7B26A4E:0000000000000000:1963884285B385E6:1962884285B385E6 bits:0 flags:0
Nov 6 02:33:02 media1 kernel: block drbd0: uuid_compare()=1 by rule 70
Nov 6 02:33:02 media1 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
Nov 6 02:33:02 media1 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 185(1), total 185; compression: 100.0%
Nov 6 02:33:02 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412339 iface 192.168.101.31 to [4 of 10]
Nov 6 02:33:03 media1 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 185(1), total 185; compression: 100.0%
Nov 6 02:33:03 media1 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0
Nov 6 02:33:03 media1 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
Nov 6 02:33:03 media1 kernel: block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
Nov 6 02:33:03 media1 kernel: block drbd0: Began resync as SyncSource (will sync 919244 KB [229811 bits set]).
Nov 6 02:33:03 media1 kernel: block drbd0: updated sync UUID 0C0308FA4F94F811:59E53E5FD7B26A4F:59E43E5FD7B26A4F:1963884285B385E6
Nov 6 02:33:03 media1 corosync[3657]: [TOTEM ] Decrementing problem counter for iface 192.168.101.31 to [3 of 10]
Nov 6 02:33:03 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412343 iface 192.168.101.31 to [4 of 10]
Nov 6 02:33:04 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412345 iface 192.168.101.31 to [5 of 10]
Nov 6 02:33:05 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412347 iface 192.168.101.31 to [6 of 10]
Nov 6 02:33:05 media1 corosync[3657]: [TOTEM ] Decrementing problem counter for iface 192.168.101.31 to [5 of 10]
Nov 6 02:33:05 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412349 iface 192.168.101.31 to [6 of 10]
Nov 6 02:33:06 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412351 iface 192.168.101.31 to [7 of 10]
Nov 6 02:33:07 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412355 iface 192.168.101.31 to [8 of 10]
Nov 6 02:33:07 media1 corosync[3657]: [TOTEM ] Decrementing problem counter for iface 192.168.101.31 to [7 of 10]
Nov 6 02:33:07 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412357 iface 192.168.101.31 to [8 of 10]
Nov 6 02:33:08 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412361 iface 192.168.101.31 to [9 of 10]
Nov 6 02:33:09 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412363 iface 192.168.101.31 to [10 of 10]
Nov 6 02:33:09 media1 corosync[3657]: [TOTEM ] Marking seqid 412363 ringid 1 interface 192.168.101.31 FAULTY
Nov 6 02:33:11 media1 crmd[4960]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Nov 6 02:33:11 media1 pengine[4959]: notice: unpack_config: On loss of CCM Quorum: Ignore
Nov 6 02:33:11 media1 pengine[4959]: warning: unpack_rsc_op: Processing failed op monitor for res_nfsserver_mediafiles on media1: unknown error (1)
Nov 6 02:33:11 media1 pengine[4959]: warning: unpack_rsc_op: Processing failed op monitor for res_nfsserver_mediafiles on media2: unknown error (1)
Nov 6 02:33:11 media1 pengine[4959]: notice: process_pe_message: Calculated Transition 145: /var/lib/pacemaker/pengine/pe-input-228.bz2
Nov 6 02:33:11 media1 crmd[4960]: notice: run_graph: Transition 145 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-228.bz2): Complete
Nov 6 02:33:11 media1 crmd[4960]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Nov 6 02:33:17 media1 corosync[3657]: [TOTEM ] Automatically recovered ring 1
Nov 6 02:33:18 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412387 iface 192.168.101.31 to [1 of 10]
Nov 6 02:33:20 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412389 iface 192.168.101.31 to [2 of 10]
Nov 6 02:33:20 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412391 iface 192.168.101.31 to [3 of 10]
Nov 6 02:33:21 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412393 iface 192.168.101.31 to [4 of 10]
Nov 6 02:33:22 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412397 iface 192.168.101.31 to [5 of 10]
Nov 6 02:33:22 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412399 iface 192.168.101.31 to [6 of 10]
Nov 6 02:33:24 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412403 iface 192.168.101.31 to [7 of 10]
Nov 6 02:33:31 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412421 iface 192.168.101.31 to [8 of 10]
Nov 6 02:33:32 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412423 iface 192.168.101.31 to [9 of 10]
Nov 6 02:33:34 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412429 iface 192.168.101.31 to [10 of 10]
Nov 6 02:33:34 media1 corosync[3657]: [TOTEM ] Marking seqid 412429 ringid 1 interface 192.168.101.31 FAULTY
Nov 6 02:33:36 media1 corosync[3657]: [TOTEM ] Automatically recovered ring 1
Nov 6 02:33:39 media1 corosync[3657]: [TOTEM ] Incrementing problem counter for seqid 412437 iface 192.168.101.31 to [1 of 10]

============================================================
media2:
============================================================
Nov 6 02:32:59 media2 kernel: drbd r0: sock was shut down by peer
Nov 6 02:32:59 media2 kernel: drbd r0: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Nov 6 02:32:59 media2 kernel: drbd r0: short read (expected size 16)
Nov 6 02:32:59 media2 kernel: drbd r0: asender terminated
Nov 6 02:32:59 media2 kernel: drbd r0: Terminating drbd_a_r0
Nov 6 02:32:59 media2 kernel: drbd r0: Connection closed
Nov 6 02:32:59 media2 kernel: drbd r0: conn( BrokenPipe -> Unconnected )
Nov 6 02:32:59 media2 kernel: drbd r0: receiver terminated
Nov 6 02:32:59 media2 kernel: drbd r0: Restarting receiver thread
Nov 6 02:32:59 media2 kernel: drbd r0: receiver (re)started
Nov 6 02:32:59 media2 kernel: drbd r0: conn( Unconnected -> WFConnection )
Nov 6 02:33:01 media2 attrd[4940]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-res_drbd_mediafiles (1000)
Nov 6 02:33:01 media2 attrd[4940]: notice: attrd_perform_update: Sent update 16: master-res_drbd_mediafiles=1000
Nov 6 02:33:01 media2 kernel: drbd r0: Handshake successful: Agreed network protocol version 101
Nov 6 02:33:01 media2 kernel: drbd r0: Agreed to support TRIM on protocol level
Nov 6 02:33:01 media2 kernel: drbd r0: Peer authenticated using 20 bytes HMAC
Nov 6 02:33:01 media2 kernel: drbd r0: conn( WFConnection -> WFReportParams )
Nov 6 02:33:01 media2 kernel: drbd r0: Starting asender thread (from drbd_r_r0 [6367])
Nov 6 02:33:01 media2 kernel: block drbd0: drbd_sync_handshake:
Nov 6 02:33:01 media2 kernel: block drbd0: self 59E43E5FD7B26A4E:0000000000000000:1963884285B385E6:1962884285B385E6 bits:0 flags:0
Nov 6 02:33:01 media2 kernel: block drbd0: peer 0C0308FA4F94F811:59E43E5FD7B26A4F:1963884285B385E6:1962884285B385E6 bits:192101 flags:0
Nov 6 02:33:01 media2 kernel: block drbd0: uuid_compare()=-1 by rule 50
Nov 6 02:33:01 media2 kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> UpToDate )
Nov 6 02:33:02 media2 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 185(1), total 185; compression: 100.0%
Nov 6 02:33:03 media2 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 185(1), total 185; compression: 100.0%
Nov 6 02:33:03 media2 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
Nov 6 02:33:03 media2 kernel: block drbd0: updated sync uuid 59E53E5FD7B26A4E:0000000000000000:1963884285B385E6:1962884285B385E6
Nov 6 02:33:03 media2 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Nov 6 02:33:03 media2 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Nov 6 02:33:03 media2 kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
Nov 6 02:33:03 media2 kernel: block drbd0: Began resync as SyncTarget (will sync 919244 KB [229811 bits set]).
Nov 6 02:33:10 media2 corosync[3690]: [TOTEM ] Incrementing problem counter for seqid 412364 iface 192.168.101.32 to [1 of 10]
Nov 6 02:33:10 media2 corosync[3690]: [TOTEM ] Automatically recovered ring 1
Nov 6 02:33:10 media2 corosync[3690]: [TOTEM ] Incrementing problem counter for seqid 412366 iface 192.168.101.32 to [1 of 10]
Nov 6 02:33:11 media2 corosync[3690]: [TOTEM ] Automatically recovered ring 1
Nov 6 02:33:11 media2 attrd[4940]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-res_drbd_mediafiles (10)
Nov 6 02:33:11 media2 attrd[4940]: notice: attrd_perform_update: Sent update 18: master-res_drbd_mediafiles=10
Nov 6 02:33:11 media2 corosync[3690]: [TOTEM ] Incrementing problem counter for seqid 412368 iface 192.168.101.32 to [1 of 10]
Nov 6 02:33:12 media2 corosync[3690]: [TOTEM ] ring 1 active with no faults
Nov 6 02:33:12 media2 corosync[3690]: [TOTEM ] Incrementing problem counter for seqid 412370 iface 192.168.101.32 to [1 of 10]
Nov 6 02:33:12 media2 corosync[3690]: [TOTEM ] Automatically recovered ring 1
Nov 6 02:33:12 media2 corosync[3690]: [TOTEM ] Incrementing problem counter for seqid 412372 iface 192.168.101.32 to [1 of 10]
Nov 6 02:33:13 media2 corosync[3690]: [TOTEM ] Incrementing problem counter for seqid 412374 iface 192.168.101.32 to [2 of 10]
Nov 6 02:33:14 media2 corosync[3690]: [TOTEM ] Decrementing problem counter for iface 192.168.101.32 to [1 of 10]
Nov 6 02:33:14 media2 corosync[3690]: [TOTEM ] Incrementing problem counter for seqid 412376 iface 192.168.101.32 to [2 of 10]
Nov 6 02:33:15 media2 corosync[3690]: [TOTEM ] Incrementing problem counter for seqid 412378 iface 192.168.101.32 to [3 of 10]
Nov 6 02:33:15 media2 corosync[3690]: [TOTEM ] Automatically recovered ring 1
Nov 6 02:33:15 media2 corosync[3690]: [TOTEM ] Incrementing problem counter for seqid 412380 iface 192.168.101.32 to [1 of 10]
Nov 6 02:33:16 media2 corosync[3690]: [TOTEM ] ring 1 active with no faults
Nov 6 02:33:16 media2 corosync[3690]: [TOTEM ] Incrementing problem counter for seqid 412382 iface 192.168.101.32 to [1 of 10]
Nov 6 02:33:16 media2 corosync[3690]: [TOTEM ] Automatically recovered ring 1
Nov 6 02:33:17 media2 corosync[3690]: [TOTEM ] Incrementing problem counter for seqid 412384 iface 192.168.101.32 to [1 of 10]
Nov 6 02:33:17 media2 corosync[3690]: [TOTEM ] Automatically recovered ring 1
Nov 6 02:33:17 media2 corosync[3690]: [TOTEM ] Incrementing problem counter for seqid 412386 iface 192.168.101.32 to [1 of 10]
Nov 6 02:33:18 media2 corosync[3690]: [TOTEM ] ring 1 active with no faults
Nov 6 02:33:19 media2 corosync[3690]: [TOTEM ] Incrementing problem counter for seqid 412388 iface 192.168.100.32 to [1 of 10]
Nov 6 02:33:20 media2 corosync[3690]: [TOTEM ] ring 0 active with no faults
Nov 6 02:33:35 media2 corosync[3690]: [TOTEM ] Automatically recovered ring 1
Nov 6 02:33:36 media2 corosync[3690]: [TOTEM ] Incrementing problem counter for seqid 412430 iface 192.168.101.32 to [1 of 10]
Nov 6 02:33:36 media2 corosync[3690]: [TOTEM ] Automatically recovered ring 1
Nov 6 02:33:37 media2 corosync[3690]: [TOTEM ] Incrementing problem counter for seqid 412432 iface 192.168.101.32 to [1 of 10]
Nov 6 02:33:38 media2 corosync[3690]: [TOTEM ] ring 1 active with no faults
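
The counter comparison described in this message (team1 against its ports em3 and em4) can also be done without ifconfig, by reading the per-interface statistics the kernel exposes under /sys/class/net. The following is only a minimal sketch, assuming a POSIX shell. The function names and the SYSFS_NET override are made up for this illustration and are not part of libteam or teamd.

```shell
#!/bin/sh
# Illustrative helpers only; the names below are hypothetical.
# SYSFS_NET is an assumed override so the functions can be pointed at a
# copy of the sysfs tree; by default the real /sys/class/net is read.

# Print the current TX dropped counter of one interface.
tx_dropped() {
    cat "${SYSFS_NET:-/sys/class/net}/$1/statistics/tx_dropped"
}

# Usage: tx_dropped_delta SECONDS IFACE...
# For each interface in turn: read the counter, wait SECONDS, read it
# again, and print "IFACE DELTA". Interfaces are sampled one after the
# other, so the total runtime is SECONDS times the number of interfaces.
tx_dropped_delta() {
    interval=$1
    shift
    for iface in "$@"; do
        before=$(tx_dropped "$iface") || return 1
        sleep "$interval"
        after=$(tx_dropped "$iface") || return 1
        echo "$iface $((after - before))"
    done
}
```

With both ports enabled, something like `tx_dropped_delta 10 team1 em3 em4` on each node would show whether only the team device is counting drops, which is the pattern reported above.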


SegFault (was Re: Setting default gateway?)

by PanaColina

Trying on my own to see if I could get anything to work, I got a segfault when I tried using arp_ping rather than ethtool for the link-watchers. As this occurs after the teamd fails to set the port hardware address, this may not be directly related to my problem, but I thought it worth mentioning. Below is the relevant syslog.

Thank you for any suggestions on getting teamd to work.

J.L. Hill

teamd 1.15
Kernel 3.18.1
Debian Sid

Jan 8 13:48:24 thor teamd_myteam0[16354]: Added loop callback: daemon, 0x1f323c0
Jan 8 13:48:24 thor teamd_myteam0[16354]: Added loop callback: libteam_events, 0x1f323c0
Jan 8 13:48:24 thor teamd_myteam0[16354]: Added loop callback: workq, 0x1f323c0
Jan 8 13:48:24 thor teamd_myteam0[16354]: Using team runner "activebackup".
Jan 8 13:48:24 thor teamd_myteam0[16354]: Using hwaddr_policy "same_all".
Jan 8 13:48:24 thor teamd_myteam0[16354]: usock: Using sockpath "/var/run/teamd/myteam0.sock"
Jan 8 13:48:24 thor teamd_myteam0[16354]: Added loop callback: usock, 0x1f323c0
Jan 8 13:48:24 thor teamd_myteam0[16354]: <ifinfo_list>
Jan 8 13:48:24 thor teamd_myteam0[16354]: 8: myteam0: 7e:4f:85:20:c6:49: 0
Jan 8 13:48:24 thor teamd_myteam0[16354]: </ifinfo_list>
Jan 8 13:48:24 thor teamd_myteam0[16354]: <port_list>
Jan 8 13:48:24 thor teamd_myteam0[16354]: </port_list>
Jan 8 13:48:24 thor teamd_myteam0[16354]: eth0: Adding port (found ifindex "2").
Jan 8 13:48:24 thor teamd_myteam0[16354]: eth1: Adding port (found ifindex "3").
Jan 8 13:48:24 thor teamd_myteam0[16354]: 1.15 successfully started.
Jan 8 13:48:24 thor teamd_myteam0[16354]: <ifinfo_list>
Jan 8 13:48:24 thor teamd_myteam0[16354]: 8: myteam0: 7e:4f:85:20:c6:49: 0
Jan 8 13:48:24 thor teamd_myteam0[16354]: </ifinfo_list>
Jan 8 13:48:24 thor teamd_myteam0[16354]: <port_list>
Jan 8 13:48:24 thor teamd_myteam0[16354]: *2: eth0: up 100Mbit FD
Jan 8 13:48:24 thor teamd_myteam0[16354]: </port_list>
Jan 8 13:48:24 thor teamd_myteam0[16354]: eth0: Got link watch from port config.
Jan 8 13:48:24 thor teamd_myteam0[16354]: eth0: Using sticky "0".
Jan 8 13:48:24 thor teamd_myteam0[16354]: eth0: Failed to set port hardware address.
Jan 8 13:48:24 thor teamd_myteam0[16354]: Failed to init port priv.
Jan 8 13:48:24 thor teamd_myteam0[16354]: Callback named "lw_periodic" not found.
Jan 8 13:48:24 thor teamd_myteam0[16354]: Callback named "lw_socket" not found.
Jan 8 13:48:25 thor kernel: [258034.039534] myteam0: Mode changed to "activebackup"
Jan 8 13:48:25 thor kernel: [258034.040080] atl1c 0000:03:00.0: irq 29 for MSI/MSI-X
Jan 8 13:48:25 thor kernel: [258034.040274] atl1c 0000:03:00.0: atl1c: eth0 NIC Link is Up<100 Mbps Full Duplex>
Jan 8 13:48:25 thor kernel: [258034.040492] myteam0: Port device eth0 added
Jan 8 13:48:25 thor kernel: [258034.176230] r8169 0000:04:00.0 eth1: link down
Jan 8 13:48:25 thor kernel: [258034.176242] r8169 0000:04:00.0 eth1: link down
Jan 8 13:48:25 thor kernel: [258034.176286] myteam0: Port device eth1 added
Jan 8 13:48:25 thor kernel: [258034.176969] show_signal_msg: 22 callbacks suppressed
Jan 8 13:48:25 thor kernel: [258034.176972] teamd[16354]: segfault at 8 ip 000000000040b1fd sp 00007fff3cdf9780 error 4 in teamd[400000+1d000]
Jan 8 13:48:26 thor kernel: [258036.309193] r8169 0000:04:00.0 eth1: link up

On Mon, Jan 5, 2015 at 2:52 PM, PanaColina <panama.hill(a)gmail.com> wrote:
> Okay, I believe I did the strace correctly. Log file attached.
>
> Again, I very much appreciate the assistance,
>
> J.L. Hill
>
>
>
> On Mon, Jan 5, 2015 at 1:44 PM, Flavio Leitner <fbl(a)sysclose.org> wrote:
>> On Monday, January 05, 2015 01:15:06 PM PanaColina wrote:
>>> Thank you again. Unfortunately, it seems I am misunderstanding.
>>>
>>> When you say that both of my ports do not support changing the HW
>>> address, I assume that means my NICs do not allow their MAC addresses
>>> to be changed. However, it seems I can change their addresses via the
>>> ip command with no problem (see below).
>>>
>>> I guess I need details on what NIC card requirements teamd has. I can
>>> buy new cards, if I know what I'm looking for.
>>
>> teamd should be able to change HW too.
>> Could you put somewhere the output of strace command?
>> # strace -f -s 1024 -o /tmp/teamd.log -p <pid of teamd>
>>
>> Run that command and then:
>> # teamdctl myteam0 port add eth0
>>
>> after that stop the strace with ctrl+c and you should have
>> the /tmp/teamd.log with traces inside.
>>
>> That could give us a clue.
>>
>> fbl
>>
>>
>>
>>> Thank you for your patience,
>>>
>>> J.L. Hill
>>> ---------------------
>>>
>>> # ip add ls eth0
>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
>>> state UP group default qlen 1000
>>> link/ether 50:e5:49:ec:d6:3f brd ff:ff:ff:ff:ff:ff
>>> inet 192.168.0.101/24 brd 192.168.0.255 scope global eth0
>>> valid_lft forever preferred_lft forever
>>>
>>> # ifdown eth0
>>> Killed old client process
>>> ...
>>> Listening on LPF/eth0/50:e5:49:ec:d6:3f
>>> Sending on LPF/eth0/50:e5:49:ec:d6:3f
>>> Sending on Socket/fallback
>>> DHCPRELEASE on eth0 to 192.168.0.1 port 67
>>>
>>> # ip link set eth0 address 02:01:02:03:04:08
>>> # ip link ls eth0
>>> 2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN
>>> mode DEFAULT group default qlen 1000
>>> link/ether 02:01:02:03:04:08 brd ff:ff:ff:ff:ff:ff
>>> root@thor:/home/jh# ip addr ls eth0
>>> 2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN
>>> group default qlen 1000
>>> link/ether 02:01:02:03:04:08 brd ff:ff:ff:ff:ff:ff
>>>
>>> On Mon, Jan 5, 2015 at 12:14 PM, Flavio Leitner <fbl(a)sysclose.org> wrote:
>>> > On Monday, January 05, 2015 10:24:13 AM PanaColina wrote:
>>> >> Thank you for your response -- testing with activebackup pointed
>>> >> towards one problem:
>>> >>
>>> >> Jan 5 08:42:04 thor teamd_myteam0[5549]: eth0: Got link watch from port config.
>>> >> Jan 5 08:42:04 thor teamd_myteam0[5549]: eth0: Using sticky "0".
>>> >> Jan 5 08:42:04 thor teamd_myteam0[5549]: eth0: Failed to set port
>>> >> hardware address.
>>> >
>>> > Either roundrobin or activebackup modes requires changing the HW
>>> > address for all ports. If that fails, the mode doesn't work.
>>> >
>>> > It seems both your ports don't support that, unfortunately.
>>> >
>>> > fbl
>>
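
Since the thread hinges on the "Failed to set port hardware address" message, it may help to pull just the affected port names out of a longer syslog. This is a rough sketch, assuming the teamd log lines have the same shape as the ones quoted in this thread. The function name hwaddr_failures is made up for the illustration.

```shell
#!/bin/sh
# Hypothetical helper: read syslog text on stdin and print, once each,
# the port names for which teamd logged
# "Failed to set port hardware address".
# The pattern assumes lines shaped like the ones quoted in this thread:
#   <date> <host> teamd_<dev>[<pid>]: <port>: Failed to set port hardware address.
hwaddr_failures() {
    sed -n 's/.*teamd[^:]*: \([^: ]*\): Failed to set port hardware address.*/\1/p' |
        sort -u
}
```

For example, `hwaddr_failures < /var/log/syslog` would list the ports to look at first; the log path is an assumption and depends on the local syslog setup.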


Setting default gateway?

by PanaColina

Setting up the team driver on Debian Sid seems straightforward, and the links seem to go up without error, but I cannot ping my router IP address or otherwise connect outside the box. I am no expert in networking, and am guessing the error may be obvious to someone on the list, or someone might be able to point me in the right direction to search for a solution.

To me, it seems this may be a gateway issue as, after setting up the team driver, route -n gives:

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.0.0     0.0.0.0         255.255.255.0   U     0      0        0 myteam0

If I set the IP address of the router connected to the NIC on eth0 as the gateway, route -n gives:

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.0.1     0.0.0.0         UG    0      0        0 myteam0
192.168.0.0     0.0.0.0         255.255.255.0   U     0      0        0 myteam0

But pinging 192.168.0.1 results in "Destination Host Unreachable". If I try setting a default gateway using myteam0 or eth0, the route command returns "Unknown host".

After my initial day of failures, I commented out the eth0 and eth1 stanzas in the /etc/network/interfaces file and purged network-manager to avoid potential problems. In the original configuration, the router connected to eth0 assigns eth0 the IP address 192.168.0.101 through DHCP.

I tested initially with the Debian Sid libteam packages (v.12) and then built v.15 from libteam-master.zip source.
I have done most of my testing with the basic commands:

teamd -g -t myteam0 -d
teamdctl myteam0 port add eth0
teamdctl myteam0 port add eth1
ip addr add 192.168.0.101/24 dev myteam0
ip link set myteam0 up

Before running the above commands to start teaming, ip link shows:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 50:e5:49:ec:d6:3f brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether e8:de:27:06:85:d5 brd ff:ff:ff:ff:ff:ff

After running the teaming commands, ip link shows:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master myteam0 state UP mode DEFAULT group default qlen 1000
    link/ether 9e:a4:72:48:1d:f7 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master myteam0 state UP mode DEFAULT group default qlen 1000
    link/ether 50:e5:49:ec:d6:3f brd ff:ff:ff:ff:ff:ff
4: myteam0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 50:e5:49:ec:d6:3f brd ff:ff:ff:ff:ff:ff

and ip addr shows:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master myteam0 state UP group default qlen 1000
    link/ether 9e:a4:72:48:1d:f7 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master myteam0 state UP group default qlen 1000
    link/ether 50:e5:49:ec:d6:3f brd ff:ff:ff:ff:ff:ff
4: myteam0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 50:e5:49:ec:d6:3f brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.101/24 scope global myteam0
       valid_lft forever preferred_lft forever

I have looked at syslog and see no errors or warnings. It seems everything is fine as syslog shows:

Jan 4 17:09:56 thor avahi-daemon[3647]: Joining mDNS multicast group on interface myteam0.IPv4 with address 192.168.0.101.
Jan 4 17:09:56 thor avahi-daemon[3647]: New relevant interface myteam0.IPv4 for mDNS.
Jan 4 17:09:56 thor avahi-daemon[3647]: Registering new address record for 192.168.0.101 on myteam0.IPv4.
Jan 4 17:09:57 thor ntpd[3708]: Listen normally on 2 myteam0 192.168.0.101 UDP 123

teamdctl myteam0 config dump

{
    "device": "myteam0",
    "ports": {
        "eth0": {
            "link_watch": {
                "name": "ethtool"
            }
        },
        "eth1": {
            "link_watch": {
                "name": "ethtool"
            }
        }
    },
    "runner": {
        "name": "roundrobin"
    }
}

teamdctl myteam0 state view

setup:
  runner: roundrobin
ports:
  eth1
    link watches:
      link summary: up
      instance[link_watch_0]:
        name: ethtool
        link: up
  eth0
    link watches:
      link summary: up
      instance[link_watch_0]:
        name: ethtool
        link: up

I am running kernel 3.18.1 x86_64 with the .config:

CONFIG_NET_TEAM=y
CONFIG_NET_TEAM_MODE_BROADCAST=m
CONFIG_NET_TEAM_MODE_ROUNDROBIN=m
CONFIG_NET_TEAM_MODE_RANDOM=m
CONFIG_NET_TEAM_MODE_ACTIVEBACKUP=m
CONFIG_NET_TEAM_MODE_LOADBALANCE=m

Any suggestions appreciated,

J.L. Hill
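
For reference, the command sequence from this message can be written down together with an iproute2 default route, which takes the router address after "via" and the device name after "dev". The sketch below only prints the commands so they can be reviewed before being run as root. The function name team_setup_commands is made up for the illustration, and the route line shows syntax only; it does not explain why the router was unreachable.

```shell
#!/bin/sh
# Hypothetical helper: print (do not execute) the setup commands used in
# this message, followed by a default route through the team device.
# Usage: team_setup_commands TEAM ADDR/PREFIX GATEWAY PORT...
team_setup_commands() {
    team=$1
    addr=$2
    gw=$3
    shift 3
    echo "teamd -g -t $team -d"
    for port in "$@"; do
        echo "teamdctl $team port add $port"
    done
    echo "ip addr add $addr dev $team"
    echo "ip link set $team up"
    # iproute2 form of a default route: gateway IP after "via",
    # outgoing interface after "dev".
    echo "ip route add default via $gw dev $team"
}
```

Calling `team_setup_commands myteam0 192.168.0.101/24 192.168.0.1 eth0 eth1` prints the five commands from the message plus the route line; the output could be piped to `sh` once it looks right.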

