Re: Packet loss after configuring Ethernet bonding

On 11/09/2012 11:12 PM, Zama Ques wrote:
> ----- Original Message -----
> From: Digimer <lists@xxxxxxxxxx>
> To: Zama Ques <queszama@xxxxxxxx>; linux clustering <linux-cluster@xxxxxxxxxx>
> Cc: 
> Sent: Saturday, 10 November 2012 8:24 AM
> Subject: Re:  Packet loss after configuring Ethernet bonding
> 
> On 11/09/2012 09:26 PM, Zama Ques wrote:
>> Hi All, 
>>
>> Need help resolving an issue related to implementing High Availability at the network level. I understand that this is not the right forum to ask this question, but since it is related to HA and Linux, I am asking here, and I feel somebody here will have an answer to the issues I am facing.
>>
>> I am trying to implement Ethernet bonding. Both interfaces in my server are connected to two different network switches.
>>
>> My configuration is as follows: 
>>
>> ========
>> # cat /proc/net/bonding/bond0
>>
>> Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
>>
>> Bonding Mode: adaptive load balancing
>> Primary Slave: None
>> Currently Active Slave: eth0
>> MII Status: up
>> MII Polling Interval (ms): 0
>> Up Delay (ms): 0
>> Down Delay (ms): 0
>>
>> Slave Interface: eth0
>> MII Status: up
>> Speed: 1000 Mbps
>> Duplex: full
>> Link Failure Count: 0
>> Permanent HW addr: e4:e1:5b:d0:11:10
>> Slave queue ID: 0
>>
>> Slave Interface: eth1
>> MII Status: up
>> Speed: 1000 Mbps
>> Duplex: full
>> Link Failure Count: 0
>> Permanent HW addr: e4:e1:5b:d0:11:14
>> Slave queue ID: 0
>> ------------
>> # cat /sys/class/net/bond0/bonding/mode 
>>
>>    balance-alb 6
>>
>>
>> # cat /sys/class/net/bond0/bonding/miimon
>>     0
>>
>> ============
>>
>>
>> The issue for me is that I am seeing packet loss after configuring bonding. I tried connecting both interfaces to the same switch, but am still seeing the packet loss. I also tried changing the miimon value to 100, but am still seeing the packet loss.
>>
>> What am I missing in the configuration? Any help in resolving the problem will be highly appreciated.
>>
>>
>>
>> Thanks
>> Zaman
> 
>> You didn't share any details on your configuration, but I will assume
>> you are using corosync.
> 
>> The only supported bonding mode is Active/Passive (mode=1). I've
>> personally tried all modes, out of curiosity, and all had problems. The
>> short of it is that if you need more than 1 Gbit of performance, buy
>> faster cards.
> 
>> If you are interested in what I use, it's documented here:
> 
>>   https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Network
> 
>>   I've used this setup in several production clusters and have tested
>>   failure and recovery extensively. It's proven very stable. :)
> 
>  
> Thanks, Digimer, for the quick response and for pointing me to the link. I have not yet reached cluster configuration; initially I am trying to understand Ethernet bonding before going into cluster configuration. So the only option for me in a clustered environment is to use the Active/Passive bonding mode.
> A few more clarifications are needed: Can we use other bonding modes in a non-clustered environment? I am seeing packet loss in the other modes. Also, is the restriction to mode=1 in a cluster environment a limitation of the RHEL Cluster Suite, or is it by design?
> 
> It will be great if you can clarify these queries.
> 
> Thanks in Advance
> Zaman

Corosync is the only actively developed/supported (HA) cluster
communications and membership tool. It's used on all modern distros for
clustering, and the mode=1 requirement comes from it. As such, it
doesn't matter which OS you are on; mode=1 is the only bonding mode that
will work (reliably).
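
For reference, a minimal active-backup (mode=1) setup on a RHEL-style
system looks roughly like the sketch below. The device names, IP address
and netmask are only examples; adjust them for your hardware:

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  BOOTPROTO=none
  ONBOOT=yes
  IPADDR=192.168.1.10
  NETMASK=255.255.255.0
  # mode=1 is active-backup; miimon=100 polls link state every 100 ms
  # (miimon=0, as in your output, disables link monitoring entirely)
  BONDING_OPTS="mode=1 miimon=100"

  # /etc/sysconfig/network-scripts/ifcfg-eth0
  # (ifcfg-eth1 is identical apart from DEVICE and HWADDR)
  DEVICE=eth0
  BOOTPROTO=none
  ONBOOT=yes
  MASTER=bond0
  SLAVE=yes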

The problem is that corosync needs to detect state changes quickly. It
does this using the totem protocol (which serves other purposes), which
passes a token around the nodes in the cluster. If a node is sent a
token and the token is not returned within a time-out period, it is
declared lost and a new token is dispatched. Once too many failures
occur in a row, the node is declared lost and it is ejected from the
cluster. This process is detailed in the link above under the "Concept;
Fencing" section.

With all modes other than mode=1, failure recovery and/or the
restoration of a link in the bond causes enough disruption for a node to
be declared lost. As I mentioned, this matches my experience from
testing the other modes; it isn't an arbitrary rule.
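
If you want to see this for yourself, a simple way to exercise a bond is
sketched below (eth0 and the ping target are just examples):

  # Terminal 1: watch which slave is active and the link failure counts
  watch -n1 cat /proc/net/bonding/bond0

  # Terminal 2: keep a fast ping running to another host on the network
  ping -i 0.2 192.168.1.1

  # Terminal 3: fail the active slave, then bring it back
  ip link set eth0 down
  ip link set eth0 up

With mode=1 you should see the backup slave take over with little or no
packet loss; repeating the test in the other modes is where the
disruption described above tends to show up.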

As for non-clustered traffic: the usefulness of the other bond modes
depends entirely on the traffic you are pushing over them. Personally, I
am focused on HA in clusters, so I only use mode=1, regardless of the
traffic destined for it.

digimer

ps - You will see references to "heartbeat" as a comms layer in
clustering. It's been deprecated and should not be used. Likewise,
pacemaker is the future of clustering, so it should be the resource
manager you learn/use.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

-- 
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster

