Re: is there maximum tolerated duration of packet loss for membership changes?

Dan,

Here are my thoughts.

On 07/19/2012 11:16 AM, dan clark wrote:
> Dear gentle readers!
> 
> 1. Is there a maximum duration of packet loss between any single (or
> all) nodes that can be tolerated without impacting the membership within
> corosync?

Health checking is based on two things: the token loss timer and
seqno_unchanged_const.  If a node that originated a token does not
receive it back within the token loss timeout, the membership protocol
will be invoked and a new membership formed.  If the token rotates
seqno_unchanged_const times without delivering a required multicast
message, the membership protocol will also be invoked.  That second
condition is not based on time, but rather on the number of rotations.
The time a rotation takes depends on the ring size and the number of
pending messages being originated.  I think we calculated 5000 rotations
(ie: seqno_unchanged_const = 5000) to be about 10 seconds, or roughly
2 milliseconds per rotation on that ring.

> 2. Is there a difference between tolerating packet loss for node
> membership versus group membership changes?

If packets are lost during the membership protocol, they are resent
based upon the "join" variable: with the default setting, every 50
milliseconds.  If those messages can't get through because of switch
overload, a smaller membership may be formed, followed by re-entry into
the membership protocol.

> 3. Is there a way to adjust the duration of the loss tolerance,
> extending to some greater time?

Increase the token timeout or seqno_unchanged_const.
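
For illustration, these tuneables live in the totem section of
corosync.conf.  A minimal sketch - the values shown are examples for
discussion, not recommendations (consensus is included because it must
exceed token):

    totem {
        # time (ms) without a received token before token loss is declared
        token: 4000
        # token retransmits attempted before the token is declared lost
        token_retransmits_before_loss_const: 4
        # token rotations without progress before the membership
        # protocol is invoked
        seqno_unchanged_const: 5000
        # membership protocol join message interval (ms)
        join: 50
        # time (ms) to wait for consensus during membership formation
        consensus: 4800
    }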

> 4. During an event with loss, is there some ramification to
> communication between group members, such as EAGAIN for senders?

When the membership protocol is in progress, cpg returns CS_ERR_TRY_AGAIN.
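
A minimal sketch of how a sender might handle that, assuming an already
initialized and joined cpg handle (the 100ms backoff is an arbitrary
choice):

    #include <corosync/cpg.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Retry a cpg multicast while the membership protocol is running. */
    static cs_error_t send_with_retry(cpg_handle_t handle,
                                      const void *buf, size_t len)
    {
        struct iovec iov = { (void *)buf, len };
        cs_error_t err;

        do {
            err = cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
            if (err == CS_ERR_TRY_AGAIN)
                usleep(100000); /* back off 100ms, then retry */
        } while (err == CS_ERR_TRY_AGAIN);

        return err;
    }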

> 5. Is the loss impacted by the choice of communication (mcastaddr vs udp
> point to point)?

udpu generates more traffic and is more likely to overrun the receive
buffers of the nodes (which is generally what is responsible for packet
loss).  Overloading the kernel's output buffers can also result in
packet loss.  We make an attempt to avoid this in udpu by having a
separate fd per node - but there may have been recent changes here -
ask Honza on this topic.
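
For reference, the transport is selected in the totem section.  A
sketch of a corosync 1.4-style udpu configuration (the addresses are
placeholders; corosync 2.x declares peers in a nodelist instead):

    totem {
        transport: udpu
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.1.0
            member {
                memberaddr: 192.168.1.1
            }
            member {
                memberaddr: 192.168.1.2
            }
        }
    }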

> 6. Does the volume of communication over a group potentially impact the
> tolerance to loss or expected results from the sender/receiver of the
> group communication?

With multicast, the switch is responsible for copying multicast packets
to the other nodes.  It is possible for the switch to become overloaded
in heavy load situations, triggering lost packets.  Totem recovers from
this problem, but if seqno_unchanged_const is too low (which, at its
default of 50, it is), you could see a membership change.

> 7. Does the frequency of packet loss impact the policy (for example a
> periodically busy network causing intermittent packet loss)?

Totem corrects for packet loss by retransmitting lost packets it still
has a copy of.  Totem maintains copies of all packets until a membership
change occurs, at which point virtual synchrony guarantees are upheld
(which can permit message loss if the node that originated the message
disappears and is the only node that had a copy of it).

> 8. Once communication is re-established how long before membership
> change events allow the node back in?

That depends on your configured values, but generally it is token + 2 * consensus.
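
As a rough worked example, assuming the documented defaults of
token = 1000 ms and consensus = 1.2 * token = 1200 ms (check your
corosync.conf for the actual values):

    token + 2 * consensus = 1000 + 2 * 1200 = 3400 ms (about 3.4 seconds)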

> 9. How long until the individual members resume? (never, isolated group
> must be closed and re-opened?).
> 

token + 2 * consensus

> background:
> For high hardware availability it might be desirable to have two
> switches allowing no single point of failure.  Various switch vendors
> handle duplicity of hardware in many different ways.  For example, one
> vendor might detect the failure of a switch and then reset the remaining
> switch to take over some responsibilities resulting in an outage (link
> down) of all connections for as long as 10 seconds (I know, unbelievable
> but true!)!  Other configurations / hardware may lead to a single
> failure resulting as much as 0.5 seconds of packet loss to all messages,
> but retain link.  Internal bonds may exhibit behaviours that drop
> packets for 200-400msec.  Still other failure scenarios may drop packets
> for times reaching up to 60 msec. 
> 
> discussion:
> Is there a designed tolerance to packet loss by the group communications
> infrastructure (cpg) in the various versions (1.4.2, 2.x) and what the
> expected results are below and above that threshold for membership changes?
> 
> Thank you for considering such an extensive query...
> 

That depends on your seqno_unchanged_const.  If you're seeing membership
changes in an unexpected way, you may want to increase this tuneable to
be more tolerant.

> speculation:
> the token retransmit value defaults to 1000 with a token retransmits
> before  loss count value of 4, allowing up to 4 seconds of failure
> before triggering a membership change.  Group membership and node
> membership run in parallel, but group membership does not automatically
> reform into larger group.  The problem count threshold and timeout hold
> a tight duration (~50msec) so loss frequency greater than this value are
> considered independent events.  Senders get EAGAIN type responses once
> an internal outbound queue reach a threshold but this threshold is very
> high and may lead to large memory use by application for fast
> senders/slow receivers or faced with lossy or intermittently congested
> networks.
>

Yes, we are aware of this problem with slow delivery clients and high
memory consumption.  At some point we would like to fix it, but we are
not sure when that will happen.
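
In the meantime, a fast sender can protect itself by checking cpg's
flow control state before originating more messages.  A minimal sketch,
assuming an initialized cpg handle:

    #include <corosync/cpg.h>

    /* Returns 1 if it is reasonable to originate another message,
     * 0 if cpg flow control is enabled (the sender should wait). */
    static int ok_to_send(cpg_handle_t handle)
    {
        cpg_flow_control_state_t state;

        if (cpg_flow_control_state_get(handle, &state) != CS_OK)
            return 0; /* be conservative on error */

        return (state == CPG_FLOW_CONTROL_DISABLED);
    }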

Regards
-steve


> dan
> 


_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss

