[PATCH v2] [TotemSRP] Ignore duplicated commit tokens in recovery mode

jason <huzhijiang@xxxxxxxxx> · Sat, 10 Jan 2015 17:42:20 +0800

In active rrp mode, commit tokens are treated as mcast data messages,
so rrp delivers them directly to the srp layer via active_mcast_recv().
As a result, srp receives duplicated commit tokens from the different
heartbeat links. If the node is in recovery state and has already sent
out the initial orf token, those duplicated commit tokens cause
message_handler_memb_commit_token() to send the initial orf token
again. This is wrong because it resets the orf token content in
instance->orf_token_retransmit, which breaks the token retransmission
state.

Furthermore, repeatedly sending the initial orf token may cause
active_token_recv() to drop some subsequent orf tokens. That alone
would be harmless at the rrp level, because srp retransmits tokens, but
as described above the srp retransmission state is already broken. The
end result is a "token lost in recovery state" condition caused purely
by software. If the token timeout value is large, it then takes a long
time to form a new ring.

This can be reproduced with two nodes set to active rrp mode and
connected by two heartbeat links. Keep one node always running and
stop/start the other one again and again. The probability of hitting
the problem on any single restart is low. In theory, I think, the more
heartbeat links are used, the more easily it can be reproduced.
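
For reference, a reproduction setup along these lines might use a totem
section like the following in corosync.conf on both nodes. This is only
a sketch: the network and multicast addresses are made-up placeholders,
and the rest of the file (nodelist, quorum, logging) is omitted.

```
totem {
    version: 2
    # Active redundant ring mode, the mode affected by this problem
    rrp_mode: active

    # First heartbeat link (addresses are placeholders)
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0
        mcastaddr: 239.255.1.1
        mcastport: 5405
    }

    # Second heartbeat link (addresses are placeholders)
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.2.0
        mcastaddr: 239.255.2.1
        mcastport: 5405
    }
}
```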

This problem can be resolved by having
message_handler_memb_commit_token() ignore duplicated commit tokens
in recovery state if the node (the ring representative) has already
sent out the initial orf token.

Unlike the previous take, this version does not depend on stored token
data. Instead it uses originated_orf_token in totemsrp_instance to
remember whether the initial orf token has already been originated for
the current membership.
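
To illustrate the idea outside of the real code, here is a minimal,
self-contained toy model of the guard. It is not the attached patch and
not the actual totemsrp.c logic: the toy_* names, the simplified state
enum and the counter are all hypothetical. Only the flag name
originated_orf_token is taken from the description above.

```c
#include <stdbool.h>

/* Simplified stand-in for the membership states (hypothetical names). */
enum toy_state {
    TOY_STATE_GATHER,
    TOY_STATE_COMMIT,
    TOY_STATE_RECOVERY
};

/* Toy stand-in for totemsrp_instance; only the fields needed here. */
struct toy_instance {
    enum toy_state state;
    bool is_ring_rep;           /* this node is the ring representative */
    bool originated_orf_token;  /* initial orf token already sent for
                                   the current membership */
    int orf_tokens_sent;        /* demonstration counter only */
};

/* Entering recovery starts a new membership, so the flag is cleared. */
static void toy_enter_recovery(struct toy_instance *inst)
{
    inst->state = TOY_STATE_RECOVERY;
    inst->originated_orf_token = false;
}

/*
 * Toy commit token handler for the recovery state.
 * Returns true if the initial orf token was originated, false if the
 * commit token was ignored (duplicate, wrong state, or not ring rep).
 */
static bool toy_handle_commit_token(struct toy_instance *inst)
{
    if (inst->state != TOY_STATE_RECOVERY || !inst->is_ring_rep) {
        return false;
    }
    if (inst->originated_orf_token) {
        /* Duplicate from another heartbeat link: leave the token
           retransmission state untouched. */
        return false;
    }
    inst->originated_orf_token = true;
    inst->orf_tokens_sent++;
    return true;
}
```

In this model, calling toy_enter_recovery() and then
toy_handle_commit_token() twice originates the token only on the first
call. The second call, standing in for the copy of the commit token
that arrives over the other link, is ignored.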

-- 
Yours,
Jason
Attachment:
0001-PATCH-v2-TotemSRP-Ignore-duplicated-commit-tokens-in.patch
