When there are multiple node joins close together some nodes can be kicked out of the CPG groups

Hi,

I am seeing a situation where, when 10 nodes join the cluster at the same time, there can be a temporary split in the cluster before it recovers and all nodes are joined. This happens in totem.
A join is seen, and the nodes move to GATHER and eventually RECOVERY.  During RECOVERY another node attempts to join; some of the nodes have already moved to OPERATIONAL, but not all.  When one of the OPERATIONAL nodes sees the new node's join, it sends its own join to all nodes.  When the nodes that are still in RECOVERY see the join from an OPERATIONAL member, they move back to GATHER and then RECOVERY.  When they do this, the token information has their ringids set to the previous ringid, while the nodes that went to OPERATIONAL have the new ringid.  This causes the transitional membership to be calculated such that the two parts of the cluster are split.
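
To make the split concrete, here is a minimal sketch of the effect, assuming the transitional membership of a node is simply the set of nodes that carry the same previous ringid. The node IDs, ringid values and the function name are hypothetical, and this is a toy model rather than the totemsrp code itself.

```python
from collections import defaultdict


def transitional_memberships(old_ring_ids):
    """Map each node to the set of nodes sharing its previous ringid.

    old_ring_ids: dict of node id -> ringid the node reports as its old ring.
    This is a simplified model (an assumption), not corosync's implementation.
    """
    groups = defaultdict(set)
    for node, ring_id in old_ring_ids.items():
        groups[ring_id].add(node)
    return {node: frozenset(groups[ring_id])
            for node, ring_id in old_ring_ids.items()}


# Hypothetical scenario: nodes 1-3 already reached OPERATIONAL on ringid 8,
# nodes 4-5 were still in RECOVERY and still report the previous ringid 4.
old_ring_ids = {1: 8, 2: 8, 3: 8, 4: 4, 5: 4}
trans = transitional_memberships(old_ring_ids)

for node in sorted(trans):
    print(node, sorted(trans[node]))
```

In this model nodes 1-3 and nodes 4-5 end up in two separate transitional memberships, even though all five are members of the ring that forms next.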

This does seem to be correct behaviour according to the Totem specification.  Am I correct about this?

The problem I have is that, once the *split* has happened, CPG decides that the nodes have left (by choosing the downlist that has 1 or more nodes having left) and therefore removes the process information for those nodes.  So if the nodes want to stay joined to the CPG group, they have to join it again after a *split* occurs, even though they were still part of the totem topology.
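
The consequence for the application can be sketched as follows, assuming CPG tracks group members as (nodeid, pid) pairs and drops every entry whose node appears in the chosen downlist. The function names and the IDs are hypothetical; this only illustrates the behaviour described above and is not the CPG service code.

```python
def apply_downlist(cpg_members, downlist):
    """Return the group members left after removing nodes in the downlist.

    cpg_members: set of (nodeid, pid) pairs currently joined to the group.
    downlist: set of node ids that CPG believes have left.
    """
    return {(nodeid, pid) for (nodeid, pid) in cpg_members
            if nodeid not in downlist}


def needs_rejoin(local_member, cpg_members):
    """True if the local process is no longer listed in the group."""
    return local_member not in cpg_members


# Hypothetical group: one process each on nodes 1, 2 and 4.
members = {(1, 100), (2, 200), (4, 400)}

# Nodes 4 and 5 land in the downlist because of the temporary split.
after_split = apply_downlist(members, downlist={4, 5})

# The process on node 4 is still running and still in the totem ring,
# but it has been removed from the group and would have to join again.
print(needs_rejoin((4, 400), after_split))
```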

I am running corosync 1.3.4.  Should the application have to redo a CPG join after it gets kicked out?

Thanks,
John
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss