can the changing node identity returned by local_get be handled reliably?

Hi Folks!

Thank you, Christine, for a well-written test application and for leading the way with the apropos comment "NOTE: in reality we should also check the nodeid". Some comments are more easily addressed than others!

During various failure tests, a variety of conditions seem to trigger a client application of the cpg library to change its node identity. I know this has been discussed under various guises with respect to the proper way to fail a network (don't ifdown/ifup an interface). Oddly enough, however, a common dynamic reconfiguration step on a node is a 'service network restart', which tends to ifdown/ifup interfaces.

Designing applications to be resilient to common failures is often desirable, including the restart of a service (such as corosync), so I have included a slightly modified version of testcpg.c that provides such resiliency. I wonder, however, whether the changing node identity returned from cpg_local_get can be relied on across versions, or if this is aberrant or transient behaviour that might change?

Note that once the node identity has changed, if an application continues to use a group, then once the cluster is re-formed that group is isolated from its identically named counterparts on other nodes. Furthermore, there are impacts on other applications on the isolated node that share the use of that group.
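For reference, the essence of the resiliency check is sketched below (a minimal, untested sketch against the corosync 1.x cpg API; the attached file is the real version): cache the nodeid from cpg_local_get at join time, and compare the left_list entries in the confchg callback against that cached value as well as the current one.

/* Minimal sketch (not the attached testcpg.c.local_get):
 * build: gcc -o cpg_identity cpg_identity.c -lcpg */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <corosync/cpg.h>

static unsigned int start_nodeid;   /* identity cached at join time */

static void confchg_cb(cpg_handle_t handle,
        const struct cpg_name *group_name,
        const struct cpg_address *member_list, size_t member_list_entries,
        const struct cpg_address *left_list, size_t left_list_entries,
        const struct cpg_address *joined_list, size_t joined_list_entries)
{
        unsigned int now = 0;
        size_t i;

        /* cpg_local_get may now report a different identity (e.g. the
         * 127.0.0.1 nodeid, 16777343) than it did at join time. */
        cpg_local_get(handle, &now);

        for (i = 0; i < left_list_entries; i++) {
                /* Compare against the cached join-time identity as well
                 * as the current one; after the identity switch only the
                 * cached value matches. */
                if (left_list[i].pid == (uint32_t)getpid() &&
                    (left_list[i].nodeid == start_nodeid ||
                     left_list[i].nodeid == now)) {
                        printf("we left: start %u now %u\n",
                               start_nodeid, now);
                        /* finalize and re-initialize/re-join here */
                }
        }
}

static cpg_callbacks_t callbacks = {
        .cpg_confchg_fn = confchg_cb,
};

int main(void)
{
        cpg_handle_t handle;
        struct cpg_name group;

        strcpy(group.value, "GROUP");
        group.length = strlen(group.value);

        if (cpg_initialize(&handle, &callbacks) != CS_OK)
                return 1;
        cpg_local_get(handle, &start_nodeid);   /* cache before joining */
        cpg_join(handle, &group);

        for (;;)
                cpg_dispatch(handle, CS_DISPATCH_ONE);
}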

On a separate note, is there a way to change the seemingly fixed 20-30 second delay before the daemons re-join a cluster that was separated by network isolation (power cycling a switch, for example)?
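(I assume that delay is governed by the totem timers in corosync.conf, along the lines of the fragment below, but I have not confirmed which of them controls the re-merge delay; the values here are only illustrative, not recommendations.)

totem {
        version: 2
        token: 1000        # ms until a lost token is declared
        consensus: 1200    # ms to reach consensus before a new membership round
        join: 50           # ms to wait for join messages
        merge: 200         # ms before checking for a partition to re-merge when idle
}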

Note the following output showing a node's first realization that the configuration has changed, and how cpg_local_get now reports the node identity as 127.0.0.1 instead of the original value, 4.0.0.8. Tests were performed on version 1.4.2.

 dcube-8:28506 2012-04-09 14:02:06.803
ConfchgCallback: group 'GROUP'
Local node id is 127.0.0.1/100007f result 1
left node/pid 4.0.0.2/15796 reason: 3
nodes in group now 3
node/pid 4.0.0.4/2655
node/pid 4.0.0.6/4238
node/pid 4.0.0.8/28506
....

Finally, even though the reported identity is loopback, the original id is still matched thanks to the static cache taken at join time. Is there a race, however, where a network failure just after the join could change the identity before the initialization logic is complete, leaving even the modified sample program open to failure?
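If that race is real, I imagine the sketch above could guard against it by re-reading cpg_local_get once our own join is confirmed and refreshing the cache on a mismatch, along these lines (again untested):

/* Building on the sketch above: verify the cached identity on the
 * first confchg that shows our own join. */
static void verify_join_identity(cpg_handle_t handle,
        const struct cpg_address *joined_list, size_t joined_list_entries)
{
        unsigned int now = 0;
        size_t i;

        for (i = 0; i < joined_list_entries; i++) {
                if (joined_list[i].pid != (uint32_t)getpid())
                        continue;
                if (cpg_local_get(handle, &now) == CS_OK &&
                    now != start_nodeid) {
                        /* The identity moved between cpg_join() and this
                         * callback; re-cache it (or leave and rejoin)
                         * before trusting any left_list comparisons. */
                        start_nodeid = now;
                }
        }
}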

dcube-8:28506 2012-04-09 14:02:06.803
ConfchgCallback: group 'GROUP'
Local node id is 127.0.0.1/100007f result 1
left node/pid 4.0.0.8/28506 reason: 3
nodes in group now 0
We might have left the building pid 28506
We probably left the building switched identity? start nodeid 134217732 nodeid 134217732 current nodeid 16777343 pid 28506
We have left the building direct match start nodeid 134217732 nodeid 134217732 local get current nodeid 16777343 pid 28506

Perhaps the test application for the release could be updated to include appropriate testing for the nodeid?

Dan

Attachment: testcpg.c.local_get.gz
Description: GNU Zip compressed data

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
