Hi,

We built a three-node cluster and repeatedly started and stopped one of the nodes. The test script is:

    #!/bin/sh

    LOOP_COUNT=1000

    while [ $LOOP_COUNT -gt 0 ];
    do
        # POSIX arithmetic instead of the bash-only "let"
        LOOP_COUNT=$((LOOP_COUNT - 1))
        echo "test No. $((1000 - LOOP_COUNT))"
        rcopenais start
        sleep 30
        rcopenais stop
        sleep 10
    done

The error log looks like this:

    Apr 3 11:35:56 hex-3 ocfs2_controld[3623]: Unable to open checkpoint "ocfs2:controld": Object does not exist
    Apr 3 11:35:56 hex-3 ocfs2_controld[3623]: Unable to open checkpoint "ocfs2:controld": Object does not exist

After this error first appears, it repeats several times and then leads to the node being fenced.

After some analysis, we think there is a race condition between corosync and the openais CKPT service, so we put together a patch that avoids the problem. The patch is included below. Any review is highly appreciated.

Thanks

Index: openais-1.1.4/services/ckpt.c
===================================================================
--- openais-1.1.4.orig/services/ckpt.c
+++ openais-1.1.4/services/ckpt.c
@@ -776,14 +776,17 @@ static void ckpt_confchg_fn (
 	unsigned int i, j;
 	unsigned int lowest_nodeid;
+	if (!memcmp (&my_saved_ring_id, ring_id,sizeof (struct memb_ring_id))) {
+		if (my_sync_state != SYNC_STATE_NOT_STARTED) {
+			return;
+		}
+	}
+	if (configuration_type != TOTEM_CONFIGURATION_REGULAR) {
+		return;
+	}
+
 	memcpy (&my_saved_ring_id, ring_id, sizeof (struct memb_ring_id));
-	if (configuration_type != TOTEM_CONFIGURATION_REGULAR) {
-		return;
-	}
-	if (my_sync_state != SYNC_STATE_NOT_STARTED) {
-		return;
-	}
 	my_sync_state = SYNC_STATE_STARTED;
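
To make the change in control flow easier to review, here is a minimal standalone sketch of the guard logic before and after the patch. It is only a model, not the real ckpt.c code: the `*_model` types, their fields, and the functions `confchg_old` and `confchg_new` are hypothetical stand-ins, and everything that follows `my_sync_state = SYNC_STATE_STARTED` in the real function is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Simplified stand-ins for the real types in services/ckpt.c (assumptions). */
struct ring_id_model {
	unsigned int rep;	/* hypothetical field */
	unsigned int seq;	/* hypothetical field */
};

enum config_type_model { CONFIG_REGULAR, CONFIG_TRANSITIONAL };
enum sync_state_model { SYNC_NOT_STARTED, SYNC_STARTED };

struct ckpt_model {
	struct ring_id_model saved_ring_id;
	enum sync_state_model sync_state;
};

/* Pre-patch ordering: the ring id is saved first, then either check can
 * return early. Returns true when synchronization would be started. */
static bool confchg_old (struct ckpt_model *m, enum config_type_model type,
	const struct ring_id_model *ring_id)
{
	assert (m != NULL && ring_id != NULL);
	memcpy (&m->saved_ring_id, ring_id, sizeof (*ring_id));
	if (type != CONFIG_REGULAR) {
		return false;
	}
	if (m->sync_state != SYNC_NOT_STARTED) {
		return false;
	}
	m->sync_state = SYNC_STARTED;
	return true;
}

/* Patched ordering: the sync-state check applies only when the ring id is
 * unchanged, and the ring id is saved only after both checks pass. */
static bool confchg_new (struct ckpt_model *m, enum config_type_model type,
	const struct ring_id_model *ring_id)
{
	assert (m != NULL && ring_id != NULL);
	if (!memcmp (&m->saved_ring_id, ring_id, sizeof (*ring_id))) {
		if (m->sync_state != SYNC_NOT_STARTED) {
			return false;
		}
	}
	if (type != CONFIG_REGULAR) {
		return false;
	}
	memcpy (&m->saved_ring_id, ring_id, sizeof (*ring_id));
	m->sync_state = SYNC_STARTED;
	return true;
}
```

In this model, a regular configuration change that carries a new ring id while the sync state is still "started" is ignored by the old ordering but handled by the patched one. A transitional configuration change no longer overwrites the saved ring id.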