[RFC PATCH] Resolve an abnormal exit when consensus timeout expired.

From: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>

In our 20-node cluster testing (corosync v1.4.1 with sheepdog), an abnormal
exit occurs when the consensus timeout expires and there are no other
processors in consensus_list.

= analysis =
1. When the consensus timeout expires, corosync enters the
   memb_state_consensus_timeout_expired function.

2. If consensus_list contains only my_id, the code calls
   memb_set_merge, which makes my_failed_list equal to
   my_proc_list.

3. It then calls memb_state_gather_enter and multicasts a join message
   whose proc_list and failed_list contain the same processor IDs.

4. The node receives its own join message, and
   memb_join_process/memb_consensus_agreed is called.

5. Because proc_list equals failed_list in the join message,
   the assert is triggered and corosync exits abnormally.
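
The failing condition in step 5 can be illustrated with a small standalone
sketch. It is a simplified model, not the corosync code: processors are plain
ints instead of struct srp_addr, the helper names are hypothetical, and it
assumes the assertion requires that proc_list minus failed_list is non-empty.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a processor is an int, not a struct srp_addr. */

/* Return 1 if id is present in set, 0 otherwise. */
static int set_contains(const int *set, size_t n, int id)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (set[i] == id) {
			return 1;
		}
	}
	return 0;
}

/*
 * Count the entries of proc that are not in failed.
 * Assumption: the real assert needs this count to be at least 1.
 */
static size_t surviving_members(const int *proc, size_t proc_n,
	const int *failed, size_t failed_n)
{
	size_t i;
	size_t count = 0;

	for (i = 0; i < proc_n; i++) {
		if (!set_contains(failed, failed_n, proc[i])) {
			count++;
		}
	}
	return count;
}
```

When failed holds the same IDs as proc, surviving_members returns 0, which is
the situation that would trip the assert under the assumption above.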

= solution =
This patch tries to resolve the issue by removing my_id from
my_failed_list before corosync calls memb_state_gather_enter from
memb_state_consensus_timeout_expired.

When a network partition occurs and the processor cannot communicate with
any other processor, it forms a single ring containing only itself rather
than triggering an abnormal exit.
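
The intended behaviour can be sketched in isolation as follows. This is again
a simplified model with ints in place of struct srp_addr, and set_remove_id
and keep_self_alive are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: delete the first occurrence of id from set,
 * shifting later entries down. Returns the new entry count.
 */
static size_t set_remove_id(int *set, size_t n, int id)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (set[i] == id) {
			for (; i + 1 < n; i++) {
				set[i] = set[i + 1];
			}
			return n - 1;
		}
	}
	return n;
}

/*
 * If every known processor is marked failed, take ourselves back out
 * of the failed list so that a ring containing only my_id can form.
 * Returns the resulting number of failed entries.
 */
static size_t keep_self_alive(int my_id, size_t proc_n,
	int *failed, size_t failed_n)
{
	if (proc_n == failed_n) {
		failed_n = set_remove_id(failed, failed_n, my_id);
	}
	return failed_n;
}
```

The failed list is left untouched whenever it is shorter than the proc list,
so the normal case where some other processors are still reachable is not
affected.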

Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
---
 exec/totemsrp.c |   33 +++++++++++++++++++++++++++++++++
 1 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/exec/totemsrp.c b/exec/totemsrp.c
index 0778d55..653f801 100644
--- a/exec/totemsrp.c
+++ b/exec/totemsrp.c
@@ -1324,6 +1324,33 @@ static void memb_set_merge (
 	return;
 }
 
+/*
+ * remove subset from fullset
+ */
+static void memb_set_remove(
+	const struct srp_addr *subset, int subset_entries,
+	struct srp_addr *fullset, int *fullset_entries)
+{
+	int found;
+	int i;
+	int j;
+
+	for (i = 0; i < subset_entries; i++) {
+		for (found = 0, j = 0; j < *fullset_entries; j++) {
+			if (srp_addr_equal (&fullset[j], &subset[i])) {
+				found = 1;
+				break;
+			}
+		}
+		if (found == 1) {
+			for (; j < (*fullset_entries-1); j++) {
+				srp_addr_copy (&fullset[j], &fullset[j+1]);
+			}
+			*fullset_entries = *fullset_entries - 1;
+		}
+	}
+}
+
 static void memb_set_and_with_ring_id (
 	struct srp_addr *set1,
 	struct memb_ring_id *set1_ring_ids,
@@ -1541,6 +1568,12 @@ static void memb_state_consensus_timeout_expired (
 
 		memb_set_merge (no_consensus_list, no_consensus_list_entries,
 			instance->my_failed_list, &instance->my_failed_list_entries);
+
+		if (instance->my_proc_list_entries == instance->my_failed_list_entries) {
+			memb_set_remove (&instance->my_id, 1,
+				instance->my_failed_list, &instance->my_failed_list_entries);
+		}
+
 		memb_state_gather_enter (instance, 0);
 	}
 }
-- 
1.7.6.4
