If failed_to_recv is set (the node has detected that it is not able to
receive messages), we can end up hitting an assert, because my_failed_list
and my_member_list are the same list. This happens because we do not
follow the specification and allow a node to mark itself as failed.

When failed_to_recv is set and consensus has been reached across nodes,
a single-node membership is created anyway (ignoring both the failed list
and the member list), so the assert can be skipped.

Signed-off-by: Jan Friesse <jfriesse@xxxxxxxxxx>
---
 exec/totemsrp.c |   15 +++++++++++++++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/exec/totemsrp.c b/exec/totemsrp.c
index ec951df..a4cc19a 100644
--- a/exec/totemsrp.c
+++ b/exec/totemsrp.c
@@ -1247,6 +1247,16 @@ static int memb_consensus_agreed (
 			break;
 		}
 	}
+
+	if (agreed && instance->failed_to_recv == 1) {
+		/*
+		 * Both nodes agreed on our failure. We don't care how many proc list items left because we
+		 * will create single ring anyway.
+		 */
+
+		return (agreed);
+	}
+
 	assert (token_memb_entries >= 1);
 
 	return (agreed);
@@ -3617,6 +3627,11 @@ printf ("token seq %d\n", token->seq);
 				instance->my_aru_count = 0;
 			}
 
+			/*
+			 * We really don't follow specification there. In specification, OTHER nodes
+			 * detect failure of one node (based on aru_count) and my_id IS NEVER added
+			 * to failed list (so node never mark itself as failed)
+			 */
 			if (instance->my_aru_count > instance->totem_config->fail_to_recv_const &&
 				token->aru_addr == instance->my_id.addr[0].nodeid) {
 
-- 
1.7.1
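
For readers who want to see why the assert can fire, here is a minimal standalone sketch of the consensus check described above. It is a simplified model, not the code in exec/totemsrp.c: the struct sketch_instance, the helper sketch_contains, the function sketch_consensus_agreed and the SKETCH_MAX_NODES limit are all hypothetical names invented for this illustration, and node ids are plain integers. The idea is that the token membership is the proc list minus the failed list; if a node has put every entry (including itself) on the failed list, that set is empty, "agreed" stays vacuously true, and only the early return keeps the assert from triggering.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_NODES 16 /* hypothetical limit, for this sketch only */

/* Hypothetical, simplified model of the membership state (not the real struct). */
struct sketch_instance {
	int failed_to_recv;                            /* 1 if this node cannot receive */
	unsigned int proc_list[SKETCH_MAX_NODES];      /* all known nodes */
	size_t proc_entries;
	unsigned int failed_list[SKETCH_MAX_NODES];    /* nodes considered failed */
	size_t failed_entries;
	unsigned int consensus_list[SKETCH_MAX_NODES]; /* nodes that reached consensus */
	size_t consensus_entries;
};

/* Returns 1 if id is present in list, 0 otherwise. */
static int sketch_contains(const unsigned int *list, size_t n, unsigned int id)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (list[i] == id) {
			return 1;
		}
	}
	return 0;
}

/* Returns 1 when every node in (proc_list minus failed_list) reached consensus. */
static int sketch_consensus_agreed(const struct sketch_instance *inst)
{
	size_t token_memb_entries = 0;
	int agreed = 1;
	size_t i;

	for (i = 0; i < inst->proc_entries; i++) {
		unsigned int id = inst->proc_list[i];

		if (sketch_contains(inst->failed_list, inst->failed_entries, id)) {
			continue; /* failed nodes are not part of the token membership */
		}
		token_memb_entries++;
		if (!sketch_contains(inst->consensus_list, inst->consensus_entries, id)) {
			agreed = 0;
		}
	}

	/*
	 * Early return modelled on the patch: when we failed to receive, a
	 * single-node ring is formed anyway, so an empty membership is fine.
	 */
	if (agreed && inst->failed_to_recv == 1) {
		return agreed;
	}

	assert(token_memb_entries >= 1);
	return agreed;
}
```

In this model, calling sketch_consensus_agreed() with failed_to_recv set to 1 and identical proc and failed lists returns 1 instead of aborting; removing the early return would make the same call trip the assert.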