Hi All,
When the node is bound to 127.0.0.1 because the last interface went down, and timer_function_netif_check_timeout() kicks in before we reach consensus, the netif_determine() call inside timer_function_netif_check_timeout() resets instance->totem_interface->boundto back to instance->totem_interface->bindnet, say 192.168.1.1. As a result, after the single-node ring is created, anyone doing something like api->totem_iface_get(api->totem_nodeid_get()...) will fail, because api->totem_nodeid_get() returns 192.168.1.1, which is in neither my_memb_list nor my_left_memb_list. In my environment this failure directly causes a segmentation fault when my_cluster_node_load() is called in CLM.
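For what it's worth, here is a minimal model of why the lookup fails. The names and layouts below are simplified stand-ins of mine (the real code is totemsrp_ifaces_get() with struct srp_addr entries), but the contract, returning -1 when the nodeid is in neither list, matches what I observe:

/* Simplified stand-in for the totemsrp membership lookup; not the
 * real code, just the shape of the failure. */
struct member { unsigned int nodeid; };

static struct member my_memb_list[16];
static unsigned int my_memb_entries;
static struct member my_left_memb_list[16];
static unsigned int my_left_memb_entries;

static int ifaces_get (unsigned int nodeid)
{
	unsigned int i;

	for (i = 0; i < my_memb_entries; i++) {
		if (my_memb_list[i].nodeid == nodeid) {
			return (0);
		}
	}
	for (i = 0; i < my_left_memb_entries; i++) {
		if (my_left_memb_list[i].nodeid == nodeid) {
			return (0);
		}
	}
	/*
	 * The ring was formed with 127.0.0.1, but boundto was reset to
	 * 192.168.1.1, so the nodeid CLM asks about lands here.  A caller
	 * that does not check this return value goes on to use
	 * uninitialized interface data, hence the segmentation fault.
	 */
	return (-1);
}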
This segmentation fault can be reproduced by starting aisexec (openais-1.1.4) with only the CLM service loaded, then bringing down (ifdown) the network interface configured in corosync.conf, i.e. 192.168.1.1 in the above example.
I think the straightforward way to resolve this issue is to let timer_function_netif_check_timeout() call netif_determine() with a temporary local variable instead of instance->totem_interface->boundto, so that boundto can stay at 127.0.0.1. A sketch of what I mean is below.
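Just a sketch: the real netif_determine() in totemnet.c takes the instance and more out-parameters, and the types are totem_ip_address, so treat the names below as placeholders for the idea:

struct ip_addr { unsigned char addr[16]; };

struct iface {
	struct ip_addr bindnet;
	struct ip_addr boundto;
};

/* placeholder for the real netif_determine(): reports the address it
 * would bind to and whether the interface is actually up */
static void netif_determine (const struct ip_addr *bindnet,
	struct ip_addr *bound_to, int *interface_up)
{
	*bound_to = *bindnet;	/* reports bindnet even while the link is down */
	*interface_up = 0;
}

static void timer_function_netif_check_timeout (struct iface *iface)
{
	struct ip_addr bound_to_tmp;	/* temporary, NOT iface->boundto */
	int interface_up;

	netif_determine (&iface->bindnet, &bound_to_tmp, &interface_up);

	if (interface_up) {
		/* commit the new address only when we really rebind */
		iface->boundto = bound_to_tmp;
	}
	/* otherwise boundto stays 127.0.0.1, and totem_iface_get() keeps
	 * resolving to an address that is actually in the membership lists */
}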
Moreover, by digging into this issue more deeply, I seem to have found the reason why timer_function_netif_check_timeout() can kick in before consensus is reached: in memb_state_gather_enter(), we have two members in my_proc_list when we try to build the single-node ring. One is 127.0.0.1, the other is 192.168.1.1! We cannot expect to receive a JOIN message from 192.168.1.1; it is a redundant entry in this case, and it is what makes consensus time out. So I think it should be filtered out by memb_state_gather_enter() in this case.
So I tried filtering 192.168.1.1 out of my_proc_list in that case, and consensus was then reached quickly, without timing out. The new ring was built so quickly that my_cluster_node_load() in CLM completed correctly this time, before timer_function_netif_check_timeout() kicked in. So whether or not timer_function_netif_check_timeout() resets instance->totem_interface->boundto, the segmentation fault has no opportunity to happen. A sketch of the filtering is below.
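The filter itself can be as simple as the following. Again a stand-in sketch: in totemsrp.c the list holds struct srp_addr and the comparison would be srp_addr_equal() against my_id; a plain nodeid stands in for both here:

/* Drop every my_proc_list entry that is not ourselves; to be applied
 * in memb_state_gather_enter() only when we are bound to 127.0.0.1. */
static unsigned int proc_list_filter_self (unsigned int *proc_list,
	unsigned int entries, unsigned int my_nodeid)
{
	unsigned int i;
	unsigned int kept = 0;

	for (i = 0; i < entries; i++) {
		if (proc_list[i] == my_nodeid) {
			proc_list[kept] = proc_list[i];
			kept += 1;
		}
		/* the stale pre-ifdown entry (192.168.1.1 above) can never
		 * send us a JOIN, so it is dropped */
	}
	return (kept);	/* the new my_proc_list_entries */
}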
Please take a look at these issues.