Strange behavior of timer_function_netif_check_timeout()

jason <huzhijiang@xxxxxxxxx> · Fri, 14 Dec 2012 17:28:39 +0800

Hi All,

When a node binds to 127.0.0.1 because its last interface went down, and timer_function_netif_check_timeout() fires before we reach consensus, the netif_determine() call inside timer_function_netif_check_timeout() resets instance->totem_interface->boundto back to instance->totem_interface->bindnet, say 192.168.1.1. As a result, after the single-node ring is created, any caller doing something like api->totem_iface_get(api->totem_nodeid_get()...) will fail, because api->totem_nodeid_get() returns 192.168.1.1, which is in neither my_memb_entries nor my_left_memb_list. In my environment this failure directly causes a segmentation fault when my_cluster_node_load() is called in CLM.
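
To make the failure concrete, here is a minimal sketch of the lookup that goes wrong. It is not the corosync code: struct member and member_find() are hypothetical stand-ins for the membership list and the search that totem_iface_get() has to perform. The point is only that a lookup for 192.168.1.1 cannot succeed when the ring contains just 127.0.0.1, so a caller that does not check the result ends up working with invalid data.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical membership entry: one textual address per ring member. */
struct member {
    char addr[46];
};

/*
 * Hypothetical lookup, standing in for the search behind totem_iface_get().
 * Returns the index of addr in list, or -1 when addr is not a member.
 */
static int member_find(const struct member *list, size_t count, const char *addr)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(list[i].addr, addr) == 0) {
            return (int)i;
        }
    }
    return -1; /* callers must check for this before using the result */
}
```

With a single-node ring holding only 127.0.0.1, member_find() returns 0 for 127.0.0.1 and -1 for 192.168.1.1, which corresponds to the case my_cluster_node_load() does not survive.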

This segmentation fault can be reproduced by starting aisexec (openais-1.1.4) with only the CLM service loaded, then running ifdown on the network interface configured in corosync.conf, i.e. the one with 192.168.1.1 in the example above.
I think the straightforward way to resolve this issue is to have timer_function_netif_check_timeout() call netif_determine() with a temporary local variable instead of instance->totem_interface->boundto, so that boundto stays at 127.0.0.1.
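
A rough sketch of that idea follows. The types and functions here are simplified assumptions, not the real totem structures or the real netif_determine() signature: probe_interface() only models the behavior described above (it always writes the configured address into its output parameter), and check_timeout() stands in for timer_function_netif_check_timeout().

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the totem address type. */
struct addr {
    char text[46];
};

/* Hypothetical, trimmed-down interface state. */
struct iface {
    struct addr bindnet; /* address configured in corosync.conf */
    struct addr boundto; /* address the node is currently bound to */
};

/*
 * Hypothetical model of netif_determine(): it writes the configured
 * address into *bound_to and reports whether the interface is up.
 */
static bool probe_interface(const struct addr *bindnet, bool link_up,
                            struct addr *bound_to)
{
    *bound_to = *bindnet;
    return link_up;
}

/*
 * Periodic check: probe into a temporary so that the live boundto is
 * only updated once the interface is really back up.
 */
static void check_timeout(struct iface *iface, bool link_up)
{
    struct addr probed; /* temporary, keeps boundto untouched */

    if (probe_interface(&iface->bindnet, link_up, &probed)) {
        iface->boundto = probed;
    }
}
```

In this sketch, calling check_timeout() while the link is down leaves boundto at 127.0.0.1, and only a call with the link up switches it to the configured address. The real function does more than this (it also reacts to state changes), so treat the sketch as an outline of the temporary-variable idea only.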

Moreover, after digging deeper, I think I found why timer_function_netif_check_timeout() can fire before consensus is reached: in memb_state_gather_enter(), my_proc_list contains two members when we try to build the single-node ring. One is 127.0.0.1 and the other is 192.168.1.1! We cannot expect to receive a JOIN message from 192.168.1.1. It is a redundant entry in this case, and it makes consensus time out. So I think memb_state_gather_enter() should filter it out in this case.

So I tried filtering 192.168.1.1 out of my_proc_list in that case. Consensus was then reached quickly, without timing out, and the new ring was built so fast that my_cluster_node_load() in CLM completed correctly before timer_function_netif_check_timeout() fired. So whether or not timer_function_netif_check_timeout() resets instance->totem_interface->boundto, the segmentation fault has no opportunity to happen.
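
The effect of the stale entry can be sketched with a deliberately simplified model. Everything below is an assumption for illustration: struct proc, consensus_reached() and proc_list_remove() are hypothetical, and the real consensus logic in totemsrp tracks more state than a single flag per process.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical process-list entry. */
struct proc {
    char addr[46];
    bool join_received; /* has a JOIN message arrived from this address? */
};

/* Simplified model: consensus needs a JOIN from every listed process. */
static bool consensus_reached(const struct proc *list, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (!list[i].join_received) {
            return false;
        }
    }
    return true;
}

/* Remove every entry matching addr in place and return the new count. */
static size_t proc_list_remove(struct proc *list, size_t count,
                               const char *addr)
{
    size_t kept = 0;

    for (size_t i = 0; i < count; i++) {
        if (strcmp(list[i].addr, addr) != 0) {
            list[kept++] = list[i];
        }
    }
    return kept;
}
```

With a list holding 127.0.0.1 (JOIN received) and 192.168.1.1 (no JOIN possible), consensus_reached() stays false until the timeout. After proc_list_remove() drops 192.168.1.1, the one remaining entry satisfies the check immediately, which matches what I observed.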

Please take a look at these issues. 