Strange behavior of timer_function_netif_check_timeout()

Hi All,
When a node is bound to 127.0.0.1 because its last interface went down, and timer_function_netif_check_timeout() fires before consensus is reached, the netif_determine() call inside timer_function_netif_check_timeout() resets instance->totem_interface->boundto back to instance->totem_interface->bindnet, say 192.168.1.1. As a result, once the single-node ring has been created, any call such as api->totem_iface_get(api->totem_nodeid_get()...) fails, because api->totem_nodeid_get() returns 192.168.1.1, which is in neither my_memb_entries nor my_left_memb_list. In my environment, this failure directly causes a segmentation fault when my_cluster_node_load() is called in CLM.
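
To illustrate the failing lookup, here is a minimal sketch in C. It is only a model: struct member, member_find() and member_get_checked() are hypothetical names I made up, not corosync or openais code. It shows why a lookup keyed on 192.168.1.1 finds nothing in a ring that contains only 127.0.0.1, and how a caller could check for that instead of dereferencing the result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a membership list entry; not the corosync type. */
struct member {
    const char *addr;
};

/* Return the index of addr in list, or -1 when addr is not a member. */
static int member_find(const struct member *list, size_t n, const char *addr)
{
    assert(list != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(list[i].addr, addr) == 0) {
            return (int)i;
        }
    }
    return -1;
}

/*
 * Models a caller that checks the lookup result instead of
 * dereferencing it blindly. Returns NULL when the node is unknown.
 */
static const struct member *member_get_checked(const struct member *list,
                                               size_t n,
                                               const char *node_addr)
{
    int idx = member_find(list, n, node_addr);
    return (idx < 0) ? NULL : &list[idx];
}
```

With a single-node ring holding only 127.0.0.1, a lookup for 192.168.1.1 returns NULL in this model, which corresponds to the case where the unchecked result leads to the crash.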

This segmentation fault can be reproduced by starting aisexec (openais-1.1.4) with only the CLM service loaded, then running ifdown on the network interface configured in corosync.conf, that is, the one holding 192.168.1.1 in the example above.

I think the straightforward fix is to have timer_function_netif_check_timeout() call netif_determine() with a temporary local variable instead of instance->totem_interface->boundto, so that boundto stays at 127.0.0.1.
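
A rough sketch of that idea, as a simplified model rather than a patch: struct iface_model, model_netif_determine() and the two check functions are hypothetical stand-ins, and I am assuming here that the determine step simply reports the configured bindnet address while the link is down.

```c
#include <stdio.h>
#include <string.h>

#define ADDR_LEN 16

/* Hypothetical, simplified interface state; field names follow the post. */
struct iface_model {
    char bindnet[ADDR_LEN]; /* address configured in corosync.conf */
    char boundto[ADDR_LEN]; /* address the node is currently bound to */
    int link_up;            /* 1 if the interface is up, 0 if it is down */
};

/*
 * Stand-in for netif_determine(): writes the resolved address to out
 * and reports the link state. Assumption: it resolves to bindnet.
 */
static void model_netif_determine(const struct iface_model *iface,
                                  char *out, int *interface_up)
{
    snprintf(out, ADDR_LEN, "%s", iface->bindnet);
    *interface_up = iface->link_up;
}

/* Current behavior: the check writes straight into boundto. */
static int check_timeout_buggy(struct iface_model *iface)
{
    int interface_up;
    model_netif_determine(iface, iface->boundto, &interface_up);
    return interface_up;
}

/* Proposed behavior: the check writes into a scratch buffer only. */
static int check_timeout_fixed(struct iface_model *iface)
{
    char scratch[ADDR_LEN];
    int interface_up;
    model_netif_determine(iface, scratch, &interface_up);
    return interface_up;
}
```

In this model, the buggy check leaves boundto at 192.168.1.1 even though the link is still down, while the fixed check leaves it at 127.0.0.1. What should happen to boundto once the interface really comes back up is outside the scope of this sketch.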

Moreover, digging deeper into this issue, I seem to have found the reason why timer_function_netif_check_timeout() can fire before consensus is reached: in memb_state_gather_enter(), my_proc_list has two members when we try to build the single-node ring. One is 127.0.0.1 and the other is 192.168.1.1! We cannot expect to receive a JOIN message from 192.168.1.1. It is a redundant entry in this case, and it makes consensus time out. So I think memb_state_gather_enter() should filter it out in this case.
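
The effect of the redundant entry can be modeled like this. Again this is only a sketch: proc_list_filter() and consensus_reached() are hypothetical helpers, and the consensus rule is simplified to "every entry in the process list has sent a JOIN", which is my assumption and not the full totemsrp logic.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Remove every entry equal to addr from list, compacting it in place.
 * Returns the new number of entries.
 */
static size_t proc_list_filter(const char **list, size_t n, const char *addr)
{
    size_t kept = 0;
    assert(list != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(list[i], addr) != 0) {
            list[kept++] = list[i];
        }
    }
    return kept;
}

/*
 * Simplified consensus rule: returns 1 only if every process in
 * proc_list also appears in joined, otherwise 0.
 */
static int consensus_reached(const char *const *proc_list, size_t n_proc,
                             const char *const *joined, size_t n_joined)
{
    for (size_t i = 0; i < n_proc; i++) {
        int found = 0;
        for (size_t j = 0; j < n_joined; j++) {
            if (strcmp(proc_list[i], joined[j]) == 0) {
                found = 1;
                break;
            }
        }
        if (!found) {
            return 0;
        }
    }
    return 1;
}
```

With the list { 127.0.0.1, 192.168.1.1 } and a JOIN from 127.0.0.1 only, consensus_reached() returns 0 in this model; after filtering out 192.168.1.1 it returns 1.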

So I tried filtering 192.168.1.1 out of my_proc_list in that case. Consensus was then reached quickly, without timing out, and the new ring was built so quickly that my_cluster_node_load() in CLM completed correctly before timer_function_netif_check_timeout() fired. So whether or not timer_function_netif_check_timeout() resets instance->totem_interface->boundto, the segmentation fault has no opportunity to occur.

Please take a look at these issues.

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
