Re: Strange behavior of timer_function_netif_check_timeout()

jason <huzhijiang@xxxxxxxxx> · Fri, 4 Jan 2013 15:12:01 +0800

Hi Jan,

Thanks for the information. I will switch to corosync-1.4.4.

On Jan 3, 2013 12:08 AM, "Jan Friesse" <jfriesse@xxxxxxxxxx> wrote:

Jason,

Running ifdown on the interface is really a NO NO NO for corosync. Just
don't do that. Corosync then behaves in an extra weird way. It's even
worse with RRP. The whole idea of rebinding to 127.0.0.1 is simply BAD.

If you want to test network failure, you MUST use version >= 1.4.5 +
iptables, or simulate the network failure on the switch (this works with
any version).

Regards,

  Honza

jason wrote:

> Hi All,

> When a node is bound to 127.0.0.1 because its last interface went down, and
> timer_function_netif_check_timeout() kicks in before we reach consensus,
> netif_determine() in timer_function_netif_check_timeout() resets
> instance->totem_interface->boundto back to
> instance->totem_interface->bindnet, say 192.168.1.1. As a result, after the
> single-node ring is created, anyone calling something like
> api->totem_iface_get(api->totem_nodeid_get()...) will fail, because
> api->totem_nodeid_get() returns 192.168.1.1, which is not in my_memb_entries
> or my_left_memb_list. In my environment this failure directly causes a
> segmentation fault when calling my_cluster_node_load() in CLM.

>
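To make the failed lookup above concrete, here is a minimal sketch. It assumes a simplified membership list; struct addr and find_member() are hypothetical stand-ins for illustration, not the real totem types or API. It only shows why an identity of 192.168.1.1 cannot be found in a ring whose sole member is 127.0.0.1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a totem address; not the real corosync type. */
struct addr { char ip[16]; };

/* Return the index of `id` in members[0..n), or -1 if it is absent. */
static int find_member(const struct addr *members, size_t n,
                       const struct addr *id)
{
    assert(members != NULL && id != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(members[i].ip, id->ip) == 0) {
            return (int)i;
        }
    }
    return -1; /* caller must handle this instead of using the result */
}
```

With a member list holding only 127.0.0.1, looking up 192.168.1.1 returns -1. A caller that uses the result without checking it would go on to read invalid memory, which would be consistent with the crash described.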

> This segmentation fault can be reproduced by starting aisexec
> (openais-1.1.4) with only the CLM service loaded, then running ifdown on the
> network interface configured in corosync.conf, i.e. 192.168.1.1 in the
> above example.

>

> I think the straightforward way to resolve this issue is to let
> timer_function_netif_check_timeout() call netif_determine() with a
> temporary local variable instead of instance->totem_interface->boundto, so
> that boundto stays at 127.0.0.1.

>
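The proposal above could look roughly like the following sketch. It is not the actual corosync code: struct iface, determine_addr() and check_timeout() are hypothetical, simplified stand-ins for totem_interface, netif_determine() and timer_function_netif_check_timeout(). It assumes the probe simply reports the configured address while the interface is down.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the totem structures; not the real types. */
struct addr { char ip[16]; };

struct iface {
    struct addr bindnet; /* address configured in corosync.conf */
    struct addr boundto; /* address the node is currently bound to */
};

/* Stand-in for netif_determine(): writes the configured address to *out. */
static void determine_addr(const struct addr *bindnet, struct addr *out,
                           bool *up)
{
    assert(bindnet != NULL && out != NULL && up != NULL);
    *out = *bindnet;
    *up = false; /* the interface is still down in this scenario */
}

/* Periodic check as proposed: probe into a temporary, leave boundto alone.
 * Returns true only if the interface is up and a rebind would be needed. */
static bool check_timeout(const struct iface *ifc)
{
    struct addr probed;
    bool up;

    determine_addr(&ifc->bindnet, &probed, &up);
    return up && strcmp(probed.ip, ifc->boundto.ip) != 0;
}
```

Because the probe result lives in a local variable, boundto keeps its value of 127.0.0.1 while the interface is down, so the node identity stays consistent with the single-node ring.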

> Moreover, digging into this issue more deeply, I seem to have found the
> reason why timer_function_netif_check_timeout() can kick in before
> consensus is reached: in memb_state_gather_enter(), we have two members in
> my_proc_list when we try to build the single-node ring. One is 127.0.0.1,
> the other is 192.168.1.1! And we cannot expect to receive a JOIN message
> from 192.168.1.1. It is a redundant entry in this case, and it makes
> consensus time out. So I think it should be filtered out by
> memb_state_gather_enter() in this case.

>

> So I tried filtering 192.168.1.1 out of my_proc_list in that case, and I
> saw consensus reached quickly without timing out. The new ring was built
> so quickly that my_cluster_node_load() in CLM completed correctly this
> time, before timer_function_netif_check_timeout() kicked in. So whether or
> not timer_function_netif_check_timeout() resets
> instance->totem_interface->boundto, the segmentation fault has no
> opportunity to happen.

>
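The filtering step described above might be sketched as follows. This assumes the processor list is a plain array of addresses; struct addr and filter_proc_list() are hypothetical names for illustration and do not reflect how my_proc_list is actually stored or where the real change was made.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a totem address; not the real corosync type. */
struct addr { char ip[16]; };

/* Remove every entry equal to `stale` from list[0..n), preserving order.
 * Returns the number of entries that remain. */
static size_t filter_proc_list(struct addr *list, size_t n,
                               const struct addr *stale)
{
    size_t kept = 0;

    assert(list != NULL && stale != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(list[i].ip, stale->ip) != 0) {
            list[kept++] = list[i];
        }
    }
    return kept;
}
```

Applied to a list holding 127.0.0.1 and 192.168.1.1 with 192.168.1.1 as the stale address, one entry remains, so the node no longer waits for a JOIN message that can never arrive.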

> Please take a look at these issues.

> _______________________________________________

> discuss mailing list

> discuss@xxxxxxxxxxxx

> http://lists.corosync.org/mailman/listinfo/discuss
