Re: It is sometimes judged to be node trouble.

On 09/03/15 01:09, renayama19661014@xxxxxxxxx wrote:
> Hi All,
> 
> We have set up a cluster with corosync.
> We then shut down one node.
> 
> The cluster sometimes judges the node that we shut down as failed.
> 
> ---------------------------------------
> Oct 21 11:03:30 XXX corosync[21677]: [TOTEM ] A processor failed, forming new configuration.
> ---------------------------------------
> 
> This phenomenon seems to occur with very low probability.
> 
> We consider it a problem that the log reports the node we shut down as failed.
> 
> The cause is that the leave message (memb_leave_message_send), which is sent when a user stops corosync, may be discarded.
> 
> static int net_deliver_fn (
>         int fd,
>         int revents,
>         void *data)
> {
>         struct totemudp_instance *instance = (struct totemudp_instance *)data;
>         struct msghdr msg_recv;
>         struct iovec *iovec;
> (snip)
>         /*
>          * Drop all non-mcast messages (more specifically join
>          * messages should be dropped)
>          */
>         message_type = (char *)iovec->iov_base;
>         if (instance->flushing == 1 && *message_type == MESSAGE_TYPE_MEMB_JOIN) {
>                 iovec->iov_len = FRAME_SIZE_MAX;
>                 return (0);
>         }
> (snip)
> 
> We would like the leave message to be handled reliably so that the node stops cleanly.
> Is it possible to fix this handling in corosync?
>  * We consider it a problem that the log reports the node we shut down as failed.

Yes, I think I can see what's happening here. JOIN messages get
discarded during flushing because that can cause entry into GATHER state
at an inappropriate time. For a normal JOIN message that's fine because
the joining node will re-send the message. But this also causes LEAVE
messages to be discarded (as they are a special case of JOIN). This
causes the error you are seeing.
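
To make that concrete, here is a rough toy model of the check quoted
above. It is only a sketch and not the corosync code: the toy_* names,
the enum values and the is_leave flag are all made up for illustration.
The point is that the filter only looks at the message type, so a LEAVE
that is carried as a JOIN-typed message is dropped along with ordinary
JOINs while flushing is set.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; the real constants live in the corosync sources. */
enum toy_message_type {
    TOY_MESSAGE_TYPE_MCAST = 1,
    TOY_MESSAGE_TYPE_MEMB_JOIN = 3
};

/* Toy message: a LEAVE is modelled as a JOIN-typed message with a flag set. */
struct toy_message {
    char type;      /* the only field the filter below looks at */
    bool is_leave;  /* assumption: stands in for whatever marks a leave */
};

static struct toy_message toy_make_join(void)
{
    struct toy_message msg = { TOY_MESSAGE_TYPE_MEMB_JOIN, false };
    return msg;
}

static struct toy_message toy_make_leave(void)
{
    /* Same type as a JOIN, which is why the filter cannot tell them apart. */
    struct toy_message msg = { TOY_MESSAGE_TYPE_MEMB_JOIN, true };
    return msg;
}

/* Mirrors the shape of the quoted check: decide on flushing and type only. */
static bool toy_dropped_while_flushing(int flushing, const struct toy_message *msg)
{
    return flushing == 1 && msg->type == TOY_MESSAGE_TYPE_MEMB_JOIN;
}
```

In this model toy_dropped_while_flushing(1, &msg) is true for both a
message from toy_make_join() and one from toy_make_leave(), and false
for either once flushing is 0.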

The fix is non-trivial, sadly, but I'm looking into it.

Thanks for the report!

Chrissie



> We hope this problem can be fixed in the next version, if possible.
> 
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss
> 
