Hi Christine,

Thank you for your comments!

> Yes, I think I can see what's happening here. JOIN messages get
> discarded during flushing because that can cause entry into GATHER state
> at an inappropriate time. For a normal JOIN message that's fine because
> the joining node will re-send the message. But this also causes LEAVE
> messages to be discarded too (as they are a special case of JOIN). This
> causes the error you are seeing.
>
> The fix is non-trivial, sadly, but I'm looking into it

I understand that the fix for this problem may be a large one.
However, we would like a node that leaves the cluster normally not to be judged as failed.
We hope it can be fixed.

Best Regards,
Hideo Yamauchi.

----- Original Message -----
> From: Christine Caulfield <ccaulfie@xxxxxxxxxx>
> To: discuss@xxxxxxxxxxxx
> Cc:
> Date: 2015/3/10, Tue 21:55
> Subject: Re: It is sometimes judged to be node trouble.
>
> On 09/03/15 01:09, renayama19661014@xxxxxxxxx wrote:
>> Hi All,
>>
>> We built a cluster with corosync.
>> Afterwards, we shut down one node.
>>
>> The node that we shut down is then sometimes judged by the cluster to have failed.
>>
>> ---------------------------------------
>> Oct 21 11:03:30 XXX corosync[21677]: [TOTEM ] A processor failed, forming new configuration.
>> ---------------------------------------
>>
>> This phenomenon seems to occur with very low probability.
>>
>> We think it is a problem that the log reports the node that we shut down as failed.
>>
>> The cause is that the leave message (memb_leave_message_send), which is sent when a user stops corosync, may be thrown away.
>>
>> static int net_deliver_fn (
>>     int fd,
>>     int revents,
>>     void *data)
>> {
>>     struct totemudp_instance *instance = (struct totemudp_instance *)data;
>>     struct msghdr msg_recv;
>>     struct iovec *iovec;
>> (snip)
>>     /*
>>      * Drop all non-mcast messages (more specifically join
>>      * messages should be dropped)
>>      */
>>     message_type = (char *)iovec->iov_base;
>>     if (instance->flushing == 1 && *message_type == MESSAGE_TYPE_MEMB_JOIN) {
>>         iovec->iov_len = FRAME_SIZE_MAX;
>>         return (0);
>>     }
>> (snip)
>>
>> We would like the leave to be handled reliably so that the node stops normally.
>> Is it possible to correct how corosync handles this?
>> * We think it is a problem that the log reports the node that we shut down as failed.
>
> Yes, I think I can see what's happening here. JOIN messages get
> discarded during flushing because that can cause entry into GATHER state
> at an inappropriate time. For a normal JOIN message that's fine because
> the joining node will re-send the message. But this also causes LEAVE
> messages to be discarded too (as they are a special case of JOIN). This
> causes the error you are seeing.
>
> The fix is non-trivial, sadly, but I'm looking into it
>
> Thanks for the report!
>
> Chrissie
>
>> We hope that this problem is fixed in the next version if possible.
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>> _______________________________________________
>> discuss mailing list
>> discuss@xxxxxxxxxxxx
>> http://lists.corosync.org/mailman/listinfo/discuss
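
The behaviour discussed in this thread can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not corosync code and not the fix Chrissie refers to: the `sketch_` types and functions are hypothetical, and the `is_leave` flag is a placeholder, since the thread does not say how a LEAVE is told apart from an ordinary JOIN on the wire. It shows why a filter that looks only at the message type drops LEAVE messages during flushing, and what a leave-aware filter would decide instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are NOT the corosync definitions. */
enum sketch_msg_type {
    SKETCH_MSG_MCAST = 0,
    SKETCH_MSG_MEMB_JOIN = 1
};

struct sketch_frame {
    enum sketch_msg_type type; /* a LEAVE carries the JOIN type, per the thread */
    bool is_leave;             /* placeholder: real detection is not described */
};

/*
 * Models the check quoted from net_deliver_fn: while flushing, every
 * JOIN-typed frame is dropped, so a LEAVE is dropped as well.
 */
static bool sketch_drop_current(bool flushing, const struct sketch_frame *frame)
{
    assert(frame != NULL);
    return flushing && frame->type == SKETCH_MSG_MEMB_JOIN;
}

/*
 * One conceivable alternative (an assumption, not the upstream fix):
 * keep dropping ordinary JOINs during flushing, but let LEAVEs through.
 * It ignores the GATHER-state timing concern raised in the thread.
 */
static bool sketch_drop_leave_aware(bool flushing, const struct sketch_frame *frame)
{
    assert(frame != NULL);
    return sketch_drop_current(flushing, frame) && !frame->is_leave;
}
```

With `flushing` set, `sketch_drop_current` returns true for both a JOIN and a LEAVE frame, which matches the symptom reported above: the leaving node's message is lost and the node is later judged to have failed. `sketch_drop_leave_aware` returns false for the LEAVE frame only. Whether delivering a LEAVE during flushing is safe is exactly the part described as non-trivial, so this should be read as an illustration of the problem rather than a proposed patch.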