Hi Christine,

Thank you for your comments.

> I've been looking at it and it's very, very complex to fix. The code you
> mentioned that discards the LEAVE (and JOIN) messages was put there to
> avoid a nasty crash and I'm struggling to find a solution that doesn't
> just re-introduce that bug. As you suggest, a message might be helpful
> though.

Yes. I understand that this correction is very difficult, as you say.

A log message would certainly help a system manager.
 * From it, the system manager could infer that a discarded LEAVE message may be the cause of "A Processor failed...".

Best Regards,
Hideo Yamauchi.

----- Original Message -----
> From: Christine Caulfield <ccaulfie@xxxxxxxxxx>
> To: discuss@xxxxxxxxxxxx
> Cc:
> Date: 2015/4/17, Fri 16:52
> Subject: Re: It is sometimes judged to be node trouble.
>
> On 16/04/15 07:12, renayama19661014@xxxxxxxxx wrote:
>> Hi Christine,
>>
>> Does the correction for this problem have progress?
>>
>> We think that it has a problem very much that a system-manager is confused in this log.
>> * The outbreak frequency of this problem is low, but the impact to a system-manager is big when it occurs.
>
> I've been looking at it and it's very, very complex to fix. The code you
> mentioned that discards the LEAVE (and JOIN) messages was put there to
> avoid a nasty crash and I'm struggling to find a solution that doesn't
> just re-introduce that bug. As you suggest, a message might be helpful
> though.
>
> Chrissie
>
>> The log may evade the confusion of the system-manager if the log that canceled LEAVE message appears right now even if a correction is difficult.
>>
>>
>>>> static int net_deliver_fn (
>>>>     int fd,
>>>>     int revents,
>>>>     void *data)
>>>> {
>>>>     struct totemudp_instance *instance = (struct totemudp_instance *)data;
>>>>     struct msghdr msg_recv;
>>>>     struct iovec *iovec;
>>>> (snip)
>>>>     /*
>>>>      * Drop all non-mcast messages (more specifically join
>>>>      * messages should be dropped)
>>>>      */
>>>>     message_type = (char *)iovec->iov_base;
>>>>     if (instance->flushing == 1 && *message_type == MESSAGE_TYPE_MEMB_JOIN) {
>>>>         iovec->iov_len = FRAME_SIZE_MAX;
>>
>> ------> I think that some kind of log should appear here.
>>
>>>>         return (0);
>>>>     }
>>>> (snip)
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>>
>>
>> ----- Original Message -----
>>> From: "renayama19661014@xxxxxxxxx" <renayama19661014@xxxxxxxxx>
>>> To: Christine Caulfield <ccaulfie@xxxxxxxxxx>; "discuss@xxxxxxxxxxxx" <discuss@xxxxxxxxxxxx>
>>> Cc:
>>> Date: 2015/3/11, Wed 06:46
>>> Subject: Re: It is sometimes judged to be node trouble.
>>>
>>> Hi Christine,
>>>
>>> Thank you for comments!
>>>
>>>> Yes, I think I can see what's happening here. JOIN messages get
>>>> discarded during flushing because that can cause entry into GATHER state
>>>> at an inappropriate time. For a normal JOIN message that's fine because
>>>> the joining node will re-send the message. But this also causes LEAVE
>>>> messages to be discarded too (as they are a special case of JOIN). This
>>>> causes the error you are seeing.
>>>>
>>>> The fix is non-trivial, sadly, but I'm looking into it
>>>
>>>
>>> Possibly I think that the correction for this problem is big.
>>> However, we wish the secession of this node is not judged with trouble. We hope fix it....
>>>
>>> Best Regards,
>>> Hideo Yamauchi.
>>>
>>>
>>>
>>>
>>> ----- Original Message -----
>>>> From: Christine Caulfield <ccaulfie@xxxxxxxxxx>
>>>> To: discuss@xxxxxxxxxxxx
>>>> Cc:
>>>> Date: 2015/3/10, Tue 21:55
>>>> Subject: Re: It is sometimes judged to be node trouble.
>>>>
>>>> On 09/03/15 01:09, renayama19661014@xxxxxxxxx wrote:
>>>>> Hi All,
>>>>>
>>>>> We constitute a cluster in corosync.
>>>>> We shutdown one node afterwards.
>>>>>
>>>>> Then the node that we shutdown is sometimes judged with fail by a cluster.
>>>>>
>>>>> ---------------------------------------
>>>>> Oct 21 11:03:30 XXX corosync[21677]: [TOTEM ] A processor failed, forming new configuration.
>>>>> ---------------------------------------
>>>>>
>>>>> This phenomenon seems to occur with very low probability.
>>>>>
>>>>> We think that it is a problem that there is the log that the node that we shutdown is taken as trouble.
>>>>>
>>>>> The problem is because the leave message(memb_leave_message_send) which is sent when a user stops corosync may be thrown away.
>>>>>
>>>>> static int net_deliver_fn (
>>>>>     int fd,
>>>>>     int revents,
>>>>>     void *data)
>>>>> {
>>>>>     struct totemudp_instance *instance = (struct totemudp_instance *)data;
>>>>>     struct msghdr msg_recv;
>>>>>     struct iovec *iovec;
>>>>> (snip)
>>>>>     /*
>>>>>      * Drop all non-mcast messages (more specifically join
>>>>>      * messages should be dropped)
>>>>>      */
>>>>>     message_type = (char *)iovec->iov_base;
>>>>>     if (instance->flushing == 1 && *message_type == MESSAGE_TYPE_MEMB_JOIN) {
>>>>>         iovec->iov_len = FRAME_SIZE_MAX;
>>>>>         return (0);
>>>>>     }
>>>>> (snip)
>>>>>
>>>>> A secession leave is handled definitely and wishes a node stops.
>>>>> Is the correction of the handling of problem of corosync possible?
>>>>> * We think that it is a problem that there is the log that the node that we shutdown is taken as trouble.
>>>>
>>>> Yes, I think I can see what's happening here. JOIN messages get
>>>> discarded during flushing because that can cause entry into GATHER state
>>>> at an inappropriate time. For a normal JOIN message that's fine because
>>>> the joining node will re-send the message. But this also causes LEAVE
>>>> messages to be discarded too (as they are a special case of JOIN). This
>>>> causes the error you are seeing.
>>>>
>>>> The fix is non-trivial, sadly, but I'm looking into it
>>>>
>>>> Thanks for the report!
>>>>
>>>> Chrissie
>>>>
>>>>
>>>>
>>>>> We hope that this problem is revised in the next version if possible.
>>>>>
>>>>>
>>>>> Best Regards,
>>>>> Hideo Yamauchi.
>>>>>
>>
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss
>
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
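
For reference, the log message proposed in this thread could look roughly like the sketch below. It is a simplified, standalone model of the discard path quoted from net_deliver_fn, not corosync code: struct flush_state, drop_during_flush() and the dropped_memb counter are hypothetical names, the numeric value of MESSAGE_TYPE_MEMB_JOIN is assumed for illustration, and fprintf() stands in for whatever logging call corosync would actually use. The discard rule itself is unchanged; only the warning is added.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed value, for illustration only; the real constant lives in the corosync sources. */
#define MESSAGE_TYPE_MEMB_JOIN 3

/* Hypothetical stand-in for the parts of struct totemudp_instance used here. */
struct flush_state {
    int flushing;               /* 1 while the receive socket is being flushed */
    unsigned long dropped_memb; /* hypothetical count of discarded JOIN/LEAVE frames */
};

/*
 * Returns 1 if the frame is discarded, 0 if it may be delivered.
 * Same rule as in the quoted code: while flushing, frames whose first
 * byte is MESSAGE_TYPE_MEMB_JOIN are dropped. LEAVE messages travel as
 * a special case of JOIN, so they are dropped as well.
 */
static int drop_during_flush(struct flush_state *st, const char *frame, size_t len)
{
    assert(st != NULL && frame != NULL);

    if (len == 0) {
        return 0; /* an empty frame has no type byte to inspect */
    }

    if (st->flushing == 1 && frame[0] == MESSAGE_TYPE_MEMB_JOIN) {
        st->dropped_memb++;
        /* The only addition: tell the administrator that a drop happened. */
        fprintf(stderr,
                "TOTEM: discarded a JOIN/LEAVE message while flushing "
                "(%lu so far); a later \"A processor failed\" may refer to "
                "a node that left cleanly\n",
                st->dropped_memb);
        return 1;
    }

    return 0;
}
```

With such a message in the log, a "A processor failed, forming new configuration." entry that follows shortly after a discard can be read as a possible side effect of the lost LEAVE rather than as a real node failure.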