Hi Christine,

Has there been any progress on a fix for this problem?

We think it is a serious problem that this log message confuses system administrators.
* The problem occurs rarely, but its impact on a system administrator is large when it does occur.

Even if a proper fix is difficult, a log message reporting that a LEAVE message was discarded would, for now, help the administrator avoid that confusion.

>> static int net_deliver_fn (
>>     int fd,
>>     int revents,
>>     void *data)
>> {
>>     struct totemudp_instance *instance = (struct totemudp_instance *)data;
>>     struct msghdr msg_recv;
>>     struct iovec *iovec;
>> (snip)
>>     /*
>>      * Drop all non-mcast messages (more specifically join
>>      * messages should be dropped)
>>      */
>>     message_type = (char *)iovec->iov_base;
>>     if (instance->flushing == 1 && *message_type == MESSAGE_TYPE_MEMB_JOIN) {
>>         iovec->iov_len = FRAME_SIZE_MAX;
------> I think that some kind of log should appear here.
>>         return (0);
>>     }
>> (snip)

Best Regards,
Hideo Yamauchi.

----- Original Message -----
> From: "renayama19661014@xxxxxxxxx" <renayama19661014@xxxxxxxxx>
> To: Christine Caulfield <ccaulfie@xxxxxxxxxx>; "discuss@xxxxxxxxxxxx" <discuss@xxxxxxxxxxxx>
> Cc:
> Date: 2015/3/11, Wed 06:46
> Subject: Re: It is sometimes judged to be node trouble.
>
> Hi Christine,
>
> Thank you for your comments!
>
>> Yes, I think I can see what's happening here. JOIN messages get
>> discarded during flushing because that can cause entry into GATHER state
>> at an inappropriate time. For a normal JOIN message that's fine because
>> the joining node will re-send the message. But this also causes LEAVE
>> messages to be discarded too (as they are a special case of JOIN). This
>> causes the error you are seeing.
>>
>> The fix is non-trivial, sadly, but I'm looking into it
>
> The fix for this problem may well be large.
> However, we do not want the departure of this node to be reported as a failure.
> We hope it can be fixed.
>
> Best Regards,
> Hideo Yamauchi.
>
> ----- Original Message -----
>> From: Christine Caulfield <ccaulfie@xxxxxxxxxx>
>> To: discuss@xxxxxxxxxxxx
>> Cc:
>> Date: 2015/3/10, Tue 21:55
>> Subject: Re: It is sometimes judged to be node trouble.
>>
>> On 09/03/15 01:09, renayama19661014@xxxxxxxxx wrote:
>>> Hi All,
>>>
>>> We built a cluster with corosync.
>>> Afterwards, we shut down one node.
>>>
>>> The node that we shut down is then sometimes judged by the cluster to have failed.
>>>
>>> ---------------------------------------
>>> Oct 21 11:03:30 XXX corosync[21677]: [TOTEM ] A processor failed, forming new configuration.
>>> ---------------------------------------
>>>
>>> This phenomenon seems to occur with very low probability.
>>>
>>> We think it is a problem that the log reports the node that we shut down as having failed.
>>>
>>> The cause is that the leave message (memb_leave_message_send), which is sent when a user stops corosync, may be thrown away.
>>>
>>> static int net_deliver_fn (
>>>     int fd,
>>>     int revents,
>>>     void *data)
>>> {
>>>     struct totemudp_instance *instance = (struct totemudp_instance *)data;
>>>     struct msghdr msg_recv;
>>>     struct iovec *iovec;
>>> (snip)
>>>     /*
>>>      * Drop all non-mcast messages (more specifically join
>>>      * messages should be dropped)
>>>      */
>>>     message_type = (char *)iovec->iov_base;
>>>     if (instance->flushing == 1 && *message_type == MESSAGE_TYPE_MEMB_JOIN) {
>>>         iovec->iov_len = FRAME_SIZE_MAX;
>>>         return (0);
>>>     }
>>> (snip)
>>>
>>> We want the leave to be handled reliably so that the node stops normally.
>>> Is it possible to fix this handling in corosync?
>>> * We think it is a problem that the log reports the node that we shut down as having failed.
>>
>> Yes, I think I can see what's happening here. JOIN messages get
>> discarded during flushing because that can cause entry into GATHER state
>> at an inappropriate time. For a normal JOIN message that's fine because
>> the joining node will re-send the message. But this also causes LEAVE
>> messages to be discarded too (as they are a special case of JOIN). This
>> causes the error you are seeing.
>>
>> The fix is non-trivial, sadly, but I'm looking into it
>>
>> Thanks for the report!
>>
>> Chrissie
>>
>>> We hope that this problem is fixed in the next version if possible.
>>>
>>>
>>> Best Regards,
>>> Hideo Yamauchi.
>>>
>>> _______________________________________________
>>> discuss mailing list
>>> discuss@xxxxxxxxxxxx
>>> http://lists.corosync.org/mailman/listinfo/discuss
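
For illustration only, the logging suggested at the marked point in net_deliver_fn could look roughly like the standalone sketch below. It is not a patch against corosync: the function name, the counter, the message text, and the numeric value of the JOIN type are all assumptions made for this example, and it writes to stderr with fprintf instead of using corosync's own logging facility. It only shows the idea that the drop branch reports what it discarded, so an administrator can relate a later "A processor failed" entry to a possibly lost LEAVE message.

```c
#include <stddef.h>
#include <stdio.h>

/* Placeholder value for this sketch; the real constant lives in the corosync sources. */
#define SKETCH_MESSAGE_TYPE_MEMB_JOIN 3

/* Hypothetical counter: how many JOIN-type frames were dropped while flushing. */
static unsigned long sketch_dropped_join_count = 0;

/*
 * Hypothetical helper mirroring the quoted check.
 * Returns 1 if the frame must be dropped, 0 if it should be processed.
 * The first byte of the frame is treated as the message type, as in the
 * quoted code (message_type = (char *)iovec->iov_base).
 */
static int sketch_drop_while_flushing(int flushing, const char *frame, size_t len)
{
    if (len == 0) {
        return 0; /* nothing to inspect */
    }
    if (flushing == 1 && frame[0] == SKETCH_MESSAGE_TYPE_MEMB_JOIN) {
        sketch_dropped_join_count++;
        /* Hypothetical message text; LEAVE is a special case of JOIN,
         * so a dropped JOIN-type frame may have been a LEAVE. */
        fprintf(stderr,
                "JOIN-type message dropped during flush "
                "(a LEAVE message may have been lost)\n");
        return 1;
    }
    return 0;
}
```

In the real function the log call would sit just before the existing return (0) in the flushing branch. Whether the dropped frame was a LEAVE or an ordinary JOIN is not determined here, so the message is worded as a possibility.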