Re: It is sometimes judged to be node trouble.

renayama19661014@xxxxxxxxx · Wed, 11 Mar 2015 06:46:35 +0900 (JST)

Hi Christine,

Thank you for your comments!

> Yes, I think I can see what's happening here. JOIN messages get
> discarded during flushing because that can cause entry into GATHER state
> at an inappropriate time. For a normal JOIN message that's fine because
> the joining node will re-send the message. But this also causes LEAVE
> messages to be discarded too (as they are a special case of JOIN). This
> causes the error you are seeing.
> 
> The fix is non-trivial, sadly, but I'm looking into it

I understand that the fix for this problem may be a large one.
However, we would like a node that leaves the cluster deliberately not to be reported as failed. We hope it can be fixed.
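
To illustrate the behaviour described above, here is a minimal, self-contained sketch. It is not corosync code: the enum values, the msg_header struct and the is_leave flag are all hypothetical stand-ins chosen for illustration. The first function models a filter that looks only at the message type while flushing, so a LEAVE (which carries the JOIN type) is dropped together with ordinary JOINs. The second function only shows the distinction we are hoping for; it says nothing about whether passing a LEAVE through during flushing is safe, which is presumably why the fix is non-trivial.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical message types; the values are illustrative only. */
enum msg_type {
    MSG_MCAST = 0,
    MSG_MEMB_JOIN = 3
};

/*
 * Hypothetical, simplified header. The is_leave flag stands in for
 * whatever marks a JOIN as a LEAVE; it is not corosync's real layout.
 */
struct msg_header {
    enum msg_type type;
    bool is_leave;
};

/*
 * Models a type-only filter: while flushing, every message typed as
 * JOIN is dropped. A LEAVE carries the JOIN type, so it is dropped too.
 */
bool dropped_by_type_only(bool flushing, const struct msg_header *hdr)
{
    assert(hdr != NULL);
    return flushing && hdr->type == MSG_MEMB_JOIN;
}

/*
 * Illustrative alternative (an assumption, not the actual fix):
 * drop plain JOINs while flushing, but let a LEAVE through.
 */
bool dropped_unless_leave(bool flushing, const struct msg_header *hdr)
{
    assert(hdr != NULL);
    return flushing && hdr->type == MSG_MEMB_JOIN && !hdr->is_leave;
}
```

With flushing set, dropped_by_type_only() returns true for both a JOIN and a LEAVE, while dropped_unless_leave() returns true only for the plain JOIN. Neither function drops anything when flushing is not set.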

Best Regards,
Hideo Yamauchi.

----- Original Message -----
> From: Christine Caulfield <ccaulfie@xxxxxxxxxx>
> To: discuss@xxxxxxxxxxxx
> Cc: 
> Date: 2015/3/10, Tue 21:55
> Subject: Re:  It is sometimes judged to be node trouble.
> 
> On 09/03/15 01:09, renayama19661014@xxxxxxxxx wrote:
>>  Hi All,
>> 
>>  We have built a cluster with corosync.
>>  Afterwards, we shut down one node.
>> 
>>  The node that we shut down is then sometimes judged by the cluster to have failed.
>> 
>>  ---------------------------------------
>>  Oct 21 11:03:30 XXX corosync[21677]: [TOTEM ] A processor failed, forming new configuration.
>>  ---------------------------------------
>> 
>>  This phenomenon seems to occur with very low probability.
>> 
>>  We consider it a problem that the log reports the node that we shut down as failed.
>> 
>>  The cause is that the leave message (memb_leave_message_send), which is sent when a user stops corosync, may be discarded.
>> 
>>  static int net_deliver_fn (
>>      int fd,
>>      int revents,
>>      void *data)
>>  {
>>      struct totemudp_instance *instance = (struct totemudp_instance *)data;
>>      struct msghdr msg_recv;
>>      struct iovec *iovec;
>>  (snip)
>>      /*
>>       * Drop all non-mcast messages (more specifically join
>>       * messages should be dropped)
>>       */
>>      message_type = (char *)iovec->iov_base;
>>      if (instance->flushing == 1 && *message_type == MESSAGE_TYPE_MEMB_JOIN) {
>>          iovec->iov_len = FRAME_SIZE_MAX;
>>          return (0);
>>      }
>>  (snip)
>> 
>>  We would like a deliberate leave to be handled reliably so that the node stops cleanly.
>>  Is it possible to fix this handling in corosync?
>>   * We consider it a problem that the log reports the node that we shut down as failed.
> 
> Yes, I think I can see what's happening here. JOIN messages get
> discarded during flushing because that can cause entry into GATHER state
> at an inappropriate time. For a normal JOIN message that's fine because
> the joining node will re-send the message. But this also causes LEAVE
> messages to be discarded too (as they are a special case of JOIN). This
> causes the error you are seeing.
> 
> The fix is non-trivial, sadly, but I'm looking into it
> 
> Thanks for the report!
> 
> Chrissie
> 
> 
> 
>>  We hope that this problem can be fixed in the next version if possible.
>> 
>> 
>>  Best Regards,
>>  Hideo Yamauchi.
>> 
>> 
>>  _______________________________________________
>>  discuss mailing list
>>  discuss@xxxxxxxxxxxx
>>  http://lists.corosync.org/mailman/listinfo/discuss
>> 