Re: It is sometimes judged to be node trouble.

Christine Caulfield <ccaulfie@xxxxxxxxxx> · Fri, 17 Apr 2015 08:52:44 +0100

On 16/04/15 07:12, renayama19661014@xxxxxxxxx wrote:
> Hi Christine,
> 
> Has there been any progress on a fix for this problem?
> 
> We consider it a serious problem that this log message confuses system administrators.
>  * This problem occurs rarely, but its impact on a system administrator is large when it does occur.

I've been looking at it and it's very, very complex to fix. The code you
mentioned that discards the LEAVE (and JOIN) messages was put there to
avoid a nasty crash and I'm struggling to find a solution that doesn't
just re-introduce that bug. As you suggest, a message might be helpful
though.
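
As a rough illustration of what such a message could look like, here is a
standalone sketch, not a patch against totemudp.c. It models only the check
quoted further down in this thread. The names struct flush_state,
should_drop_during_flush and the dropped counter are hypothetical, the
value 3 for MESSAGE_TYPE_MEMB_JOIN is an assumption made for this sketch, and
fprintf stands in for whatever logging call the real code would use.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumed value, for illustration only; take the real one from the corosync sources. */
#define MESSAGE_TYPE_MEMB_JOIN 3

/* Hypothetical, reduced stand-in for the relevant part of struct totemudp_instance. */
struct flush_state {
	int flushing;          /* 1 while received frames are being flushed */
	unsigned long dropped; /* hypothetical count of discarded JOIN/LEAVE frames */
};

/*
 * Returns 1 if the frame would be discarded, 0 if it would be delivered.
 * Same condition as the quoted check, plus a log line on the drop path.
 * A LEAVE travels as a JOIN-type frame, so it is caught here as well.
 */
static int should_drop_during_flush(struct flush_state *st,
	const char *frame, size_t len)
{
	if (len == 0) {
		return 0;
	}
	if (st->flushing == 1 && frame[0] == MESSAGE_TYPE_MEMB_JOIN) {
		st->dropped++;
		/* Placeholder wording; the real message text is still to be decided. */
		fprintf(stderr,
			"JOIN/LEAVE message discarded during flush (total %lu)\n",
			st->dropped);
		return 1;
	}
	return 0;
}
```

With something along these lines, an administrator who later sees "A
processor failed" for a node that was shut down cleanly would at least find a
preceding line saying that its LEAVE may have been thrown away.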

Chrissie

> Even if a fix is difficult, logging a message whenever a LEAVE message is discarded would, for now, help avoid confusing the system administrator.
> 
> 
>>>  static int net_deliver_fn (
>>>  int fd,
>>>  int revents,
>>>  void *data)
>>>  {
>>>  struct totemudp_instance *instance = (struct totemudp_instance *)data;
>>>  struct msghdr msg_recv;
>>>  struct iovec *iovec;
>>>  (snip)
>>>  /*
>>>   * Drop all non-mcast messages (more specifically join
>>>   * messages should be dropped)
>>>   */
>>>  message_type = (char *)iovec->iov_base;
>>>  if (instance->flushing == 1 && *message_type == MESSAGE_TYPE_MEMB_JOIN) {
>>>  iovec->iov_len = FRAME_SIZE_MAX;
> 
>  ------> I think that some kind of log should appear here.
> 
>>>  return (0);
>>>  }
>>>  (snip)
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> 
> ----- Original Message -----
>> From: "renayama19661014@xxxxxxxxx" <renayama19661014@xxxxxxxxx>
>> To: Christine Caulfield <ccaulfie@xxxxxxxxxx>; "discuss@xxxxxxxxxxxx" <discuss@xxxxxxxxxxxx>
>> Cc: 
>> Date: 2015/3/11, Wed 06:46
>> Subject: Re:  It is sometimes judged to be node trouble.
>>
>> Hi Christine,
>>
>> Thank you for comments!
>>
>>>  Yes, I think I can see what's happening here. JOIN messages get
>>>  discarded during flushing because that can cause entry into GATHER state
>>>  at an inappropriate time. For a normal JOIN message that's fine because
>>>  the joining node will re-send the message. But this also causes LEAVE
>>>  messages to be discarded too (as they are a special case of JOIN). This
>>>  causes the error you are seeing.
>>>  
>>>  The fix is non-trivial, sadly, but I'm looking into it
>>
>>
>> I understand that the fix for this problem may be a large one.
>> However, we would like a node that leaves the cluster cleanly not to be judged as failed. We hope
>> it can be fixed.
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>>
>>
>>
>> ----- Original Message -----
>>>  From: Christine Caulfield <ccaulfie@xxxxxxxxxx>
>>>  To: discuss@xxxxxxxxxxxx
>>>  Cc: 
>>>  Date: 2015/3/10, Tue 21:55
>>>  Subject: Re:  It is sometimes judged to be node trouble.
>>>
>>>  On 09/03/15 01:09, renayama19661014@xxxxxxxxx wrote:
>>>>   Hi All,
>>>>
>>>>   We built a cluster with corosync.
>>>>   Afterwards, we shut down one node.
>>>>
>>>>   The node that we shut down is then sometimes judged as failed by the cluster.
>>>>
>>>>   ---------------------------------------
>>>>   Oct 21 11:03:30 XXX corosync[21677]: [TOTEM ] A processor failed, forming new configuration.
>>>>   ---------------------------------------
>>>>
>>>>   This phenomenon seems to occur with very low probability.
>>>>
>>>>   We think it is a problem that the log reports the node we shut down as failed.
>>>>
>>>>   The cause is that the leave message (memb_leave_message_send), which is sent when a user stops corosync, may be discarded.
>>>>
>>>>   static int net_deliver_fn (
>>>>   int fd,
>>>>   int revents,
>>>>   void *data)
>>>>   {
>>>>   struct totemudp_instance *instance = (struct totemudp_instance *)data;
>>>>   struct msghdr msg_recv;
>>>>   struct iovec *iovec;
>>>>   (snip)
>>>>   /*
>>>>    * Drop all non-mcast messages (more specifically join
>>>>    * messages should be dropped)
>>>>    */
>>>>   message_type = (char *)iovec->iov_base;
>>>>   if (instance->flushing == 1 && *message_type == MESSAGE_TYPE_MEMB_JOIN) {
>>>>   iovec->iov_len = FRAME_SIZE_MAX;
>>>>   return (0);
>>>>   }
>>>>   (snip)
>>>>
>>>>   We would like the leave to be handled reliably when a node stops.
>>>>   Is it possible to fix this handling in corosync?
>>>>    * We think it is a problem that the log reports the node we shut down as failed.
>>>
>>>  Yes, I think I can see what's happening here. JOIN messages get
>>>  discarded during flushing because that can cause entry into GATHER state
>>>  at an inappropriate time. For a normal JOIN message that's fine because
>>>  the joining node will re-send the message. But this also causes LEAVE
>>>  messages to be discarded too (as they are a special case of JOIN). This
>>>  causes the error you are seeing.
>>>
>>>  The fix is non-trivial, sadly, but I'm looking into it
>>>
>>>  Thanks for the report!
>>>
>>>  Chrissie
>>>
>>>
>>>
>>>>   We hope that this problem can be fixed in the next version if possible.
>>>>
>>>>
>>>>   Best Regards,
>>>>   Hideo Yamauchi.
>>>>
>