Re: It is sometimes judged to be node trouble.


 



Hi Christine,


Thank you for your comments.

> I've been looking at it and it's very, very complex to fix. The code you
> mentioned that discards the LEAVE (and JOIN) messages was put there to
> avoid a nasty crash and I'm struggling to find a solution that doesn't
> just re-introduce that bug. As you suggest, a message might be helpful
> though.



Yes.
I understand that this fix is very difficult, as you say.

A log message would certainly help a system administrator.
 * From it, the administrator could infer that a discarded LEAVE message (at least) may be the cause of "A processor failed...".
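
As a rough illustration of the log message discussed here, the drop check could report each membership message it throws away. This is only a sketch under stated assumptions: `sketch_instance`, `sketch_drop_during_flush`, the `SKETCH_MSG_TYPE_MEMB_JOIN` value, and the message wording are all hypothetical stand-ins, not the real corosync definitions, and a plain `FILE *` replaces corosync's own logging.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical placeholder value; not taken from the corosync headers. */
#define SKETCH_MSG_TYPE_MEMB_JOIN 3

/* Hypothetical stand-in for the transport instance. */
struct sketch_instance {
	int flushing;          /* 1 while the receive path is being flushed */
	unsigned long dropped; /* membership messages discarded so far */
};

/*
 * Returns 1 if the frame is discarded, 0 if it should be delivered.
 * The first byte of the frame is treated as the message type, as in
 * the quoted net_deliver_fn() excerpt.
 */
int sketch_drop_during_flush(struct sketch_instance *inst,
	const char *frame, size_t len, FILE *log)
{
	assert(inst != NULL && frame != NULL && log != NULL);

	if (len == 0) {
		return 0;
	}
	if (inst->flushing == 1 && frame[0] == SKETCH_MSG_TYPE_MEMB_JOIN) {
		inst->dropped++;
		/* LEAVE is a special case of JOIN, so it is dropped here too. */
		fprintf(log, "JOIN or LEAVE message discarded during flush "
			"(%lu so far)\n", inst->dropped);
		return 1;
	}
	return 0;
}
```

With a line like this in the log, an administrator who later sees "A processor failed" has a hint that a LEAVE may have been discarded. It does not change the membership behaviour itself.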

Best Regards,
Hideo Yamauchi.


----- Original Message -----
> From: Christine Caulfield <ccaulfie@xxxxxxxxxx>
> To: discuss@xxxxxxxxxxxx
> Cc: 
> Date: 2015/4/17, Fri 16:52
> Subject: Re:  It is sometimes judged to be node trouble.
> 
> On 16/04/15 07:12, renayama19661014@xxxxxxxxx wrote:
>>  Hi Christine,
>> 
>>  Has there been any progress on a fix for this problem?
>> 
>>  We think it is a serious problem that this log confuses the system administrator.
>>   * The problem occurs rarely, but its impact on the administrator is large when it does.
> 
> I've been looking at it and it's very, very complex to fix. The code you
> mentioned that discards the LEAVE (and JOIN) messages was put there to
> avoid a nasty crash and I'm struggling to find a solution that doesn't
> just re-introduce that bug. As you suggest, a message might be helpful
> though.
> 
> Chrissie
> 
>>  Even if a proper fix is difficult, a log message stating that the LEAVE message was discarded would spare the administrator this confusion for now.
>> 
>> 
>>>>   static int net_deliver_fn (
>>>>   int fd,
>>>>   int revents,
>>>>   void *data)
>>>>   {
>>>>   struct totemudp_instance *instance = (struct totemudp_instance *)data;
>>>>   struct msghdr msg_recv;
>>>>   struct iovec *iovec;
>>>>   (snip)
>>>>   /*
>>>>    * Drop all non-mcast messages (more specifically join
>>>>    * messages should be dropped)
>>>>    */
>>>>   message_type = (char *)iovec->iov_base;
>>>>   if (instance->flushing == 1 && *message_type == MESSAGE_TYPE_MEMB_JOIN) {
>>>>   iovec->iov_len = FRAME_SIZE_MAX;
>> 
>>   ------> I think that some kind of log should appear here.
>> 
>>>>   return (0);
>>>>   }
>>>>   (snip)
>> 
>>  Best Regards,
>>  Hideo Yamauchi.
>> 
>> 
>> 
>>  ----- Original Message -----
>>>  From: "renayama19661014@xxxxxxxxx" <renayama19661014@xxxxxxxxx>
>>>  To: Christine Caulfield <ccaulfie@xxxxxxxxxx>; "discuss@xxxxxxxxxxxx" <discuss@xxxxxxxxxxxx>
>>>  Cc: 
>>>  Date: 2015/3/11, Wed 06:46
>>>  Subject: Re:  It is sometimes judged to be node trouble.
>>> 
>>>  Hi Christine,
>>> 
>>>  Thank you for your comments!
>>> 
>>>>   Yes, I think I can see what's happening here. JOIN messages get
>>>>   discarded during flushing because that can cause entry into GATHER state
>>>>   at an inappropriate time. For a normal JOIN message that's fine because
>>>>   the joining node will re-send the message. But this also causes LEAVE
>>>>   messages to be discarded too (as they are a special case of JOIN). This
>>>>   causes the error you are seeing.
>>>>   
>>>>   The fix is non-trivial, sadly, but I'm looking into it
>>> 
>>> 
>>>  I imagine the fix for this problem is a large one.
>>>  However, we would like a node that leaves the cluster not to be judged as failed.
>>>  We hope it can be fixed.
>>> 
>>>  Best Regards,
>>>  Hideo Yamauchi.
>>> 
>>> 
>>> 
>>> 
>>>  ----- Original Message -----
>>>>   From: Christine Caulfield <ccaulfie@xxxxxxxxxx>
>>>>   To: discuss@xxxxxxxxxxxx
>>>>   Cc: 
>>>>   Date: 2015/3/10, Tue 21:55
>>>>   Subject: Re:  It is sometimes judged to be node trouble.
>>>> 
>>>>   On 09/03/15 01:09, renayama19661014@xxxxxxxxx wrote:
>>>>>    Hi All,
>>>>> 
>>>>>    We built a cluster with corosync.
>>>>>    Afterwards, we shut down one node.
>>>>> 
>>>>>    The node that we shut down is then sometimes judged by the cluster to have failed.
>>>>> 
>>>>>    ---------------------------------------
>>>>>    Oct 21 11:03:30 XXX corosync[21677]: [TOTEM ] A processor failed, forming new configuration.
>>>>>    ---------------------------------------
>>>>> 
>>>>>    This phenomenon seems to occur with very low probability.
>>>>> 
>>>>>    We think it is a problem that the log reports the node we shut down as a failure.
>>>>> 
>>>>>    The cause is that the leave message (memb_leave_message_send), which is sent when a user stops corosync, may be discarded.
>>>>> 
>>>>>    static int net_deliver_fn (
>>>>>    int fd,
>>>>>    int revents,
>>>>>    void *data)
>>>>>    {
>>>>>    struct totemudp_instance *instance = (struct totemudp_instance *)data;
>>>>>    struct msghdr msg_recv;
>>>>>    struct iovec *iovec;
>>>>>    (snip)
>>>>>    /*
>>>>>     * Drop all non-mcast messages (more specifically join
>>>>>     * messages should be dropped)
>>>>>     */
>>>>>    message_type = (char *)iovec->iov_base;
>>>>>    if (instance->flushing == 1 && *message_type == MESSAGE_TYPE_MEMB_JOIN) {
>>>>>    iovec->iov_len = FRAME_SIZE_MAX;
>>>>>    return (0);
>>>>>    }
>>>>>    (snip)
>>>>> 
>>>>>    We would like the leave to be handled reliably so that the node stops cleanly.
>>>>>    Is it possible to fix how corosync handles this case?
>>>>>     * We think it is a problem that the log reports the node we shut down as a failure.
>>>> 
>>>>   Yes, I think I can see what's happening here. JOIN messages get
>>>>   discarded during flushing because that can cause entry into GATHER state
>>>>   at an inappropriate time. For a normal JOIN message that's fine because
>>>>   the joining node will re-send the message. But this also causes LEAVE
>>>>   messages to be discarded too (as they are a special case of JOIN). This
>>>>   causes the error you are seeing.
>>>> 
>>>>   The fix is non-trivial, sadly, but I'm looking into it
>>>> 
>>>>   Thanks for the report!
>>>> 
>>>>   Chrissie
>>>> 
>>>> 
>>>> 
>>>>>    We hope that this problem can be fixed in the next version, if possible.
>>>>> 
>>>>> 
>>>>>    Best Regards,
>>>>>    Hideo Yamauchi.
>>>>> 
>> 
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss
> 





