Re: totemsrp: Cancel token holding while in retransmition

Christine Caulfield <ccaulfie@xxxxxxxxxx> · Thu, 07 Aug 2014 14:35:48 +0100

On 06/08/14 02:09, jason wrote:
Sorry, the previous patch is wrong. Here is the correction.

That looks good to me and, I think, the best solution. It seems to be 
decidedly non-trivial to determine if retransmits are present when going 
into hold.

Thanks!
Chrissie

On Aug 5, 2014 10:18 PM, "Christine Caulfield" <ccaulfie@xxxxxxxxxx> wrote:

    Hi Jason,

    Thanks for testing that - and the extra info. I'll have another
    think then. If I can't come up with anything more we might go with
    your patch.

    Chrissie

    On 05/08/14 13:01, jason wrote:

        Hi Christine,
        I have tested your patch but it does not solve my problem. By adding
        printf, I found that whether or not retransmission occurred in my
        test case, the retrans_message_queue is always empty. It seems that
        the retrans_message_queue is used only in the recovery state?

        On Aug 5, 2014 3:50 PM, "Christine Caulfield" <ccaulfie@xxxxxxxxxx> wrote:

             On 01/08/14 10:50, Christine Caulfield wrote:

                 On 01/08/14 10:42, Jan Friesse wrote:

                     Jason,

                         Hi All,

                         I have encountered a problem: when there is no other
                         activity on the ring but only retransmission, and the
                         token is in hold mode, the retransmission becomes
                         slow. Moreover, if the retransmission always fails but
                         token

                     Yes

                         rotation works well, then it takes quite a long time
                         (fail_to_recv_const * token_hold = 2500 * 180ms =
                         450sec) for the retransmitting node to meet the
                         "FAILED TO RECEIVE" condition and re-construct a new
                         ring. This can be reproduced by the following steps:

                               1) Create a two-node cluster in udpu transport mode.
                               2) Wait until there is no other activity on the ring.
                               3) One, or both nodes delete each other in nodelist in
                               corosync.conf
                               4) corosync-cfgtool -R, this can cause a message
                               retransmission, but I am not sure why.
                               5) Since token rotation still works well, but the
                               retransmission can not be satisfied due to node
                               deletion, only the "FAILED TO RECEIVE" condition
                               can form a new ring. But we need to wait 450
                               seconds for it to happen. During this wait, we
                               saw the following logs:

                     This is really weird case.

                               Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
                               Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
                               Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
                               Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
                               Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
                               ...
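
The 450-second wait quoted in the report follows directly from the two
figures given there. A minimal sketch of that arithmetic in Python, assuming
one token hold period elapses per unsatisfied retransmit round (the function
name is hypothetical; only the 2500 and 180 ms values come from this thread):

```python
def failed_to_receive_wait_seconds(fail_to_recv_const: int, token_hold_ms: int) -> float:
    """Approximate wait before the "FAILED TO RECEIVE" condition is met.

    Assumption: each of the `fail_to_recv_const` rounds costs one full
    token hold period, and the token rotation time itself is negligible.
    """
    return fail_to_recv_const * token_hold_ms / 1000.0


# Values from the report: 2500 rounds, 180 ms hold.
print(failed_to_receive_wait_seconds(2500, 180))  # 450.0
```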

                         This problem can be solved by calling
                         token_hold_cancel_send() in both the retransmission
                         request and response conditions in orf_token_rtr() to
                         speed up retransmission. I created a patch below, any
                         comments?
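
To see why cancelling the hold would shorten the wait, the behaviour described
in the report can be approximated with a toy model. This is not corosync code:
the function, its parameters, and the assumption that a held token adds one
hold period to every rotation are all hypothetical simplifications.

```python
def time_to_fail_ms(fail_to_recv_const: int, token_hold_ms: int,
                    rotation_ms: int, cancel_hold_on_retransmit: bool) -> int:
    """Toy model of the time until the "FAILED TO RECEIVE" threshold is hit.

    Each loop iteration is one token rotation that carries an unsatisfied
    retransmit request. If the hold is not cancelled, the rotation is
    delayed by one hold period (assumption, not measured behaviour).
    """
    elapsed = 0
    for _ in range(fail_to_recv_const):
        elapsed += rotation_ms
        if not cancel_hold_on_retransmit:
            elapsed += token_hold_ms
    return elapsed


# Hold left in place vs. hold cancelled while retransmits are pending.
# The 1 ms rotation time is an arbitrary illustrative value.
held = time_to_fail_ms(2500, 180, 1, cancel_hold_on_retransmit=False)
cancelled = time_to_fail_ms(2500, 180, 1, cancel_hold_on_retransmit=True)
print(held, cancelled)
```

In this model the hold period dominates the total, so removing it while
retransmits are outstanding is what brings the wait down.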

                     Ok. The patch looks fine, but during review I had another
                     idea. What about prohibiting the start of hold mode when
                     there are messages to retransmit? Such a solution may be
                     cleaner, don't you think?
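
The alternative raised here, refusing to enter hold mode while retransmits are
outstanding, could look roughly like the check below. It is only a sketch with
hypothetical names (`RingState`, `may_start_hold`), and it assumes the pending
retransmit list is readily available at the point where hold is decided; a
later reply in this thread notes that determining this is non-trivial in the
real code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RingState:
    """Hypothetical snapshot of what a node knows when the token arrives."""
    pending_messages: int = 0                                   # new messages waiting to be sent
    retransmit_list: List[int] = field(default_factory=list)   # sequence numbers still missing


def may_start_hold(state: RingState) -> bool:
    """Allow hold mode only when the ring is idle AND nothing needs retransmitting."""
    return state.pending_messages == 0 and not state.retransmit_list


print(may_start_hold(RingState()))                        # idle ring
print(may_start_hold(RingState(retransmit_list=[0xE])))   # retransmit outstanding
```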

                     Anyway, this is a change in a very critical part of the
                     code, so Chrissie, can you please take a look at the patch
                     and give your opinion?

                 I was looking it over yesterday. It's a problem I have
                 definitely seen myself on some VM systems, so it's certainly
                 not an isolated case. I think Honza is right that there might
                 be a better way of fixing it, so I'll have a look.

                 Chrissie

             Annoyingly my common reproducer seems not to be working and I
             can't get yours to make it happen either. If you can still
             reproduce it, could you try this patch for me please?

             Chrissie

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss