Re: totemsrp: Cancel token holding while in retransmition

jason <huzhijiang@xxxxxxxxx> · Fri, 8 Aug 2014 11:31:42 +0800

Hi Chrissie,

Thanks, I will send a mail to this mailing list about this patch.
On Aug 7, 2014 9:35 PM, "Christine Caulfield" <ccaulfie@xxxxxxxxxx> wrote:

On 06/08/14 02:09, jason wrote:

Sorry, the previous patch is wrong. Here is the correction.

That looks good to me and, I think, the best solution. It seems to be decidedly non-trivial to determine if retransmits are present when going into hold.

Thanks!

Chrissie

On Aug 5, 2014 10:18 PM, "Christine Caulfield" <ccaulfie@xxxxxxxxxx> wrote:

    Hi Jason,

    Thanks for testing that - and the extra info. I'll have another
    think then. If I can't come up with anything more we might go with
    your patch.

    Chrissie

    On 05/08/14 13:01, jason wrote:

        Hi Christine,

        I have tested your patch, but it does not solve my problem. By
        adding printf calls, I found that the retrans_message_queue is
        always empty, whether or not a retransmission occurs in my test
        case. It seems that the retrans_message_queue is used only in the
        recovery state?

        On Aug 5, 2014 3:50 PM, "Christine Caulfield" <ccaulfie@xxxxxxxxxx> wrote:

             On 01/08/14 10:50, Christine Caulfield wrote:

                 On 01/08/14 10:42, Jan Friesse wrote:

                     Jason,

                         Hi All,

                         I have encountered a problem: when there is no activity
                         on the ring other than retransmission, and the token is
                         in hold mode, the retransmission becomes slow. Moreover,
                         if the retransmission always fails but token

                     Yes

                         rotation works well, then it takes quite a long time
                         (fail_to_recv_const * token_hold = 2500 * 180 ms =
                         450 sec) for the retransmitting node to meet the
                         "FAILED TO RECEIVE" condition and re-construct a new
                         ring. This can be reproduced by the following steps:

                               1) Create a two-node cluster in udpu transport mode.
                               2) Wait until there is no other activity on the ring.
                               3) On one or both nodes, delete the other node from
                                  the nodelist in corosync.conf.
                               4) Run corosync-cfgtool -R. This can cause a message
                                  retransmission, but I am not sure why.
                               5) Token rotation still works well, but the
                                  retransmission cannot be satisfied because of the
                                  node deletion, so only the "FAILED TO RECEIVE"
                                  condition can form a new ring. But we need to wait
                                  450 seconds for it to happen. During this wait,
                                  we saw the following logs:

                     This is really weird case.

                               Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
                               Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
                               Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
                               Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
                               Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
                               ...
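
The 450-second figure in the report follows directly from the two values it quotes. A minimal sketch of that arithmetic, assuming the token is held for the full token_hold interval on every rotation; the function and constant names are illustrative and are not corosync identifiers:

```python
# Values as quoted in the report; they are not read from a live configuration.
FAIL_TO_RECV_CONST = 2500  # rotations quoted as fail_to_recv_const
TOKEN_HOLD_MS = 180        # hold time per rotation quoted as token_hold


def failed_to_receive_wait_seconds(fail_to_recv_const, token_hold_ms):
    """Time until "FAILED TO RECEIVE" if every rotation waits out the hold."""
    return fail_to_recv_const * token_hold_ms / 1000.0


print(failed_to_receive_wait_seconds(FAIL_TO_RECV_CONST, TOKEN_HOLD_MS))  # 450.0
```

Any change that skips the hold while retransmits are outstanding shrinks the second factor, which is why the wait drops so sharply.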

                         This problem can be solved by calling
                         token_hold_cancel_send() in both the retransmission
                         request and response paths in orf_token_rtr() to speed
                         up retransmission. I created a patch below. Any comments?

                     OK. The patch looks fine, but during review I had another
                     idea. What about prohibiting the start of hold mode when
                     there are messages to retransmit? Such a solution may be
                     cleaner, don't you think?

                     Anyway, this is a change in a very critical part of the
                     code, so Chrissie, can you please take a look at the patch
                     and give your opinion?
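
The alternative suggested here, not entering hold mode while retransmits are outstanding, can be pictured with a toy model. This is only a sketch of the idea, not totemsrp code: should_hold_token, rotation_delay_ms, and their parameters are hypothetical, and the 180 ms default is the token_hold value quoted earlier in the thread.

```python
def should_hold_token(queued_messages, retransmit_list):
    """Toy rule: hold the token only when the ring is completely idle.

    Both parameters are hypothetical stand-ins, not totemsrp fields.
    """
    return queued_messages == 0 and not retransmit_list


def rotation_delay_ms(queued_messages, retransmit_list, token_hold_ms=180):
    """Extra delay added to one token rotation under the toy rule."""
    if should_hold_token(queued_messages, retransmit_list):
        return token_hold_ms
    return 0


# Idle ring: the hold delay applies.
print(rotation_delay_ms(0, []))     # 180
# One outstanding retransmit request ("e", as in the log): no hold.
print(rotation_delay_ms(0, ["e"]))  # 0
```

The model assumes the retransmit list is known at the moment the hold decision is made, which may not be true of the real code.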

                 I was looking it over yesterday. It's a problem I have
                 definitely seen myself on some VM systems, so it's certainly
                 not an isolated case. I think Honza is right that there might
                 be a better way of fixing it, so I'll have a look.

                 Chrissie

             Annoyingly my common reproducer seems not to be working and I
             can't get yours to make it happen either. If you can still
             reproduce it could you try this patch for me please?

             Chrissie


_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss