Re: totemsrp: Cancel token holding while in retransmition

jason <huzhijiang@xxxxxxxxx> · Wed, 6 Aug 2014 00:53:56 +0800

Hi Chrissie,

After studying your patch, I created a new patch that solves my
problem and is a bit closer to your approach. Please review it. Thanks!
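
As a quick check of the 450-second figure in the report quoted below, the
arithmetic can be written out in a few lines of C. This is only a sketch:
the two values are the ones given in the report, and the macro and
function names are made up for illustration; they are not corosync's
internal identifiers.

```c
/* Illustrative values taken from the report quoted below. The names are
 * hypothetical and do not correspond to corosync's internal identifiers. */
#define TOY_FAIL_TO_RECV_CONST 2500u  /* token rotations before "FAILED TO RECEIVE" */
#define TOY_TOKEN_HOLD_MS      180u   /* token hold time per rotation, in ms */

/* Worst-case wait in milliseconds when every token rotation is delayed by
 * the full hold time. */
static unsigned long toy_failed_to_receive_wait_ms(unsigned int rotations,
                                                   unsigned int hold_ms)
{
	return (unsigned long)rotations * hold_ms;
}
```

Calling toy_failed_to_receive_wait_ms(TOY_FAIL_TO_RECV_CONST,
TOY_TOKEN_HOLD_MS) gives 450000 ms, i.e. the 450 seconds mentioned in the
report.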

On Tue, Aug 5, 2014 at 10:18 PM, Christine Caulfield
<ccaulfie@xxxxxxxxxx> wrote:
> Hi Jason,
>
> Thanks for testing that - and the extra info. I'll have another think then.
> If I can't come up with anything more we might go with your patch.
>
> Chrissie
>
>
> On 05/08/14 13:01, jason wrote:
>>
>> Hi Christine,
>> I have tested your patch, but it does not solve my problem. By adding
>> printf calls, I found that whether or not retransmission occurred in my
>> test case, the retrans_message_queue is always empty. It seems that
>> the retrans_message_queue is used only in the recovery state?
>>
>> On Aug 5, 2014 3:50 PM, "Christine Caulfield" <ccaulfie@xxxxxxxxxx> wrote:
>>
>>     On 01/08/14 10:50, Christine Caulfield wrote:
>>
>>         On 01/08/14 10:42, Jan Friesse wrote:
>>
>>             Jason,
>>
>>
>>                 Hi All,
>>
>>                 I have encountered a problem: when there is no other
>>                 activity on the ring except retransmission, and the
>>                 token is in hold mode, retransmission becomes slow.
>>                 Moreover, if retransmission always fails but token
>>
>>
>>             Yes
>>
>>                 rotation works well,
>>                 then it takes quite a long time (fail_to_recv_const *
>>                 token_hold = 2500 * 180ms = 450sec) for the
>>                 retransmitting node to meet the "FAILED TO RECEIVE"
>>                 condition and re-construct a new ring. This can be
>>                 reproduced by the following steps:
>>
>>                       1) Create a two-node cluster in udpu transport mode.
>>                       2) Wait until there is no other activity on the ring.
>>                       3) On one or both nodes, delete the other node from
>>                       the nodelist in corosync.conf.
>>                       4) Run corosync-cfgtool -R. This can cause a message
>>                       retransmission, but I am not sure why.
>>                       5) Token rotation still works well, but the
>>                       retransmission cannot be satisfied because of the
>>                       node deletion, so only the "FAILED TO RECEIVE"
>>                       condition can form a new ring. But we need to wait
>>                       450 seconds for it to happen. During this wait,
>>                       we saw the following logs:
>>
>>
>>             This is a really weird case.
>>
>>                       Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
>>                       Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
>>                       Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
>>                       Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
>>                       Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
>>                       ...
>>
>>
>>                 This problem can be solved by calling
>>                 token_hold_cancel_send() in both the retransmission
>>                 request and response conditions in orf_token_rtr() to
>>                 speed up retransmission. I created the patch below. Any
>>                 comments?
>>
>>
>>             OK. The patch looks fine, but during review I had another
>>             idea. What about prohibiting hold mode from starting when
>>             there are messages to retransmit? Such a solution may be
>>             cleaner, don't you think?
>>
>>             Anyway, this is a change in a very critical part of the
>>             code, so Chrissie, can you please take a look at the patch
>>             and give your opinion?
>>
>>
>>
>>         I was looking it over yesterday. It's a problem I have
>>         definitely seen myself on some VM systems, so it's certainly
>>         not an isolated case. I think Honza is right that there might
>>         be a better way of fixing it, so I'll have a look.
>>
>>         Chrissie
>>
>>
>>
>>     Annoyingly my common reproducer seems not to be working and I can't
>>     get yours to make it happen either. If you can still reproduce it
>>     could you try this patch for me please?
>>
>>     Chrissie
>>
>>
>>     _______________________________________________
>>     discuss mailing list
>>     discuss@xxxxxxxxxxxx
>>     http://lists.corosync.org/mailman/listinfo/discuss
>>
>
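
The alternative raised in the thread above, not entering hold mode while
retransmissions are outstanding, can be illustrated with a toy predicate.
This is a minimal sketch under stated assumptions: the struct, its fields,
and the function name are hypothetical and greatly simplified. It is not
the attached patch and does not reflect totemsrp's real data structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of the ring state as seen by the token
 * holder. These are NOT corosync's real structures. */
struct toy_ring_state {
	size_t pending_regular_msgs;  /* new messages waiting to be sent */
	size_t retransmit_requests;   /* entries on the token's retransmit list */
};

/* Returns true only when the ring is truly idle: no new traffic and no
 * outstanding retransmit request. With any retransmit request pending,
 * the token keeps rotating at full speed instead of being held. */
static bool toy_may_hold_token(const struct toy_ring_state *s)
{
	return s->pending_regular_msgs == 0 && s->retransmit_requests == 0;
}
```

With such a check, a ring that has no new messages but still has one
retransmit request would not be treated as idle, so the retransmission
would not be slowed down by the hold timer.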

-- 
Yours,
Jason
Attachment:
0001-totemsrp-Cancel-token-holding-while-in-retransmition.patch

Description: Binary data