Hi Chrissie,

By studying your patch, I created a new patch that solves my problem and better matches your point (a sketch of the idea follows the quoted thread below). Please review it. Thanks!

On Tue, Aug 5, 2014 at 10:18 PM, Christine Caulfield <ccaulfie@xxxxxxxxxx> wrote:
> Hi Jason,
>
> Thanks for testing that - and the extra info. I'll have another think then.
> If I can't come up with anything more we might go with your patch.
>
> Chrissie
>
>
> On 05/08/14 13:01, jason wrote:
>>
>> Hi Christine,
>>
>> I have tested your patch, but it does not solve my problem. By adding
>> printf, I found that whether or not retransmission occurred in my test
>> case, the retrans_message_queue is always empty. It seems that
>> retrans_message_queue is used only in the recovery state?
>>
>> On Aug 5, 2014 3:50 PM, "Christine Caulfield" <ccaulfie@xxxxxxxxxx> wrote:
>>
>>     On 01/08/14 10:50, Christine Caulfield wrote:
>>
>>         On 01/08/14 10:42, Jan Friesse wrote:
>>
>>             Jason,
>>
>>                 Hi All,
>>
>>                 I have encountered a problem: when there is no other
>>                 activity on the ring except retransmission, and the
>>                 token is in hold mode, retransmission becomes slow.
>>                 Moreover, if retransmission always fails but token
>>
>>             Yes
>>
>>                 rotation works well, then it takes quite a long time
>>                 (fail_to_recv_const * token_hold = 2500 * 180 ms = 450 s)
>>                 for the retransmitting node to meet the "FAILED TO
>>                 RECEIVE" condition and re-form a new ring. This can be
>>                 reproduced by the following steps:
>>
>>                 1) Create a two-node cluster in udpu transport mode.
>>                 2) Wait until there is no other activity on the ring.
>>                 3) On one node, or both, delete the other node from
>>                    the nodelist in corosync.conf.
>>                 4) Run corosync-cfgtool -R; this causes a message
>>                    retransmission, but I am not sure why.
>>                 5) Token rotation still works, but the retransmission
>>                    cannot succeed because of the node deletion, so only
>>                    the "FAILED TO RECEIVE" condition can form a new
>>                    ring. But we need to wait 450 seconds for it to
>>                    happen. During this wait, we saw the following logs:
>>
>>             This is a really weird case.
>>
>>                 Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
>>                 Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
>>                 Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
>>                 Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
>>                 Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
>>                 ...
>>
>>                 This problem can be solved by adding
>>                 token_hold_cancel_send() in both the retransmission
>>                 request and response conditions in orf_token_rtr() to
>>                 speed up retransmission. I created a patch below; any
>>                 comments?
>>
>>             Ok. The patch looks fine, but during review I had another
>>             idea: what about prohibiting the start of hold mode while
>>             there are messages to retransmit? Such a solution may be
>>             cleaner, isn't it?
>>
>>             Anyway, this is a change in a very critical part of the
>>             code, so Chrissie, can you please take a look at the patch
>>             and give your opinion?
>>
>>         I've been looking it over since yesterday. It's a problem I
>>         have definitely seen myself on some VM systems, so it's
>>         certainly not an isolated case. I think Honza is right that
>>         there might be a better way of fixing it, so I'll have a look.
>>
>>         Chrissie
>>
>>     Annoyingly, my usual reproducer seems not to be working and I
>>     can't get yours to make it happen either. If you can still
>>     reproduce it, could you try this patch for me please?
>>
>>     Chrissie
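Since not everyone may open the attachment, here is a minimal,
self-contained sketch of the idea. This is not the real totemsrp
code: the struct, its fields, and the function bodies are
illustrative stand-ins; only the names orf_token_rtr() and
token_hold_cancel_send() come from the actual source.

/*
 * Sketch: cancel token holding whenever orf_token_rtr() either
 * requests retransmission (adds entries to the token's rtr list)
 * or responds to it (retransmits messages named in the list), so a
 * ring whose only traffic is retransmission leaves hold mode
 * instead of retransmitting once per token_hold (180 ms).
 */
#include <stdio.h>

struct instance {
        int hold_mode;          /* 1 while the token is being held */
        int rtr_list_entries;   /* entries in the token's rtr list */
};

/* Stand-in for totemsrp's token_hold_cancel_send(): in the real
 * code this multicasts a "cancel token hold" message to the ring. */
static void token_hold_cancel_send(struct instance *inst)
{
        if (inst->hold_mode) {
                inst->hold_mode = 0;
                printf("token hold cancelled\n");
        }
}

/* Simplified stand-in for the retransmit handling in orf_token_rtr(). */
static void orf_token_rtr(struct instance *inst,
                          int missing_msgs, int msgs_to_retransmit)
{
        if (msgs_to_retransmit > 0) {
                /* response path: retransmit the listed messages ... */
                token_hold_cancel_send(inst);   /* added by the patch */
        }
        if (missing_msgs > 0) {
                /* request path: add our missing seq numbers to the list */
                inst->rtr_list_entries += missing_msgs;
                token_hold_cancel_send(inst);   /* added by the patch */
        }
}

int main(void)
{
        struct instance inst = { .hold_mode = 1, .rtr_list_entries = 0 };

        orf_token_rtr(&inst, 1 /* missing */, 0 /* to retransmit */);
        return 0;
}

Honza's alternative would leave orf_token_rtr() untouched and instead
refuse to enter hold mode while the rtr list is non-empty; both
approaches stop retransmission from being throttled to once per
token_hold interval.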
-- 
Yours,
Jason
Attachment:
0001-totemsrp-Cancel-token-holding-while-in-retransmition.patch
Description: Binary data
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss