On 06/08/14 02:09, jason wrote:
Sorry, the previous patch is wrong. Here is the correction.
That looks good to me and, I think, the best solution. It seems to be decidedly non-trivial to determine if retransmits are present when going into hold.
Thanks! Chrissie
On Aug 5, 2014 10:18 PM, "Christine Caulfield" <ccaulfie@xxxxxxxxxx> wrote:

Hi Jason,

Thanks for testing that - and the extra info. I'll have another think then. If I can't come up with anything more we might go with your patch.

Chrissie

On 05/08/14 13:01, jason wrote:

Hi Christine,

I have tested your patch, but it does not solve my problem. By adding printf calls, I found that retrans_message_queue is always empty, whether or not retransmission occurred in my test case. It seems that retrans_message_queue is used only in the recovery state?

On Aug 5, 2014 3:50 PM, "Christine Caulfield" <ccaulfie@xxxxxxxxxx> wrote:

On 01/08/14 10:50, Christine Caulfield wrote:

On 01/08/14 10:42, Jan Friesse wrote:

Jason,

Hi All,

I have encountered a problem: when there is no activity on the ring other than retransmission, and the token is in hold mode, retransmission becomes slow. Moreover, if the retransmission always fails but token rotation works well, it takes quite a long time (fail_to_recv_const * token_hold = 2500 * 180ms = 450sec) for the retransmitting node to meet the "FAILED TO RECEIVE" condition and form a new ring.

Yes

This can be reproduced by the following steps:

1) Create a two-node cluster in udpu transport mode.
2) Wait until there is no other activity on the ring.
3) On one or both nodes, delete the other node from the nodelist in corosync.conf.
4) Run corosync-cfgtool -R. This can cause a message retransmission, but I am not sure why.
5) Token rotation still works well, but the retransmission cannot be satisfied because of the node deletion, so only the "FAILED TO RECEIVE" condition can form a new ring. But we need to wait 450 seconds for it to happen.

This is a really weird case.

During this wait, we saw the following logs:
Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
...

This problem can be solved by calling token_hold_cancel_send() in both the retransmission request and response conditions in orf_token_rtr() to speed up retransmission. I created a patch below, any comments?

Ok. The patch looks fine, but during review I had another idea. What about prohibiting hold mode from starting when there are messages to retransmit? Such a solution may be cleaner, don't you think? Anyway, this is a change in a very critical part of the code, so Chrissie, can you please take a look at the patch and give your opinion?

I was looking it over yesterday. It's a problem I have definitely seen myself on some VM systems, so it's certainly not an isolated case. I think Honza is right that there might be a better way of fixing it, so I'll have a look.

Chrissie

Annoyingly, my common reproducer seems not to be working and I can't get yours to make it happen either. If you can still reproduce it, could you try this patch for me please?

Chrissie
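For readers following the numbers in the thread, the 450-second figure (fail_to_recv_const * token_hold = 2500 * 180ms) can be reproduced with a small toy model. This is only a sketch and not corosync code: the function names are hypothetical, the 1 ms rotation time for a ring that is not holding the token is an assumed placeholder, and the model assumes each token rotation counts once toward fail_to_recv_const.

```python
# Toy model of the delay discussed in the thread. NOT corosync code.
# All names here are illustrative; only the two constants come from the thread.

FAIL_TO_RECV_CONST = 2500  # rotations without progress before "FAILED TO RECEIVE"
TOKEN_HOLD_MS = 180.0      # time the token is held per rotation on an idle ring


def rotation_time_ms(retransmit_pending: bool,
                     cancel_hold_on_retransmit: bool,
                     hold_ms: float = TOKEN_HOLD_MS,
                     fast_ms: float = 1.0) -> float:
    """Per-rotation time on an otherwise idle ring.

    fast_ms is an assumed value for a rotation with no hold applied.
    """
    if retransmit_pending and cancel_hold_on_retransmit:
        # Models the idea of the proposed fix: pending retransmits cancel the hold.
        return fast_ms
    return hold_ms


def seconds_until_failed_to_receive(rotation_ms: float,
                                    fail_to_recv_const: int = FAIL_TO_RECV_CONST) -> float:
    """Time needed for fail_to_recv_const rotations at rotation_ms each."""
    return fail_to_recv_const * rotation_ms / 1000.0


if __name__ == "__main__":
    held = seconds_until_failed_to_receive(rotation_time_ms(True, False))
    cancelled = seconds_until_failed_to_receive(rotation_time_ms(True, True))
    print(f"hold kept:      {held} s")       # 450.0 s, matching the thread
    print(f"hold cancelled: {cancelled} s")  # 2.5 s under the assumed 1 ms rotation
```

Under these assumptions, the wait depends almost entirely on whether the hold is applied while retransmits are outstanding, which is the point both proposed fixes address.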
_______________________________________________ discuss mailing list discuss@xxxxxxxxxxxx http://lists.corosync.org/mailman/listinfo/discuss