Re: totemsrp: Cancel token holding while in retransmission

On 01/08/14 10:42, Jan Friesse wrote:
Jason,


Hi All,

I have encountered a problem: when there is no activity on the ring other
than retransmission, and the token is in hold mode, retransmission becomes
slow. Moreover, if the retransmission always fails but token

Yes

rotation works well,
then it takes quite a long time (fail_to_recv_const * token_hold = 2500
* 180 ms = 450 sec) for the retransmitting node to meet the "FAILED TO
RECEIVE" condition and reconstruct a new ring. This can be reproduced by the
following steps:

     1) Create a two-node cluster in udpu transport mode.
     2) Wait until there is no other activity on the ring.
     3) On one or both nodes, delete the other node from the nodelist in
     corosync.conf.
     4) Run corosync-cfgtool -R. This can cause a message retransmission,
     but I am not sure why.
     5) Token rotation still works well, but the retransmission cannot be
     satisfied because of the node deletion, so only the "FAILED TO RECEIVE"
     condition can form a new ring. But we need to wait 450 seconds for it
     to happen. During this wait, we saw the following logs:


This is a really weird case.

     Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
     Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
     Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
     Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
     Jul 30 11:21:06 notice  [TOTEM ] Retransmit List: e
     ...


This problem can be solved by calling token_hold_cancel_send() in both the
retransmission request and response conditions in orf_token_rtr() to speed up
retransmission. I created the patch below. Any comments?


OK. The patch looks fine, but during review I had another idea. What about
prohibiting hold mode from starting when there are messages to retransmit?
Wouldn't such a solution be cleaner?

Anyway, this is a change in a very critical part of the code, so Chrissie,
can you please take a look at the patch and give your opinion?


I looked it over yesterday. It's a problem I have definitely seen myself on some VM systems, so it's certainly not an isolated case. I think Honza is right that there might be a better way of fixing it, so I'll have a look.

Chrissie

Regards,
   Honza


     Signed-off-by: Jason HU <huzhijiang@xxxxxxxxx>

------------------------------- exec/totemsrp.c -------------------------------
index dcda8d1..c227c44 100644
@@ -2672,6 +2672,7 @@ static int orf_token_rtr (

      strcpy (retransmit_msg, "Retransmit List: ");
      if (orf_token->rtr_list_entries) {
+        token_hold_cancel_send(instance);
          log_printf (instance->totemsrp_log_level_debug,
              "Retransmit List %d", orf_token->rtr_list_entries);
          for (i = 0; i < orf_token->rtr_list_entries; i++) {
@@ -2726,6 +2727,10 @@ static int orf_token_rtr (
      range = orf_token->seq - instance->my_aru;
      assert (range < QUEUE_RTR_ITEMS_SIZE_MAX);

+    if (range >= 1) {
+        token_hold_cancel_send(instance);
+    }
+
      for (i = 1; (orf_token->rtr_list_entries < RETRANSMIT_ENTRIES_MAX) &&
          (i <= range); i++) {





