Re: An issue about retransmit growing

On Tue, Mar 19, 2013 at 07:44:21AM -0700, Steven Dake wrote:
> On 03/19/2013 03:18 AM, Guangliang Zhao wrote:
> >Hi list,

Hi Steven,

Thanks for your reply.

> >
> >I hit an issue when testing corosync (v1.4.5) with 11 nodes. I am not very familiar with corosync, so please correct me if I am wrong. The steps are as follows:
> >
> >1. Make sure corosync debug is off.
> >2. Start openais on every node; all of them come up fine.
> >3. Stop openais on 5 nodes. This takes a very long time, and the retransmit list starts growing.
> >
> >I got the following log excerpt from one node via corosync-blackbox:
> >
> >rec=[79224] Tracing(1) Messsage=Received ringid(192.168.100.1:6564) seq 1fd
> >rec=[79225] Tracing(1) Messsage=Delivering 1fc to 1fd
> >rec=[79226] Tracing(1) Messsage=Delivering MCAST message with seq 1fd to pending delivery queue
> >rec=[79227] Tracing(1) Messsage=releasing messages up to and including 1fb
> >rec=[79228] Tracing(1) Messsage=releasing messages up to and including 1fd
> >rec=[79229] Log Message=got quorate request on 0x6d0980
> >rec=[79230] Log Message=got quorate request on 0x6d0980
> >rec=[79231] Log Message=Retransmit List 1
> >rec=[79232] Log Message=Retransmit List: 201
> >rec=[79233] Tracing(1) Messsage=mcasted message added to pending queue
> >rec=[79234] Log Message=Retransmit List 1
> >rec=[79235] Log Message=Retransmit List: 201
> >rec=[79236] Tracing(1) Messsage=Delivering 1fd to 205
> >rec=[79237] Tracing(1) Messsage=Received ringid(192.168.100.1:6564) seq 205
> >rec=[79238] Tracing(1) Messsage=Delivering 1fd to 205
> >rec=[79239] Log Message=Retransmit List 1
> >rec=[79240] Log Message=Retransmit List: 201
> >rec=[79241] Tracing(1) Messsage=Delivering 1fd to 205
> >rec=[79242] Log Message=Retransmit List 2
> >rec=[79243] Log Message=Retransmit List: 201 202
> >rec=[79244] Tracing(1) Messsage=Delivering 1fd to 205
> >rec=[79245] Log Message=Retransmit List 2
> >rec=[79246] Log Message=Retransmit List: 201 202
> >
> >There is a piece of code in exec/totemsrp.c:
> >
> >3775         if (range) {
> >3776                 TRACE1 ("Delivering %x to %x\n", instance->my_high_delivered,
> >3777                         end_point);
> >3778         }
> >
> >...
> >
> >3785         for (i = 1; i <= range; i++) {
> >3786
> >3787                 void *ptr = 0;
> >3788
> >3789                 /*
> >3790                  * If out of range of sort queue, stop assembly
> >3791                  */
> >3792                 res = sq_in_range (&instance->regular_sort_queue,
> >3793                         my_high_delivered_stored + i);
> >3794                 if (res == 0) {
> >3795                         break;
> >3796                 }
> >3797
> >3798                 res = sq_item_get (&instance->regular_sort_queue,
> >3799                         my_high_delivered_stored + i, &ptr);
> >3800                 /*
> >3801                  * If hole, stop assembly
> >3802                  */
> >3803                 if (res != 0 && skip == 0) {
> >3804                         break;
> >3805                 }
> >3806
> >3807                 instance->my_high_delivered = my_high_delivered_stored + i;
> >
> >...
> >
> >3841                 /*
> >3842                  * Message found
> >3843                  */
> >3844                 TRACE1 ("Delivering MCAST message with seq %x to pending delivery queue\n",
> >3845                         mcast_header.seq);
> >
> >From the log and the code, we can tell that messages 1fe, 1ff and 200 have not been delivered, so the loop must have exited through one of the two break statements.
> >
> >The first if only checks the seq id range, so the second one is the most likely suspect.
> >
> >include/corosync/sq.h:
> >
> >264 static inline unsigned int sq_item_get (
> >265         const struct sq *sq,
> >266         unsigned int seq_id,
> >267         void **sq_item_out)
> >
> >...
> >
> >286         if (sq->items_inuse[sq_position] == 0) {
> >287                 return (ENOENT);
> >288         }
> >
> >I think the items_inuse array may be cleared at some point, so the entry reads as 0 when we access it. However, I couldn't dig any deeper into this, so could anyone give me some hints?
> >
> 
> items_inuse[sq_position] should contain zero if there is no entry.
> If there is no entry, we want to stop processing in the above code
> because it is a hole in the messages.

If we want to skip the hole in the messages, I think my_high_delivered
(or some other state) should be updated, but it is not. So the code always
tries to deliver messages starting from my_high_delivered + 1 and never
succeeds, because message my_high_delivered + 1 is a hole?
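
To make the question concrete, here is a minimal sketch of the behaviour I
mean. It is a toy model under my own assumptions, not the corosync code:
toy_sq, toy_sq_add and toy_deliver are hypothetical names, the queue size is
arbitrary, and sequence number wraparound is ignored.

```c
#include <assert.h>
#include <string.h>

#define TOY_QUEUE_SIZE 16

/* Hypothetical stand-in for the sort queue; NOT corosync's struct sq. */
struct toy_sq {
	unsigned int head_seq;                      /* lowest seq the queue can hold */
	unsigned char items_inuse[TOY_QUEUE_SIZE];  /* 1 if the slot holds a message */
};

static void toy_sq_init (struct toy_sq *sq, unsigned int head_seq)
{
	sq->head_seq = head_seq;
	memset (sq->items_inuse, 0, sizeof (sq->items_inuse));
}

static int toy_sq_in_range (const struct toy_sq *sq, unsigned int seq)
{
	return (seq >= sq->head_seq && seq < sq->head_seq + TOY_QUEUE_SIZE);
}

/* Record that the message with this seq has been received. */
static void toy_sq_add (struct toy_sq *sq, unsigned int seq)
{
	assert (toy_sq_in_range (sq, seq));
	sq->items_inuse[seq % TOY_QUEUE_SIZE] = 1;
}

/*
 * In-order delivery: walk from high_delivered + 1 up to end_point and
 * stop at the first seq that is out of range or missing (a hole).
 * Returns the new high-delivered value.
 */
static unsigned int toy_deliver (const struct toy_sq *sq,
	unsigned int high_delivered, unsigned int end_point)
{
	unsigned int seq;

	for (seq = high_delivered + 1; seq <= end_point; seq++) {
		if (!toy_sq_in_range (sq, seq)) {
			break;	/* out of range of the queue */
		}
		if (sq->items_inuse[seq % TOY_QUEUE_SIZE] == 0) {
			break;	/* hole: nothing past it can be delivered yet */
		}
		high_delivered = seq;
	}
	return (high_delivered);
}
```

In this model, if 1fc and 1fd are present and then 203 to 205 arrive while
1fe to 202 are missing, every call to toy_deliver (&sq, 0x1fd, 0x205) returns
0x1fd again. The high-delivered value only moves once the missing messages
have been added, which would match the repeated "Delivering 1fd to 205" lines
if the retransmitted messages never arrive.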

I collected the corosync-blackbox output from one of the nodes, but it is a
pretty big log. I can send it as an attachment in my next mail if you need it.

> 
> The sort queue is a circular array which is cleared as
> sq_item_release is called.  This should only occur after the message
> has been delivered to all nodes on the ring in
> totemsrp.c:messages_free.
> 
> Regards
> -steve
> 
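
For reference, this is how I currently picture the release behaviour you
describe, written as a toy model. The names toy_ring and
toy_ring_release_up_to are hypothetical and the code is not taken from sq.h
or totemsrp.c; it also assumes sequence numbers never wrap around.

```c
#include <assert.h>
#include <string.h>

#define TOY_RING_SLOTS 8

/* Hypothetical circular sort queue; NOT corosync's struct sq. */
struct toy_ring {
	unsigned int head_seq;                      /* lowest seq still held */
	unsigned char items_inuse[TOY_RING_SLOTS];  /* 1 if the slot is occupied */
};

static void toy_ring_init (struct toy_ring *r, unsigned int head_seq)
{
	r->head_seq = head_seq;
	memset (r->items_inuse, 0, sizeof (r->items_inuse));
}

static int toy_ring_in_range (const struct toy_ring *r, unsigned int seq)
{
	return (seq >= r->head_seq && seq < r->head_seq + TOY_RING_SLOTS);
}

static void toy_ring_add (struct toy_ring *r, unsigned int seq)
{
	assert (toy_ring_in_range (r, seq));
	r->items_inuse[seq % TOY_RING_SLOTS] = 1;
}

static int toy_ring_inuse (const struct toy_ring *r, unsigned int seq)
{
	return (toy_ring_in_range (r, seq) &&
		r->items_inuse[seq % TOY_RING_SLOTS] != 0);
}

/*
 * Release every message up to and including seq: clear its in-use flag
 * and advance the head, so the slot can be reused by a later seq.
 * In this model the caller is expected to do this only once all nodes
 * on the ring have the message.
 */
static void toy_ring_release_up_to (struct toy_ring *r, unsigned int seq)
{
	while (r->head_seq <= seq) {
		r->items_inuse[r->head_seq % TOY_RING_SLOTS] = 0;
		r->head_seq++;
	}
}
```

In this model a cleared flag is seen only for a seq that was never received
or for one that has already been released. A released seq is also below the
head, so the range check rejects it before the flag is ever read.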

-- 
Best regards,
Guangliang