Hi list,

I ran into an issue while testing corosync (v1.4.5) with 11 nodes. I am not very familiar with corosync, so please correct me if I am wrong.

Steps to reproduce:

1. Make sure corosync debug is off.
2. Start openais on every node; all of them come up fine.
3. Stop openais on 5 nodes. This takes a very long time, and the retransmit list starts growing.

Here is a piece of the log from one node, captured with corosync-blackbox:

    rec=[79224] Tracing(1) Messsage=Received ringid(192.168.100.1:6564) seq 1fd
    rec=[79225] Tracing(1) Messsage=Delivering 1fc to 1fd
    rec=[79226] Tracing(1) Messsage=Delivering MCAST message with seq 1fd to pending delivery queue
    rec=[79227] Tracing(1) Messsage=releasing messages up to and including 1fb
    rec=[79228] Tracing(1) Messsage=releasing messages up to and including 1fd
    rec=[79229] Log Message=got quorate request on 0x6d0980
    rec=[79230] Log Message=got quorate request on 0x6d0980
    rec=[79231] Log Message=Retransmit List 1
    rec=[79232] Log Message=Retransmit List: 201
    rec=[79233] Tracing(1) Messsage=mcasted message added to pending queue
    rec=[79234] Log Message=Retransmit List 1
    rec=[79235] Log Message=Retransmit List: 201
    rec=[79236] Tracing(1) Messsage=Delivering 1fd to 205
    rec=[79237] Tracing(1) Messsage=Received ringid(192.168.100.1:6564) seq 205
    rec=[79238] Tracing(1) Messsage=Delivering 1fd to 205
    rec=[79239] Log Message=Retransmit List 1
    rec=[79240] Log Message=Retransmit List: 201
    rec=[79241] Tracing(1) Messsage=Delivering 1fd to 205
    rec=[79242] Log Message=Retransmit List 2
    rec=[79243] Log Message=Retransmit List: 201 202
    rec=[79244] Tracing(1) Messsage=Delivering 1fd to 205
    rec=[79245] Log Message=Retransmit List 2
    rec=[79246] Log Message=Retransmit List: 201 202

Here is the relevant code in exec/totemsrp.c:

    if (range) {
        TRACE1 ("Delivering %x to %x\n", instance->my_high_delivered,
            end_point);
    }
    ...
    for (i = 1; i <= range; i++) {

        void *ptr = 0;

        /*
         * If out of range of sort queue, stop assembly
         */
        res = sq_in_range (&instance->regular_sort_queue,
            my_high_delivered_stored + i);
        if (res == 0) {
            break;
        }

        res = sq_item_get (&instance->regular_sort_queue,
            my_high_delivered_stored + i, &ptr);
        /*
         * If hole, stop assembly
         */
        if (res != 0 && skip == 0) {
            break;
        }

        instance->my_high_delivered = my_high_delivered_stored + i;
        ...
        /*
         * Message found
         */
        TRACE1 ("Delivering MCAST message with seq %x to pending delivery queue\n",
            mcast_header.seq);

From the log and the code, we can see that messages 1fe, 1ff and 200 were not delivered, so the loop must have exited through one of the two break statements. The first if only checks the seq id range, so the second one is the most suspect.

include/corosync/sq.h:

    static inline unsigned int sq_item_get (
        const struct sq *sq,
        unsigned int seq_id,
        void **sq_item_out)
    ...
        if (sq->items_inuse[sq_position] == 0) {
            return (ENOENT);
        }

I think the items_inuse array may sometimes be cleared, so that an entry reads 0 when we access it and sq_item_get returns ENOENT. However, I could not dig any deeper, so could anyone give me some hints?

--
Best regards,
Guangliang
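
P.S. To make the suspicion above concrete, here is a minimal sketch in C of how a cleared in-use flag would stall the delivery loop. It is only a toy model, not the corosync implementation: toy_sq, toy_mark, toy_item_get, toy_deliver and QUEUE_SIZE are hypothetical names made up for illustration, and the sq_in_range check and the skip flag from the real loop are left out.

```c
#include <assert.h>
#include <errno.h>

#define QUEUE_SIZE 16 /* hypothetical size, for illustration only */

/* Toy stand-in for struct sq: only the in-use flags are modelled. */
struct toy_sq {
    unsigned int head_seqid;
    unsigned char items_inuse[QUEUE_SIZE];
};

/* Mark a sequence number as present in the toy queue. */
static void toy_mark(struct toy_sq *sq, unsigned int seq_id)
{
    unsigned int pos = seq_id - sq->head_seqid;

    assert(pos < QUEUE_SIZE);
    sq->items_inuse[pos] = 1;
}

/* Like the quoted sq_item_get excerpt: ENOENT when the slot is not in use. */
static int toy_item_get(const struct toy_sq *sq, unsigned int seq_id)
{
    unsigned int pos = (seq_id - sq->head_seqid) % QUEUE_SIZE;

    return sq->items_inuse[pos] ? 0 : ENOENT;
}

/*
 * Walk from high_delivered + 1 up to end_point and stop at the first
 * hole. Returns the new high-delivered sequence number.
 */
static unsigned int toy_deliver(const struct toy_sq *sq,
                                unsigned int high_delivered,
                                unsigned int end_point)
{
    unsigned int stored = high_delivered;
    unsigned int range = end_point - high_delivered;
    unsigned int i;

    for (i = 1; i <= range; i++) {
        if (toy_item_get(sq, stored + i) != 0) {
            break; /* hole: stop assembly */
        }
        high_delivered = stored + i;
    }
    return high_delivered;
}
```

In this model, if the flag for 1fe is clear, toy_deliver(&sq, 0x1fd, 0x205) returns 0x1fd on every call, however many later messages are present. That would match the repeated "Delivering 1fd to 205" lines with no "Delivering MCAST message" line after them. Once 1fe is marked (with 1ff and 200 also present), the same call advances to 0x200 and stops at the real hole at 201.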