----- Original Message -----
> From: "Haomai Wang" <haomaiwang@xxxxxxxxx>
> To: "Sage Weil" <sweil@xxxxxxxxxx>, "Gregory Farnum" <greg@xxxxxxxxxxx>
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Sent: Friday, February 6, 2015 12:26:18 AM
> Subject: About in_seq, out_seq in Messenger
>
> Hi all,
>
> Recently we enable a async messenger test job in test
> lab(http://pulpito.ceph.com/sage-2015-02-03_01:15:10-rados-master-distro-basic-multi/#).
> We hit many failed assert mostly are:
> assert(0 == "old msgs despite reconnect_seq feature");
>
> And assert connection all are cluster messenger which mean it's OSD
> internal connection. The policy associated this connection is
> Messenger::Policy::lossless_peer.
>
> So when I dive into this problem, I find something confusing about
> this. Suppose these steps:
> 1. "lossless_peer" policy is used by both two side connections.
> 2. markdown one side(anyway), peer connection will try to reconnect
> 3. then we restart failed side, a new connection is built but
> initiator will think it's a old connection so sending in_seq(10)
> 4. new started connection has no message in queue and it will receive
> peer connection's in_seq(10) and call discard_requeued_up_to(10). But
> because no message in queue, it won't modify anything
> 5. now any side issue a message, it will trigger "assert(0 == "old
> msgs despite reconnect_seq feature");"
>
> I can replay these steps in unittest and actually it's hit in test lab
> for async messenger which follows simple messenger's design.
>
> Besides, if we enable reset_check here, "was_session_reset" will be
> called and it will random out_seq, so it will certainly hit "assert(0
> == "skipped incoming seq")".
>
> Anything wrong above?

Sage covered most of this. I'll just add that the last time I checked it, I came to the conclusion that the code to use a random out_seq on initial connect was non-functional. So there definitely may be issues there. In fact, we've fixed a couple (several?)
bugs in this area since Firefly was initially released, so if you go over the point release SimpleMessenger patches you might gain some insight. :)
-Greg
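
For anyone following along, the five steps in the quoted message can be replayed with a toy model. This is only a minimal sketch, not SimpleMessenger or AsyncMessenger code: ToyConnection, Recv, replay_restart_scenario and replay_with_reset are hypothetical names, and only in_seq, out_seq and discard_requeued_up_to are borrowed from the thread. It assumes that a restarted side begins with all counters at 0 and that a receiver rejects any sequence number at or below its own in_seq. The second function is one possible reading of the reset_check paragraph, where the side that resets its session starts sending from a randomized out_seq.

```cpp
#include <cassert>   // only needed by the usage example below
#include <cstdint>
#include <deque>

// Toy model only. The field names mirror the thread, but nothing here is
// taken from Ceph's actual implementation.
enum class Recv { Delivered, OldMsg, SkippedSeq };

struct ToyConnection {
  std::uint64_t in_seq = 0;            // highest seq accepted from the peer
  std::uint64_t out_seq = 0;           // highest seq assigned to an outgoing message
  std::deque<std::uint64_t> requeued;  // seqs of unacked messages waiting for resend

  // Assign the next sequence number to an outgoing message.
  std::uint64_t send() { return ++out_seq; }

  // On reconnect the peer reports its in_seq; drop everything it already has.
  // With an empty queue this changes nothing, which is step 4 in the thread.
  void discard_requeued_up_to(std::uint64_t seq) {
    while (!requeued.empty() && requeued.front() <= seq) {
      requeued.pop_front();
    }
  }

  // Classify an incoming sequence number the way the two asserts do.
  Recv receive(std::uint64_t seq) {
    if (seq <= in_seq) return Recv::OldMsg;          // "old msgs despite reconnect_seq feature"
    if (seq > in_seq + 1) return Recv::SkippedSeq;   // "skipped incoming seq"
    in_seq = seq;
    return Recv::Delivered;
  }
};

// Steps 1-5: B restarts with fresh state while A still holds in_seq == 10.
Recv replay_restart_scenario() {
  ToyConnection a;  // surviving side
  ToyConnection b;  // side that gets restarted
  for (int i = 0; i < 10; ++i) a.receive(b.send());  // A ends with in_seq == 10
  b = ToyConnection{};                  // step 3: restart, counters back to 0
  b.discard_requeued_up_to(a.in_seq);   // step 4: empty queue, nothing changes
  return a.receive(b.send());           // step 5: B sends seq 1, A expects 11
}

// reset_check reading (an assumption): A resets its session and continues
// from a randomized out_seq, while the freshly started B still has in_seq == 0.
Recv replay_with_reset(std::uint64_t randomized_out_seq) {
  ToyConnection a;
  ToyConnection b;
  a.out_seq = randomized_out_seq;
  return b.receive(a.send());
}
```

Under those assumptions, replay_restart_scenario() returns Recv::OldMsg, and replay_with_reset(n) returns Recv::SkippedSeq for any n >= 1, for example assert(replay_with_reset(12345) == Recv::SkippedSeq). That matches the two asserts named in the quoted message, but it says nothing about where the real code goes wrong.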