On Fri, 6 Feb 2015, Haomai Wang wrote:
> Hi all,
>
> Recently we enabled an async messenger test job in the test
> lab (http://pulpito.ceph.com/sage-2015-02-03_01:15:10-rados-master-distro-basic-multi/#).
> We hit many failed asserts, mostly:
> assert(0 == "old msgs despite reconnect_seq feature");
>
> The asserting connections are all on the cluster messenger, which means
> they are OSD-internal connections. The policy associated with these
> connections is Messenger::Policy::lossless_peer.
>
> When I dug into this problem, I found something confusing about
> it. Suppose these steps:
> 1. The "lossless_peer" policy is used by both sides of the connection.
> 2. Mark down one side (by any means); the peer connection will try to
> reconnect.
> 3. Then we restart the failed side. A new connection is built, but the
> initiator thinks it is an old connection, so it sends in_seq(10).
> 4. The newly started connection has no messages in its queue. It receives
> the peer connection's in_seq(10) and calls discard_requeued_up_to(10), but
> because there are no messages in the queue, nothing is modified.

The way this case is normally handled is one layer up. The messenger
doesn't know at that level whether it is seeing a bug or whether a peer
that it marked down and forgot about doesn't know it is dead and is trying
to reconnect.

In OSD.cc, we have a check in require_same_or_newer_map that will defer
the Message if the other end is newer than us. If we are newer (as would
be the case if we marked down the peer and they don't realize they are
dead), then require_same_peer_instance() will catch it and mark_down(),
again throwing out all of our state about the old OSD session.

> 5. Now if either side issues a message, it will trigger "assert(0 == "old
> msgs despite reconnect_seq feature");"

IIRC the above checks ensure that we mark_down the connection before we
send a message on the old session. (Or, SimpleMessenger doesn't assert; I
forget which, sorry.)
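A toy model of steps 3 to 5 may make the sequence-number mismatch easier to
see. This is a sketch under stated assumptions, not Ceph code: ToyConnection,
Verdict, and replay_restart_scenario are hypothetical names, and the receive
rule (a seq at or below in_seq counts as old, a seq more than one ahead counts
as skipped) is an assumption chosen to mirror the two assert messages quoted
in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Toy model only; none of these types exist in Ceph.
enum class Verdict { Ok, Old, Skipped };

struct ToyConnection {
  uint64_t in_seq = 0;            // highest seq accepted from the peer
  uint64_t out_seq = 0;           // seq of the last message sent
  std::deque<uint64_t> requeued;  // seqs of unacked messages awaiting resend

  // Drop requeued messages the peer says it already has (seq <= acked).
  // With an empty queue this changes nothing, as described in step 4.
  void discard_requeued_up_to(uint64_t acked) {
    while (!requeued.empty() && requeued.front() <= acked)
      requeued.pop_front();
  }

  // Sending assigns the next sequence number.
  uint64_t send() { return ++out_seq; }

  // Assumed receiver rule (see lead-in): classify an incoming seq.
  Verdict receive(uint64_t seq) {
    if (seq <= in_seq) return Verdict::Old;          // "old msgs ..."
    if (seq > in_seq + 1) return Verdict::Skipped;   // "skipped incoming seq"
    in_seq = seq;
    return Verdict::Ok;
  }
};

// Replays steps 3 to 5 for a message sent from the restarted side.
Verdict replay_restart_scenario() {
  ToyConnection survivor;
  survivor.in_seq = 10;     // received 10 messages before the peer went down

  ToyConnection restarted;  // fresh process: counters at zero, empty queue

  // Steps 3 and 4: the survivor reports in_seq(10); the discard is a no-op.
  restarted.discard_requeued_up_to(survivor.in_seq);
  assert(restarted.requeued.empty() && restarted.out_seq == 0);

  // Step 5: the restarted side sends seq 1, which is <= the survivor's
  // in_seq of 10.
  return survivor.receive(restarted.send());
}
```

In this model, replay_restart_scenario() returns Verdict::Old, because the
restarted side starts counting from 1 while the survivor still remembers
in_seq 10. Throwing out the survivor's state (as mark_down() is described as
doing above) would correspond to replacing survivor with a fresh
ToyConnection before the message arrives.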
> I can replay these steps in a unit test, and it is actually hit in the
> test lab for the async messenger, which follows the simple messenger's
> design.
>
> Besides, if we enable reset_check here, "was_session_reset" will be
> called and it will randomize out_seq, so it will certainly hit "assert(0
> == "skipped incoming seq")".
>
> Is anything wrong above?

What happens when you run the unit test against SimpleMessenger?

sage

> --
> Best Regards,
>
> Wheat

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html