Re: About in_seq, out_seq in Messenger

On Fri, 6 Feb 2015, Haomai Wang wrote:
> Hi all,
> 
> Recently we enabled an async messenger test job in the test
> lab(http://pulpito.ceph.com/sage-2015-02-03_01:15:10-rados-master-distro-basic-multi/#).
> We hit many failed asserts, mostly:
>               assert(0 == "old msgs despite reconnect_seq feature");
> 
> The asserting connections all belong to the cluster messenger, which
> means they are OSD internal connections. The policy associated with
> these connections is Messenger::Policy::lossless_peer.
> 
> When I dug into this problem, I found something confusing about
> it. Suppose these steps:
> 1. The "lossless_peer" policy is used by both sides of the connection.
> 2. Mark down one side (by any means); the peer connection will try to
> reconnect.
> 3. Then we restart the failed side. A new connection is built, but the
> initiator thinks it's an old connection, so it sends in_seq(10).
> 4. The newly started connection has no messages in its queue. It
> receives the peer connection's in_seq(10) and calls
> discard_requeued_up_to(10), but because there are no messages in the
> queue, nothing is modified.

This case is normally handled one layer up.  At that level the messenger 
can't tell whether it is seeing a bug or a peer that it marked down and 
forgot about, which doesn't know it is dead and is trying to reconnect.

In OSD.cc, we have a check in require_same_or_newer_map that will defer 
the Message if the other end is newer than us.  If we are newer (as would 
be the case if we marked down the peer and they don't realize they are 
dead) then require_same_peer_instance() will catch it and mark_down(), 
again throwing out all of our state about the old osd session.
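
As a rough illustration of the two checks described above, here is a
minimal C++ sketch. It is not the code in OSD.cc: Action, PeerInfo and
check_incoming are hypothetical names made up for this example, and the
real require_same_or_newer_map / require_same_peer_instance logic handles
more cases than this.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, simplified stand-ins; these are NOT Ceph's real types.
enum class Action { Process, DeferUntilNewerMap, MarkDownAndDrop };

struct PeerInfo {
  std::string addr;   // address the peer is reachable at
  uint64_t nonce;     // distinguishes successive incarnations of a daemon
  bool operator==(const PeerInfo& o) const {
    return addr == o.addr && nonce == o.nonce;
  }
};

// msg_epoch: map epoch the sender attached to its message.
// my_epoch:  map epoch we currently have.
// sender:    instance the message actually came from.
// known:     instance our map currently records for that OSD.
Action check_incoming(uint64_t msg_epoch, uint64_t my_epoch,
                      const PeerInfo& sender, const PeerInfo& known) {
  // The other end is newer than us: defer the message until we catch up.
  if (msg_epoch > my_epoch)
    return Action::DeferUntilNewerMap;
  // We are at least as new, and our map says this is not the current
  // instance of that peer: mark the connection down and drop the
  // old session state.
  if (!(sender == known))
    return Action::MarkDownAndDrop;
  return Action::Process;
}
```

With this sketch, a message from a stale incarnation (same address, older
nonce) carrying an older epoch yields MarkDownAndDrop, while a message
carrying a newer epoch than ours yields DeferUntilNewerMap.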

> 5. Now if either side issues a message, it will trigger "assert(0 == "old
> msgs despite reconnect_seq feature");"

IIRC the above checks ensure that we mark_down the connection before we 
send a message on the old session.  (Or, SimpleMessenger doesn't assert; I 
forget which, sorry.)
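
For reference, the sequence in steps 1-5 of the quoted message can be
reproduced with a toy model. This is only a sketch under assumptions:
ToyConn, RecvResult and replay_restart_scenario are hypothetical names,
the receive-side checks are reduced to the two conditions named by the
asserts in this thread, and failures are returned as values rather than
asserted.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical toy model; NOT the real Pipe/AsyncConnection code.
enum class RecvResult { Ok, OldMessage, SkippedSeq };

struct ToyConn {
  uint64_t in_seq = 0;            // highest seq received so far
  uint64_t out_seq = 0;           // highest seq sent so far
  std::deque<uint64_t> requeued;  // seqs of sent messages awaiting resend

  // Sending assigns the next outgoing sequence number.
  uint64_t send() { return ++out_seq; }

  // Drop requeued messages the peer has already seen. With an empty
  // queue this changes nothing, as described in step 4.
  void discard_requeued_up_to(uint64_t seq) {
    while (!requeued.empty() && requeued.front() <= seq)
      requeued.pop_front();
  }

  RecvResult receive(uint64_t seq) {
    if (seq <= in_seq)
      return RecvResult::OldMessage;  // "old msgs despite reconnect_seq feature"
    if (seq > in_seq + 1)
      return RecvResult::SkippedSeq;  // "skipped incoming seq"
    in_seq = seq;
    return RecvResult::Ok;
  }
};

// Replays steps 2-5: the surviving side keeps in_seq == 10, the restarted
// side starts from scratch, and the first new message carries seq 1.
RecvResult replay_restart_scenario() {
  ToyConn survivor;
  survivor.in_seq = 10;
  ToyConn restarted;                                  // fresh state after restart
  restarted.discard_requeued_up_to(survivor.in_seq);  // no-op: queue is empty
  return survivor.receive(restarted.send());
}
```

In this model replay_restart_scenario() returns RecvResult::OldMessage,
which corresponds to the first assert. A sender whose out_seq has jumped
ahead (for example, seq 15 arriving when in_seq is 10) gives
RecvResult::SkippedSeq instead.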

> I can reproduce these steps in a unit test, and it is actually hit in
> the test lab for the async messenger, which follows the simple
> messenger's design.
> 
> Besides, if we enable reset_check here, "was_session_reset" will be
> called and it will randomize out_seq, so it will certainly hit "assert(0
> == "skipped incoming seq")".
> 
> Anything wrong above?

What happens when you run the unit test against SimpleMessenger?

sage


> 
> -- 
> Best Regards,
> 
> Wheat
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
