Re: About in_seq, out_seq in Messenger

Haomai Wang <haomaiwang@xxxxxxxxx> · Sat, 7 Feb 2015 00:00:43 +0800

On Fri, Feb 6, 2015 at 5:40 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Fri, 6 Feb 2015, Haomai Wang wrote:
>> Hi all,
>>
>> Recently we enabled an async messenger test job in the test
>> lab (http://pulpito.ceph.com/sage-2015-02-03_01:15:10-rados-master-distro-basic-multi/#).
>> We hit many failed asserts, mostly:
>>               assert(0 == "old msgs despite reconnect_seq feature");
>>
>> The asserting connections all belong to the cluster messenger, which
>> means they are OSD-internal connections. The policy associated with
>> these connections is Messenger::Policy::lossless_peer.
>>
>> When I dug into this problem, I found something confusing.
>> Suppose these steps:
>> 1. The "lossless_peer" policy is used by both sides of the connection.
>> 2. One side is marked down (for whatever reason); the peer connection
>> will try to reconnect.
>> 3. We then restart the failed side. A new connection is built, but the
>> initiator thinks it is an old connection, so it sends in_seq(10).
>> 4. The newly started connection has no messages in its queue. It
>> receives the peer connection's in_seq(10) and calls
>> discard_requeued_up_to(10), but because the queue is empty, nothing
>> is modified.
>
> The way this case is normally handled is one layer up.  The messenger
> doesn't know at that level whether it is seeing a bug or whether someone
> it marked down and forgot about doesn't know it is dead and is trying to
> reconnect.
>
> In OSD.cc, we have a check in require_same_or_newer_map that will defer
> the Message if the other end is newer than us.  If we are newer (as would
> be the case if we marked down the peer and they don't realize they are
> dead) then require_same_peer_instance() will catch it and mark_down(),
> again throwing out all of our state about the old osd session.
>
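
To make the layering concrete, here is a minimal sketch of the decision
described above. It is a toy model, not the code in OSD.cc: the types
LocalView and PeerMessage, the instance_id field, and
check_peer_message() are all made up for illustration, and it assumes
that "same peer instance" can be reduced to comparing an opaque id.

```cpp
#include <cassert>
#include <cstdint>

// What the receiving OSD decides to do with an incoming peer message.
enum class Action { Defer, MarkDown, Process };

// Hypothetical, simplified view of our own state about the peer.
struct LocalView {
  uint32_t map_epoch;     // our current map epoch
  bool peer_up;           // does our map say the peer is up?
  uint64_t instance_id;   // instance of the peer our map knows about
};

// Hypothetical, simplified incoming message.
struct PeerMessage {
  uint32_t map_epoch;     // map epoch the sender had
  uint64_t instance_id;   // instance of the sender
};

Action check_peer_message(const LocalView& me, const PeerMessage& m) {
  // The other end is newer than us: defer until we catch up.
  if (m.map_epoch > me.map_epoch)
    return Action::Defer;
  // We are same or newer, but the sender is not the instance we know:
  // it is a stale session, so drop all state about it.
  if (!me.peer_up || me.instance_id != m.instance_id)
    return Action::MarkDown;
  return Action::Process;
}

// Usage: a peer we marked down in epoch 7 still talks with epoch 6.
void self_check() {
  LocalView me{7, false, 42};
  PeerMessage stale{6, 42};
  assert(check_peer_message(me, stale) == Action::MarkDown);
}
```

The point of the sketch is only that the decision depends on map state,
which the messenger layer does not have.
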
>> 5. Now when either side sends a message, it triggers "assert(0 == "old
>> msgs despite reconnect_seq feature");"
>
> IIRC the above checks ensure that we mark_down the connection before we
> send a message on the old session.  (Or, SimpleMessenger doesn't assert; I
> forget which, sorry.)
>
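
Steps 1-5 can be sketched with a toy model of the sequence numbers. This
is not the SimpleMessenger or AsyncMessenger code: Endpoint, Verdict and
replay_restart_scenario() are hypothetical names, and the model assumes
that out_seq is simply the last sequence number handed out and that
discarding requeued messages leaves it alone.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

enum class Verdict { Accepted, OldMessage, SkippedSeq };

// Toy model of one end of a lossless connection.
struct Endpoint {
  uint64_t in_seq = 0;            // highest seq received from the peer
  uint64_t out_seq = 0;           // last seq assigned to an outgoing message
  std::deque<uint64_t> requeued;  // seqs of sent messages waiting for resend

  // Drop requeued messages the peer reports it already received.
  // With an empty queue this changes nothing (step 4).
  void discard_requeued_up_to(uint64_t peer_in_seq) {
    while (!requeued.empty() && requeued.front() <= peer_in_seq)
      requeued.pop_front();
  }

  // Assign the next seq to a new outgoing message.
  uint64_t send() { return ++out_seq; }

  // Receiver-side check of an incoming seq.
  Verdict receive(uint64_t seq) {
    if (seq <= in_seq)
      return Verdict::OldMessage;  // "old msgs despite reconnect_seq feature"
    Verdict v = (seq > in_seq + 1) ? Verdict::SkippedSeq  // "skipped incoming seq"
                                   : Verdict::Accepted;
    in_seq = seq;
    return v;
  }
};

// Steps 3-5: the surviving side has in_seq = 10, the restarted side is fresh.
Verdict replay_restart_scenario() {
  Endpoint survivor;
  survivor.in_seq = 10;
  Endpoint restarted;
  restarted.discard_requeued_up_to(survivor.in_seq);
  assert(restarted.out_seq == 0);               // nothing was modified
  return survivor.receive(restarted.send());    // first new message has seq 1
}
```

In this model replay_restart_scenario() returns Verdict::OldMessage,
because seq 1 from the restarted side is not greater than the survivor's
in_seq of 10.
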
>> I can replay these steps in a unit test, and this is actually hit in
>> the test lab for the async messenger, which follows the simple
>> messenger's design.
>>
>> Besides, if we enable reset_check here, "was_session_reset" will be
>> called and it will randomize out_seq, so it will certainly hit
>> "assert(0 == "skipped incoming seq")".
>>
>> Is anything wrong in the above?
>
> What happens when you run the unit test against SimpleMessenger?

You can just run "./ceph_test_msgr --gtest_filter="*Inject*/1"
--ms_die_on_skipped_message=true" and you will get the assert failure.

It's just because of the random seq.
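
A rough illustration of why a randomized out_seq trips the check, under
stated assumptions: the helper names below are hypothetical, the 31-bit
mask is an assumption made for this sketch, and the receiver is taken to
have a small in_seq (10 here).

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Receiver-side check (toy version): anything beyond in_seq + 1 is a gap.
bool is_skipped(uint64_t in_seq, uint64_t msg_seq) {
  return msg_seq > in_seq + 1;
}

// Hypothetical stand-in for the randomization done on session reset.
uint64_t random_out_seq() {
  std::random_device rd;
  return rd() & 0x7fffffffu;
}

// First seq the reset side puts on the wire after the reset.
uint64_t first_seq_after_reset(uint64_t out_seq) {
  return out_seq + 1;
}

// Usage: any out_seq above 10 makes the next message look like a gap
// to a receiver whose in_seq is 10.
void self_check() {
  assert(is_skipped(10, first_seq_after_reset(12345)));
  assert(!is_skipped(10, first_seq_after_reset(10)));
}
```

With a 31-bit random value, almost every draw lands far above the
receiver's in_seq, so the "skipped incoming seq" branch is the expected
outcome rather than a rare one.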

>
> sage
>
>
>>
>> --
>> Best Regards,
>>
>> Wheat
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

-- 
Best Regards,

Wheat