Re: About in_seq, out_seq in Messenger

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Feb 6, 2015 at 10:47 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> ----- Original Message -----
>> From: "Haomai Wang" <haomaiwang@xxxxxxxxx>
>> To: "Sage Weil" <sweil@xxxxxxxxxx>, "Gregory Farnum" <greg@xxxxxxxxxxx>
>> Cc: ceph-devel@xxxxxxxxxxxxxxx
>> Sent: Friday, February 6, 2015 12:26:18 AM
>> Subject: About in_seq, out_seq in Messenger
>>
>> Hi all,
>>
>> Recently we enable a async messenger test job in test
>> lab(http://pulpito.ceph.com/sage-2015-02-03_01:15:10-rados-master-distro-basic-multi/#).
>> We hit many failed assert mostly are:
>>               assert(0 == "old msgs despite reconnect_seq feature");
>>
>> And assert connection all are cluster messenger which mean it's OSD
>> internal connection. The policy associated this connection is
>> Messenger::Policy::lossless_peer.
>>
>> So when I dive into this problem, I find something confusing about
>> this. Suppose these steps:
>> 1. "lossless_peer" policy is used by both two side connections.
>> 2. markdown one side(anyway), peer connection will try to reconnect
>> 3. then we restart failed side, a new connection is built but
>> initiator will think it's a old connection so sending in_seq(10)
>> 4. new started connection has no message in queue and it will receive
>> peer connection's in_seq(10) and call discard_requeued_up_to(10). But
>> because no message in queue, it won't modify anything
>> 5. now any side issue a message, it will trigger "assert(0 == "old
>> msgs despite reconnect_seq feature");"
>>
>> I can replay these steps in unittest and actually it's hit in test lab
>> for async messenger which follows simple messenger's design.
>>
>> Besides, if we enable reset_check here, "was_session_reset" will be
>> called and it will random out_seq, so it will certainly hit "assert(0
>> == "skipped incoming seq")".
>>
>> Anything wrong above?
>
> Sage covered most of this. I'll just add that the last time I checked it, I came to the conclusion that the code to use a random out_seq on initial connect was non-functional. So there definitely may be issues there.
>
> In fact, we've fixed a couple (several?) bugs in this area since Firefly was initially released, so if you go over the point release SimpleMessenger patches you might gain some insight. :)
> -Greg

If we want to make random out_seq functional, I think we need to
exchange "out_seq" when handshaking too. Otherwise, we need to give it
up.

Another question, do you think "reset_check=true" is always good for
osd internal connection?

Let Messenger rely on upper layer may not a good idea, so maybe we can
enhance "in_seq" exchange process(ensure each side
in_seq+sent.size()==out_seq). From the current handshake impl, it's
not easy to insert more action to "in_seq" exchange process, because
this session has been built regardless of the result of "in_seq"
process.

If enable "reset_check=true", it looks we can solve most of incorrect
seq out-of-sync problem?

-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [CEPH Users]     [Ceph Large]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]
  Powered by Linux