Re: CEPH - messenger rebind race

On Thu, May 8, 2014 at 12:12 AM, Guang <yguang11@xxxxxxxxx> wrote:
> Thanks Sage and Greg!
>
> I just captured a crash for which I have the logs from both sides
> (sadly we still use debug_ms=0, as increasing the debug_ms level
> consumes dramatically more disk space). From the logs, I think I have
> some more clues about how this failure happened.
>
> Let us say we have two messengers, C and S. On the C side, it did a
> rebind and cleared everything.
> The logs on the C side are:
> ————
>   -6> 2014-05-07 12:29:35.362448 7fcc26e8c700  2 -- 10.193.207.191:6918/24089488 >> 10.193.207.182:6982/57027253 pipe(0x19cfec80 sd=139 :48953 s=1 pgs=0 cs=0 l=0 c=0x13d62c60). got newly_acked_seq 59 vs out_seq 0
>   -5> 2014-05-07 12:29:35.362487 7fcc26e8c700  2 -- 10.193.207.191:6918/24089488 >> 10.193.207.182:6982/57027253 pipe(0x19cfec80 sd=139 :48953 s=1 pgs=0 cs=0 l=0 c=0x13d62c60). discarding previously sent 0 osd_map(36192..36196 src has 35392..36196) v3
>   -4> 2014-05-07 12:29:35.362504 7fcc26e8c700  2 -- 10.193.207.191:6918/24089488 >> 10.193.207.182:6982/57027253 pipe(0x19cfec80 sd=139 :48953 s=1 pgs=0 cs=0 l=0 c=0x13d62c60). discarding previously sent 0 pg_notify(4.4d21(21) epoch 36196) v4
>   -3> 2014-05-07 12:29:35.362530 7fcc26e8c700  2 -- 10.193.207.191:6918/24089488 >> 10.193.207.182:6982/57027253 pipe(0x19cfec80 sd=139 :48953 s=1 pgs=0 cs=0 l=0 c=0x13d62c60). discarding previously sent 0 pg_notify(3.4d22(21) epoch 36196) v4
>   -2> 2014-05-07 12:29:35.362561 7fcc26e8c700  2 -- 10.193.207.191:6918/24089488 >> 10.193.207.182:6982/57027253 pipe(0x19cfec80 sd=139 :48953 s=1 pgs=0 cs=0 l=0 c=0x13d62c60). discarding previously sent 0 pg_query(4.3279 epoch 36196) v2
>   -1> 2014-05-07 12:29:35.362571 7fcc26e8c700  2 -- 10.193.207.191:6918/24089488 >> 10.193.207.182:6982/57027253 pipe(0x19cfec80 sd=139 :48953 s=1 pgs=0 cs=0 l=0 c=0x13d62c60). discarding previously sent 0 pg_query(3.327a epoch 36196) v2
>    0> 2014-05-07 12:29:35.363457 7fcc26e8c700 -1 msg/Pipe.cc: In function 'int Pipe::connect()' thread 7fcc26e8c700 time 2014-05-07 12:29:35.362587
> msg/Pipe.cc: 1045: FAILED assert(m)
> ————
> The logs on the S side are:
> ———
> 2014-05-07 12:29:09.820735 7eff50794700  0 -- 10.193.207.182:6982/57027253
>>> 10.193.207.191:6918/24089488 pipe(0x170bf680 sd=63 :57524 s=2 pgs=24805
> cs=1 l=0 c=0x1691b340).fault with nothing to send, going to standby
> 2014-05-07 12:29:35.361030 7eff40042700 10 osd.363 36196  new session
> 0x17d33500 con=0x17fdf080 addr=10.193.207.191:6918/24089488
> 2014-05-07 12:29:35.361068 7eff40042700  0 -- 10.193.207.182:6982/57027253
>>> 10.193.207.191:6918/24089488 pipe(0xfe07400 sd=343 :6982 s=0 pgs=0 cs=0
> l=0 c=0x17fdf080).accept
>  connect_seq 0 vs existing 1 state standby
> 2014-05-07 12:29:35.361078 7eff40042700  0 -- 10.193.207.182:6982/57027253
>>> 10.193.207.191:6918/24089488 pipe(0xfe07400 sd=343 :6982 s=0 pgs=0 cs=0
> l=0 c=0x17fdf080).accept
>  peer reset, then tried to connect to us, replacing
> ———
> It looks like the following round trips happened between the two sides:
>  1. C tried to establish a new connection to S (when it had some
> message to send) with cs=0 (and a new, larger pgs).
>  2. On the S side, there was an existing connection with C (cs=1,
> pgs=24805, in_seq>59).
>  3. This is what happened on the S side:
>       3.1 look up the pipe and find an existing one -
> https://github.com/ceph/ceph/blob/master/src/msg/Pipe.cc#L445
>       3.2 the execution flow comes to replacing the existing pipe -
> https://github.com/ceph/ceph/blob/master/src/msg/Pipe.cc#L473
>       3.3 as this is a lossy connection, the original pipe is cleared.
>       3.4 then it comes to the open section
> (https://github.com/ceph/ceph/blob/master/src/msg/Pipe.cc#L649) and
> echoes back a reply with tag CEPH_MSGR_TAG_SEQ along with the in_seq
> from the original pipe.
>  4. On the C side:
>       4.1 as this is a brand-new pipe, out_seq is zero; however, the
> reply above comes along with the original newly_acked_seq, which
> triggers the assertion failure we observed.
>
> If this makes sense, then at step 3.4 we probably should not echo the
> original seq if this is a brand-new connection attempt (cs=0).
>
> Does this look reasonable?

Ah! Yes, you're exactly right. This error makes sense too; the
CEPH_MSGR_TAG_SEQ handling is much newer than the rest of the
reconnect logic.
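
For reference, the loop that trips here in Pipe::connect() looks
roughly like this (paraphrased from memory, so check the exact code in
your tree):

  // CEPH_MSGR_TAG_SEQ handling in Pipe::connect(): the peer claims it
  // already received everything up to newly_acked_seq, so discard
  // those messages from the local queue. On a brand-new pipe, out_seq
  // is 0 and the out queue is empty, so _get_next_outgoing() returns
  // NULL and assert(m) fires (msg/Pipe.cc:1045 in your backtrace).
  while (newly_acked_seq > out_seq) {
    Message *m = _get_next_outgoing();
    assert(m);
    ldout(msgr->cct, 2) << " discarding previously sent " << m->get_seq()
                        << " " << *m << dendl;
    m->put();
    ++out_seq;
  }

The accept-side guard could then look something like this (an untested
sketch, not a final patch; I'm assuming the reply_tag/existing_seq
variables from the replace section of Pipe::accept(), and the real
change may need to cover more cases):

  // Only advertise the old in_seq when the peer is genuinely
  // reconnecting an existing session. connect_seq == 0 means a
  // brand-new attempt whose out_seq starts at 0, so telling it to
  // skip ahead to the old in_seq makes no sense.
  if (connect.features & CEPH_FEATURE_RECONNECT_SEQ &&
      connect.connect_seq > 0) {
    reply_tag = CEPH_MSGR_TAG_SEQ;
    existing_seq = existing->in_seq;
  }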

On Thu, May 8, 2014 at 4:58 AM, Guang <yguang11@xxxxxxxxx> wrote:
> I think the next question is: why did C fail to find a pipe (locally)
> for S when there actually is a pipe still existing on the S side
> (with C)? Is it possible that C unregistered the pipe but S was not
> yet aware of it?

Exactly. In the other logs I've seen, C has done a mark_down_all()
(which kills every single pipe) while doing a rebind(); generally this
is a consequence of getting an OSDMap which says it's down. If S has
not yet seen this map, it won't have nuked the Connection on its
side.
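
To make the race concrete, the C-side sequence is roughly as follows
(a sketch of the flow, not verbatim OSD code; the call site and
argument names are from memory and may differ in your tree):

  // OSD map handling (roughly): the new map says we are down but we
  // are in fact alive, so re-bind the cluster messenger to a new
  // port. SimpleMessenger::rebind() does a mark_down_all() internally,
  // closing and unregistering every Pipe, including the one to S.
  cluster_messenger->rebind(avoid_ports);

  // S has not processed that map yet, so its old Pipe to C (cs=1,
  // in_seq > 59) merely faulted into standby and stays registered.
  // C's next message to S opens a fresh Pipe (cs=0, out_seq=0), which
  // S then "replaces" onto the stale session state, echoing the old
  // in_seq back in the CEPH_MSGR_TAG_SEQ reply.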

Would you like to prepare a patch, or should I? :)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com



