Re: msgr2 protocol

Jeff Layton <jlayton@xxxxxxxxxx> · Tue, 13 Sep 2016 10:48:54 -0400

On Tue, 2016-09-13 at 13:31 +0000, Sage Weil wrote:
> On Tue, 13 Sep 2016, Jeff Layton wrote:
> > On Sun, 2016-09-11 at 17:05 +0000, Sage Weil wrote:
> > > On Sat, 10 Sep 2016, Haomai Wang wrote:
> > > > About thing is v1/v2 compatible. I rethink the details:
> > > > 
> > > > 0. we need to define the new banner which must longer than before("ceph v027")
> > > > 1. assume msgr v2 banner is "ceph v2 %64llx %64llx\n"
> > > > 2. both in simle/async codes, server side must issue banner firstly
> > > > 3. if server side supports v2 and client only supports v1, client will
> > > > receive 9 bytes and do memcmp, then reject this connection via closing
> > > > socket. So server side could retry the older version
> > > > 4. if server side only supports v1 and client supports v2, client
> > > > according banner to reply corresponding banner
> > > > 
> > > > This tricky design is based on the implementation fact "accept side
> > > > issue the banner firstly" and "new banner is longer than old banner",
> > > > and this way doesn't need to involve other dependences like mon port
> > > > changes.
> > > > 
> > > > Does this way has problem?
> > > 
> > > I was thinking we avoid this problem and any hacky initial handshakes by 
> > > speaking v2 on the new port and v1 on the old port.  Then the monmap has 
> > > an entity_addrvec_t with both a v1 and v2 address (encoding with just the 
> > > v1 address for old clients). Same for the OSDs.
> > > 
> > > The v1 handshake just isn't extensible (how do you tell a v2 client 
> > > connecting that you speak both v1 and v2?).
> > > 
> > 
> > Depending on port assignments for the protocol is pretty icky though.
> > There may be valid reasons to use different ports in some environments
> > and then that heuristic goes right out the window.
> > 
> > One thing that is really strange about both the old and new protocols
> > is that they have the client and server sending the initial exchange
> > concurrently, or have the server send it first.  While it may speed up
> > the initial negotiation slightly, it makes it really hard to handle
> > fallback to earlier protocol versions (as Haomai pointed out), as the
> > client is responsible for handing reconnects.
> > 
> > Consider the case where we have a client that supports only v1 but a
> > server that supports v1 and v2. Client connects and then server sends a
> > v2 message. Client doesn't understand it and closes the connection and
> > reconnects, only to end up in the same situation on the second attempt.
> > 
> > There's no way for the server to preserve the state from the initial
> > connection attempt and handle the new connection with v1. Would it not
> > make more sense to have the client connect and send its initial banner,
> > and then let the server decide what sort of banner to send based on
> > what the client sent?
> 
> This is why the v2 banner has the features values (%lx with supported and 
> required bits).  Clients and servers (connecter and accepters, really, 
> since servers talk to each other too) can concurrently announce what they 
> support and require and then go from there.  It doesn't help with the v1 
> transition, but the addrvec changes (entity_addr_t now has a type 
> indicating which protocol is spoken, and multiple addrs can be listed for 
> any server) along with a mon port change (which we have to do anyway to 
> switch to our IANA assigned port) handle the v1 transition.
> 

Ahh ok, I didn't realize ceph was squatting on a port! Ok, then if
you're planning to switch to a new well-known port anyway, then a clean
break like this makes more sense.

I'll confess though that I don't quite understand the whole point of
the entity_addr_t's. What purpose does it serve to exchange network
addresses here? Is it simply to inform the peer of other ways that it
can be reached? What happens if I pick up my laptop that's acting as a
ceph client and wander onto a new network. Does anything break?

> Are there other reasons to do the client banner first?  
> 
> 

I think that's the main one.

The only other reason I can think of might be to guard against
information disclosure to port scanners. If you require the client to
send a banner first, then the server could drop the connection if it
doesn't look right without ever sending anything.

That said, given that we're going to be using a IANA designated port in
most cases, that's not going to be terribly useful. The port scanners
would just send a bogus but legit-looking banner to that port.

> > 
> > > > 
> > > > 
> > > > > On Sat, Sep 10, 2016 at 11:37 AM, Haomai Wang <haomai@xxxxxxxx> wrote:
> > > > >
> > > > >
> > > > > > On Sat, Sep 10, 2016 at 5:14 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > > > >>
> > > > >> On Sat, 10 Sep 2016, Haomai Wang wrote:
> > > > >> > @sage in current impl, when logic fault like state mismatch, data format
> > > > >> > mismatch or anything else, connection will abort session via closing
> > > > >> > socket.
> > > > >> > And the peer side would do something according to policy too.
> > > > >> > In msgr v2 when introducing multi streams in the same connection, we
> > > > >> > can't
> > > > >> > simply abort socket to indicate something wrong now. I think we need to
> > > > >> > introduce TAG_ABORT with error message.
> > > > >> >
> > > > >> > But the peer side may stuck into a state like reading enough data as
> > > > >> > "length" indicate. It may miss the TAG_ABORT notify or other reconnect
> > > > >> > tag.
> > > > >> > A tricky thing is we use tcp OOB bit to send which exactly trigger
> > > > >> > "urgent"
> > > > >> > signal when receiving, but it only occur 1 byte in tcp proto which can
> > > > >> > be
> > > > >> > used here to indicate the stream id(32bit designed now).
> > > > >> >
> > > > >> > What's more, multi stream mixed within one socket may make trouble to
> > > > >> > message receiving when potential tcp packet silent error. So it looks we
> > > > >> > can't use the same socket the multi stream to meet our demands.
> > > > >> >
> > > > >> > Any idea?
> > > > >>
> > > > >> I think we intorduce a TAG_ABORT to interrupt the stream.  And then we
> > > > >> have to assume that the low-level msgr2 implementation that reads and
> > > > >> writes frames (which have their own frame_len) is not buggy.  In practice,
> > > > >> the aborts tend to happen because we get a message we don't understand
> > > > >> (version mismatch, encoding compatibility bug, etc.), and that'll happen
> > > > >> at a higher level after frames have been read... so a TAG_ABORT will be
> > > > >> sufficient.
> > > > >
> > > > >
> > > > > yes, if frame is ok. It should be ok.... Let's go through this firstly...
> > > > > The worse case is the frame length is not expected as data transferred.
> > > > >
> > > > >>
> > > > >>
> > > > >> Also, we can have an option to make aborts close the socket.  That'll be
> > > > >> fine for now anyway, although later it's probably to disruptive when
> > > > >> multiple streams are sharing a socket...
> > > > >>
> > > > >> sage
> > > > >>
> > > > >>
> > > > >>  >
> > > > >> >
> > > > >> >
> > > > > >> > On Mon, Jun 13, 2016 at 7:59 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > > > >> >       On Sat, 11 Jun 2016, Marcus Watts wrote:
> > > > >> >       > If the client doesn't look at "features" before it sends
> > > > >> >       stuff, it
> > > > >> >       > will not be able to be very smart about taking advantage of
> > > > >> >       some
> > > > >> >       > future better method.  In fact, there isn't much advantage
> > > > >> >       > to the server sending anything early - it could just as easily
> > > > >> >       > wait until after it's seen the clients request.
> > > > >> >       >
> > > > >> >       > Failing hard & retrying on a failed reconnect is going to be
> > > > >> >       slower.
> > > > >> >       > On the bright side, at least it shouldn't happen often.
> > > > >> >
> > > > >> >       Yep.  Well, I think it is the client's (limited choice).  If it
> > > > >> >       needs to
> > > > >> >       know the server features, it needs to either wait for them, or
> > > > >> >       make some
> > > > >> >       optimistic choice and be prepared to pay the cost of a mistake.
> > > > >> >       We should
> > > > >> >       give the client choice, though, if we can.
> > > > >> >
> > > > >> >       > If you're sending encryption (w/ different auth or keys) from
> > > > >> >       several
> > > > >> >       > different streams, how are you planning to indicate which bits
> > > > >> >       > go with which scheme?, and which bits are you planning to
> > > > >> >       encrypt
> > > > >> >       > and which not?
> > > > >> >
> > > > >> >       This is what he stream ids are for, and why the outer portion of
> > > > >> >       the frame
> > > > >> >       is unencrypted.  See
> > > > >> >
> > > > >> >
> > > > >> > https://github.com/ceph/ceph/pull/9461/files#diff-83789b4be697d82eedbcbe330
> > > > >> >       c44b436R68
> > > > >> >
> > > > >> >        +  stream_id (le32)
> > > > >> >        +  frame_len (le32)
> > > > >> >        +  tag (TAG_* byte)
> > > > >> >        +  payload
> > > > >> >        +  [payload padding -- only present after stream auth phase]
> > > > >> >        +  [signature -- only present after stream auth phase]
> > > > >> >
> > > > >> >       The tag and payload (and padding) would be encrypted or signed,
> > > > >> >       but not
> > > > >> >       the stream id and frame_len.
> > > > >> >
> > > > >> >       > Byte count limits.  Basically, you don't want collisions
> > > > >> >       because
> > > > >> >       > of duplicated keys or data.  This depends on your crypto
> > > > >> >       system,
> > > > >> >       > so, for instance, you should not encrypt with one key more
> > > > >> >       than
> > > > >> >       >       aes, cbc        about 2^68 bytes
> > > > >> >       >       aes, ctr        exactly 2^128 bytes
> > > > >> >       > more generally, this depends on mode, blocksize, ...
> > > > >> >       > This applies across *all* uses of the key - and so you would
> > > > >> >       > generally want to use the session key directly as little as
> > > > >> >       possible.
> > > > >> >       > (in particular, using the session key for ctr directly would
> > > > >> >       be very very bad.)
> > > > >> >       >
> > > > >> >       > If you've got multiple streams going already, you should be
> > > > >> >       able
> > > > >> >       > to include a fairly simple rekey method with little effort.
> > > > >> >       > For instance, as part of the method, you could,
> > > > >> >       >       up front as part of the method
> > > > >> >       >               send a per-stream key encrypted under the shared
> > > > >> >       secret.
> > > > >> >       >       prepend to the first data sent in a payload
> > > > >> >       >               byte limit, stream key #0 (encrypted under the
> > > > >> >       per-stream key)
> > > > >> >       >               then encrypt the next N bytes with stream key #0
> > > > >> >       >       when the byte limit is reached, prepend to the
> > > > >> >       >               next data sent in a payload
> > > > >> >       >               byte limit, stream key #1 (encrypted under the
> > > > >> >       per-stream key)
> > > > >> >       >               then encrypt the next N bytes with stream key #1
> > > > >> >       >       &etc.
> > > > >> >
> > > > >> >       Good idea.  If I understand correctly, it means that the
> > > > >> >       session_key is
> > > > >> >       only used to send the new/next random encryption key, and if we
> > > > >> >       make the
> > > > >> >       byte limit part of the initial protocol we get the rotation we
> > > > >> >       need.  It
> > > > >> >       might be simpler to do it as a frame limit instead of byte
> > > > >> >       limit, and
> > > > >> >       assume max-length frames (2^32 bytes).  We could still be super
> > > > >> >       conservative and rotate the encryption key every 2^16 messages
> > > > >> >       or
> > > > >> >       something...?  And rotating the key on frame boundaries should
> > > > >> >       be much
> > > > >> >       simpler to implement.
> > > > >> >
> > > > >> >       Anyway, that part can be defined a bit later, I think.
> > > > >> >
> > > > >> >       Thanks!
> > > > >> >       sage
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >
> > > > >
> > > > 
> > > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > -- 
> > > Jeff Layton <jlayton@xxxxxxxxxxxxxxx>
> > -- 
> > > Jeff Layton <jlayton@xxxxxxxxxx>
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
-- 
Jeff Layton <jlayton@xxxxxxxxxx>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html