Re: msgr2 protocol

Sage Weil <sweil@xxxxxxxxxx> · Thu, 2 Jun 2016 11:43:57 -0400 (EDT)

Based on the discussion during CDM yesterday I wrote up a nicer-looking 
spec of the protocol in rst:

	https://github.com/ceph/ceph/pull/9461

Please let me know if this looks right.  I have two questions:

1. Is TAG_START really necessary?  I guess it doesn't hurt, and makes 
it easy to add flags later.

2. We don't explicitly have anything here that indicates whether a session 
is stateless or stateful.  Currently this is determined by the Policy stuff 
on either end and the peers just happen to agree.  Setting/asserting 
it explicitly as part of the handshake seems like a good idea.  Maybe a 
flags field in the TAG_IDENT message, with flags for lossy/lossless and 
whether we initiate connections (true for clients or p2p servers)?
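
To make the flags idea concrete, here is a rough sketch of what I have in 
mind.  The flag names, bit values, and the check_ident helper are all 
made up for illustration; nothing like this is in the spec yet.

```python
from enum import IntFlag


class IdentFlags(IntFlag):
    """Hypothetical bits for a TAG_IDENT flags field (names/values are illustrative)."""
    LOSSY = 0x1      # stateless session; unset means lossless/stateful
    INITIATOR = 0x2  # this peer initiates connections (client or p2p server)


def check_ident(local: IdentFlags, remote: IdentFlags) -> None:
    """Raise ValueError if the two peers disagree on lossy vs. lossless."""
    if (local & IdentFlags.LOSSY) != (remote & IdentFlags.LOSSY):
        raise ValueError("peers disagree on lossy/lossless policy")


# A lossy client talking to a server that also expects a lossy session.
check_ident(IdentFlags.LOSSY | IdentFlags.INITIATOR, IdentFlags.LOSSY)
```

The point is only that a mismatch becomes an explicit handshake error 
instead of two Policy objects that happen to agree.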

sage

On Sat, 28 May 2016, Yehuda Sadeh-Weinraub wrote:

> On Fri, May 27, 2016 at 10:37 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > On Fri, 27 May 2016, Yehuda Sadeh-Weinraub wrote:
> >> On Thu, May 26, 2016 at 11:17 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> >> > I wrote up a basic proposal for the new msgr2 protocol:
> >> >
> >> >         http://pad.ceph.com/p/msgr2
> >> >
> >> > It is pretty similar to the current protocol, with a few key changes:
> >> >
> >> > 1. The initial banner has a version number for protocol features supported
> >> > and required.  This will allow optional behavior later.  The current
> >> > protocol doesn't allow this (the banner string is fixed and has to match
> >> > verbatim).
> >> >
> >> > 2. The auth handshake is a low-level msgr exchange now.  This more or less
> >> > matches the MAuth and MAuthReply exchange with the mon.  Also, the
> >> > authenticator/ticket presentation for established clients can be sent here
> >> > as part of this exchange, instead of as part of the msg_connect and
> >> > msg_connect_reply exchange.
> >> >
> >> > 3. The identification of peers during connect is moved to the TAG_IDENT
> >> > stage.  This way it could happen after authentication and/or encryption,
> >> > if we like.  (Not sure it matters.)
> >> >
> >> > 4. Signatures are a separate message now that follows the previous
> >> > message.  If a message doesn't have a signature that follows, it is
> >> > dropped.  Once authenticated we can sign all the other handshake exchanges
> >> > (TAG_IDENT, etc.) as well as the messages themselves.
> >> >
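
For item 1 above, the supported/required check could look roughly like 
this.  It assumes the two banner fields are feature bitmasks, which the 
proposal does not actually say; the function name and the example feature 
bits are hypothetical.

```python
def banner_compatible(my_supported: int, my_required: int,
                      peer_supported: int, peer_required: int) -> bool:
    """Return True if each side supports every feature the other side requires.

    Assumes the banner carries 'supported' and 'required' as bitmasks.
    """
    peer_needs_missing = peer_required & ~my_supported
    i_need_missing = my_required & ~peer_supported
    return peer_needs_missing == 0 and i_need_missing == 0


# Hypothetical feature bits, for illustration only.
FEATURE_BASE = 0x1
FEATURE_EXTRA = 0x2

# We support both features and require only the base one; the peer only has base.
ok = banner_compatible(FEATURE_BASE | FEATURE_EXTRA, FEATURE_BASE,
                       FEATURE_BASE, FEATURE_BASE)
```

Optional behavior is then any bit that is in "supported" on both sides 
but in "required" on neither.
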
> >>
> >> Is there a reason why the signature needs to be a separate message? It
> >> would add extra overhead, and it seems to me that it would complicate
> >> implementation (in terms of message state and such).
> >
> > It doesn't have to be--I was just wanting to keep things simple.  We could
> > similarly make it part of the underlying format, e.g.,
> >
> >  tag byte
> >  8 byte signature
> >  payload
> 
> The signature should come after the payload, but yeah. We might need to
> define an extended envelope to allow future extensions.
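
A minimal sketch of that layout, with the signature trailing the payload. 
Two things here are assumptions and not part of the proposal: the 4-byte 
little-endian length field (needed so the receiver can find the signature) 
and the truncated HMAC-SHA256, which is only a stand-in for whatever the 
auth method really provides.

```python
import hashlib
import hmac
import struct

SIG_LEN = 8  # the thread suggests an 8-byte signature
HEAD_LEN = 5  # 1-byte tag + 4-byte payload length (length field is assumed)


def sign(key: bytes, data: bytes) -> bytes:
    # Stand-in signature: truncated HMAC-SHA256, not the scheme cephx uses.
    return hmac.new(key, data, hashlib.sha256).digest()[:SIG_LEN]


def encode_frame(key: bytes, tag: int, payload: bytes) -> bytes:
    # Layout: tag | payload length | payload | signature over everything before it
    head = struct.pack("<BI", tag, len(payload))
    return head + payload + sign(key, head + payload)


def decode_frame(key: bytes, frame: bytes):
    tag, length = struct.unpack_from("<BI", frame, 0)
    body_end = HEAD_LEN + length
    sig = frame[body_end:body_end + SIG_LEN]
    if not hmac.compare_digest(sig, sign(key, frame[:body_end])):
        raise ValueError("bad signature, dropping frame")
    return tag, frame[HEAD_LEN:body_end]


frame = encode_frame(b"session-key", 7, b"hello")
tag, payload = decode_frame(b"session-key", frame)
```

Putting the signature last means the sender can stream the payload and 
compute the signature as it goes.
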
> 
> >
> > or whatever.  That's basically the same thing, except we save 1 byte.
> >
> >> > 5. The reconnect behavior for stateful connections is a separate
> >> > exchange. This keeps the stateless connections free of clutter.
> >> >
> >> > 6. A few changes in the auth_none and cephx integration will be needed.
> >> > For example, all the current stubs assume that authentication happens over
> >> > MAuth messages and authorization happens in an authorizer blob in
> >> > ceph_msg_connect.  Now both are part of TAG_AUTH_REQUEST, so we'll need to
> >> > multiplex the cephx message blobs. Also, because the IDENT exchange
> >> > happens later, we may need to pass additional info in the auth handshake
> >> > messages (like the peer type, or whatever else is needed).
> >> >
> >> > 7. Lots of messages can go either way, and I tried to avoid a strict
> >> > request/response model so that things could be pipelined, and we'd spend a
> >> > minimal amount of time waiting for a response from the other end.  For
> >> > example,
> >> >
> >> > C:
> >> >  initiates connection
> >> > S:
> >> >  accepts connection
> >> >  -> banner
> >> >  -> TAG_AUTH_METHODS
> >> > C:
> >> >  -> banner
> >> >  -> TAG_AUTH_SET_METHOD
> >> >  -> TAG_AUTH_AUTH_REQUEST
> >> > S:
> >> >  -> TAG_AUTH_REPLY
> >> > C:
> >> >  -> TAG_ENCRYPT_BEGIN
> >> >  -> TAG_IDENT
> >> >  -> TAG_SIGNATURE
> >>
> >> Can we have the client start authenticating with some predetermined
> >> auth params, and have the server respond with AUTH_METHODS only if it
> >> doesn't support the method selected by the client? Even if it isn't
> >> preconfigured, the auth method usually doesn't change across connection
> >> instances, so we can have the client cache that info per server. That
> >> would then be something like this:
> >>
> >> a first connection:
> >>
> >> C:
> >>  initiates connection
> >>  -> banner
> >>  -> TAG_AUTH_GET_METHODS <-- be explicit
> >>  -> TAG_AUTH_SET_METHOD  <-- opportunistically try a specific method type anyway
> >>  -> TAG_AUTH_AUTH_REQUEST
> >>
> >> S:
> >>  accepts connection
> >>  -> banner
> >>  -> TAG_AUTH_REPLY
> >>
> >>
> >> a followup connection:
> >>
> >>
> >> C:
> >>  initiates connection
> >>  -> banner
> >>  -> TAG_AUTH_SET_METHOD
> >>  -> TAG_AUTH_AUTH_REQUEST
> >>
> >> S:
> >>  accepts connection
> >>  -> banner
> >>  -> TAG_AUTH_REPLY
> >
> > Yeah.. or even just make the initial connection try its preferred method
> > and only do the GET_METHODS if it is rejected.
> >
> 
> Right. In any case, the protocol should enable this flexibility.
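
The client-side logic discussed above (try the cached or preferred method 
first, fall back to asking for the list only on rejection) might be 
sketched like this.  The class and callback names are hypothetical: 
try_method stands for SET_METHOD + AUTH_REQUEST followed by an AUTH_REPLY, 
and get_methods stands for GET_METHODS followed by AUTH_METHODS.

```python
from typing import Callable, Dict, List, Optional


class AuthMethodCache:
    """Remembers which auth method last worked for each server (illustrative only)."""

    def __init__(self, preferred: str, supported: List[str]):
        self.preferred = preferred
        self.supported = supported
        self._by_server: Dict[str, str] = {}

    def negotiate(self, server: str,
                  try_method: Callable[[str], bool],
                  get_methods: Callable[[], List[str]]) -> Optional[str]:
        # Optimistic path: no round trip spent on listing methods.
        first = self._by_server.get(server, self.preferred)
        if try_method(first):
            self._by_server[server] = first
            return first
        # Rejected: only now ask the server what it supports.
        for method in get_methods():
            if method in self.supported and method != first and try_method(method):
                self._by_server[server] = method
                return method
        self._by_server.pop(server, None)
        return None
```

On a followup connection to the same server the cached method is tried 
directly, which matches the second exchange quoted above.
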
> 
> 
> > If you do a connect and immediately write a few bytes to the TCP stream,
> > does that actually translate to fewer packets?  I was guessing that the
> > server writing the first bytes of the exchange would be fine but if it
> > speeds things up for the client to optimistically start the exchange too
> > we may as well...
> >
> 
> While I haven't really looked at it recently, I don't think it'd be
> possible to embed data with the SYN packet using the plain vanilla tcp
> implementation. However, I believe that doing connect() and sending
> data immediately following it should improve things, specifically if
> doing async connect (as with the async messenger), but this still
> needs to be proven.
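
For reference, the "send right after connect()" idea amounts to something 
like the sketch below: queue the banner and the first auth frames and hand 
them to the socket in one write instead of waiting for the server's 
banner.  The helper name and the frame bytes are placeholders, and whether 
this actually saves packets or latency is exactly the part that is still 
unproven.

```python
import socket


def send_opening_burst(sock: socket.socket, frames) -> int:
    """Write all opening handshake frames with a single sendall() call.

    Returns the number of bytes handed to the socket.
    """
    burst = b"".join(frames)
    sock.sendall(burst)
    return len(burst)


# Demonstrated on a local socket pair; a real client would call this
# immediately after connect() returns.
client_end, server_end = socket.socketpair()
sent = send_opening_burst(client_end, [b"banner\n", b"\x01set", b"\x02req"])
```
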
> 
> Yehuda
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html