Re: msgr2 protocol

Sage Weil <sweil@xxxxxxxxxx> · Sun, 11 Sep 2016 17:05:21 +0000 (UTC)

On Sat, 10 Sep 2016, Haomai Wang wrote:
> About v1/v2 compatibility, I rethought the details:
> 
> 0. we need to define a new banner which must be longer than the old one ("ceph v027")
> 1. assume the msgr v2 banner is "ceph v2 %64llx %64llx\n"
> 2. in both the simple and async code, the server side must issue its banner first
> 3. if the server side supports v2 and the client only supports v1, the client will
> receive 9 bytes, do a memcmp, and then reject the connection by closing the
> socket. So the server side could retry the older version
> 4. if the server side only supports v1 and the client supports v2, the client
> replies with the banner corresponding to the one it received
> 
> This tricky design relies on two implementation facts, "the accept side
> issues the banner first" and "the new banner is longer than the old banner",
> and it doesn't need to involve other dependencies like mon port
> changes.
> 
> Does this approach have any problems?

I was thinking we avoid this problem and any hacky initial handshakes by 
speaking v2 on the new port and v1 on the old port.  Then the monmap has 
an entity_addrvec_t with both a v1 and v2 address (encoding with just the 
v1 address for old clients). Same for the OSDs.

The v1 handshake just isn't extensible (how do you tell a v2 client 
connecting that you speak both v1 and v2?).
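
Roughly, as a sketch only: the names below are made up for illustration 
and are not the real entity_addrvec_t API, and the v2 port number is a 
placeholder, not a proposal.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Addr:
    proto: str  # "v1" or "v2" (hypothetical tag for the address type)
    host: str
    port: int


def pick_addr(addrvec: List[Addr], supported: List[str]) -> Optional[Addr]:
    """Return the address for the first protocol in `supported`
    (most preferred first) that the daemon advertises."""
    for proto in supported:
        for addr in addrvec:
            if addr.proto == proto:
                return addr
    return None


def encode_for_legacy(addrvec: List[Addr]) -> Optional[Addr]:
    """Old clients only understand a single address, so hand them
    just the v1 entry."""
    return pick_addr(addrvec, ["v1"])


# A mon that speaks v2 on a new port (placeholder number) and v1 on the
# old port.
mon = [Addr("v2", "10.0.0.1", 7000), Addr("v1", "10.0.0.1", 6789)]

new_client_choice = pick_addr(mon, ["v2", "v1"])
old_client_choice = pick_addr(mon, ["v1"])
```

A v2-capable client connects straight to the v2 address and never has to 
probe the banner; a v1-only client never even sees the v2 address.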

sage

> 
> 
> On Sat, Sep 10, 2016 at 11:37 AM, Haomai Wang <haomai@xxxxxxxx> wrote:
> >
> >
> > On Sat, Sep 10, 2016 at 5:14 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> >>
> >> On Sat, 10 Sep 2016, Haomai Wang wrote:
> >> > @sage in the current impl, on a logic fault like a state mismatch, a data
> >> > format mismatch or anything else, the connection aborts the session by
> >> > closing the socket.
> >> > And the peer side would do something according to policy too.
> >> > In msgr v2, when introducing multiple streams in the same connection, we
> >> > can't simply close the socket to indicate something is wrong. I think we
> >> > need to introduce TAG_ABORT with an error message.
> >> >
> >> > But the peer side may be stuck in a state like reading as much data as
> >> > "length" indicates. It may miss the TAG_ABORT notification or another
> >> > reconnect tag.
> >> > A tricky option is to use the tcp OOB bit, which triggers the "urgent"
> >> > signal on receipt, but it only carries 1 byte in the tcp protocol that
> >> > could be used here to indicate the stream id (32 bits as designed now).
> >> >
> >> > What's more, multiple streams mixed within one socket may cause trouble
> >> > for message receiving if a tcp packet has a silent error. So it looks
> >> > like we can't use one socket for multiple streams to meet our demands.
> >> >
> >> > Any idea?
> >>
> >> I think we introduce a TAG_ABORT to interrupt the stream.  And then we
> >> have to assume that the low-level msgr2 implementation that reads and
> >> writes frames (which have their own frame_len) is not buggy.  In practice,
> >> the aborts tend to happen because we get a message we don't understand
> >> (version mismatch, encoding compatibility bug, etc.), and that'll happen
> >> at a higher level after frames have been read... so a TAG_ABORT will be
> >> sufficient.
> >
> >
> > yes, if the frame is ok, it should be ok.... Let's go through this first...
> > The worst case is that the frame length does not match the data actually
> > transferred.
> >
> >>
> >>
> >> Also, we can have an option to make aborts close the socket.  That'll be
> >> fine for now anyway, although later it's probably too disruptive when
> >> multiple streams are sharing a socket...
> >>
> >> sage
> >>
> >>
> >>  >
> >> >
> >> >
> >> > On Mon, Jun 13, 2016 at 7:59 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> >> >       On Sat, 11 Jun 2016, Marcus Watts wrote:
> >> >       > If the client doesn't look at "features" before it sends stuff,
> >> >       > it will not be able to be very smart about taking advantage of
> >> >       > some future better method.  In fact, there isn't much advantage
> >> >       > to the server sending anything early - it could just as easily
> >> >       > wait until after it's seen the client's request.
> >> >       >
> >> >       > Failing hard & retrying on a failed reconnect is going to be
> >> >       > slower.
> >> >       > On the bright side, at least it shouldn't happen often.
> >> >
> >> >       Yep.  Well, I think it is the client's (limited) choice.  If it
> >> >       needs to know the server features, it needs to either wait for
> >> >       them, or make some optimistic choice and be prepared to pay the
> >> >       cost of a mistake.  We should give the client choice, though, if
> >> >       we can.
> >> >
> >> >       > If you're sending encryption (w/ different auth or keys) from
> >> >       > several different streams, how are you planning to indicate
> >> >       > which bits go with which scheme?, and which bits are you
> >> >       > planning to encrypt and which not?
> >> >
> >> >       This is what the stream ids are for, and why the outer portion of
> >> >       the frame is unencrypted.  See
> >> >
> >> >
> >> >       https://github.com/ceph/ceph/pull/9461/files#diff-83789b4be697d82eedbcbe330c44b436R68
> >> >
> >> >        +  stream_id (le32)
> >> >        +  frame_len (le32)
> >> >        +  tag (TAG_* byte)
> >> >        +  payload
> >> >        +  [payload padding -- only present after stream auth phase]
> >> >        +  [signature -- only present after stream auth phase]
> >> >
> >> >       The tag and payload (and padding) would be encrypted or signed,
> >> >       but not the stream id and frame_len.
> >> >
> >> >       > Byte count limits.  Basically, you don't want collisions
> >> >       > because of duplicated keys or data.  This depends on your
> >> >       > crypto system, so, for instance, you should not encrypt with
> >> >       > one key more than
> >> >       >       aes, cbc        about 2^68 bytes
> >> >       >       aes, ctr        exactly 2^128 bytes
> >> >       > more generally, this depends on mode, blocksize, ...
> >> >       > This applies across *all* uses of the key - and so you would
> >> >       > generally want to use the session key directly as little as
> >> >       > possible.
> >> >       > (in particular, using the session key for ctr directly would
> >> >       > be very very bad.)
> >> >       >
> >> >       > If you've got multiple streams going already, you should be
> >> >       > able to include a fairly simple rekey method with little effort.
> >> >       > For instance, as part of the method, you could,
> >> >       >       up front as part of the method
> >> >       >               send a per-stream key encrypted under the shared
> >> >       >               secret.
> >> >       >       prepend to the first data sent in a payload
> >> >       >               byte limit, stream key #0 (encrypted under the
> >> >       >               per-stream key)
> >> >       >               then encrypt the next N bytes with stream key #0
> >> >       >       when the byte limit is reached, prepend to the
> >> >       >               next data sent in a payload
> >> >       >               byte limit, stream key #1 (encrypted under the
> >> >       >               per-stream key)
> >> >       >               then encrypt the next N bytes with stream key #1
> >> >       >       &etc.
> >> >
> >> >       Good idea.  If I understand correctly, it means that the
> >> >       session_key is only used to send the new/next random encryption
> >> >       key, and if we make the byte limit part of the initial protocol
> >> >       we get the rotation we need.  It might be simpler to do it as a
> >> >       frame limit instead of byte limit, and assume max-length frames
> >> >       (2^32 bytes).  We could still be super conservative and rotate
> >> >       the encryption key every 2^16 messages or something...?  And
> >> >       rotating the key on frame boundaries should be much simpler to
> >> >       implement.
> >> >
> >> >       Anyway, that part can be defined a bit later, I think.
> >> >
> >> >       Thanks!
> >> >       sage
> >> >
> >> >
> >> >
> >> >
> >
> >
> 
> 