Re: msgr2 protocol

On Fri, May 27, 2016 at 12:41 PM, Haomai Wang <haomai@xxxxxxxx> wrote:
> On Fri, May 27, 2016 at 2:17 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> I wrote up a basic proposal for the new msgr2 protocol:
>>
>>         http://pad.ceph.com/p/msgr2
>>
>> It is pretty similar to the current protocol, with a few key changes:
>>
>> 1. The initial banner has a version number for protocol features supported
>> and required.  This will allow optional behavior later.  The current
>> protocol doesn't allow this (the banner string is fixed and has to match
>> verbatim).
>
> Does msgr2 need to talk with v1 peers? Or do we just reject the handshake?
>
> If we reject v1, is it possible to give us a chance to reset the message version?
>
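A minimal sketch of how a supported/required feature check from the banner could work. The struct layout, field widths, and function names are assumptions for illustration only; the proposal does not define them here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical banner payload: names and layout are assumptions,
 * not taken from the msgr2 proposal. */
struct banner_features {
    uint64_t supported; /* features this side is able to use */
    uint64_t required;  /* features the peer must support */
};

/* The handshake can proceed only if each side supports everything
 * the other side requires. */
static bool banner_compatible(const struct banner_features *local,
                              const struct banner_features *peer)
{
    assert(local && peer);
    return (peer->required & ~local->supported) == 0 &&
           (local->required & ~peer->supported) == 0;
}

/* Optional behavior may be enabled for features both sides support. */
static uint64_t banner_common(const struct banner_features *local,
                              const struct banner_features *peer)
{
    assert(local && peer);
    return local->supported & peer->supported;
}
```

With a fixed banner string, none of this is possible; with two bitmasks, a peer that lacks a required bit can be rejected cleanly while optional bits are simply masked off.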
>>
>> 2. The auth handshake is a low-level msgr exchange now.  This more or less
>> matches the MAuth and MAuthReply exchange with the mon.  Also, the
>> authenticator/ticket presentation for established clients can be sent here
>> as part of this exchange, instead of as part of the msg_connect and
>> msg_connect_reply exchange.
>
> S: TAG_AUTH_METHODS          # list methods
>     __le32 num_methods;
>     __le32 methods[num_methods];   // CEPH_AUTH_{NONE, CEPHX}
>
> From my view, it looks like we need to force a method instead of letting
> the peer side select? What's the use case for allowing the client side to
> decide the method?
>
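One way to read the TAG_AUTH_METHODS exchange, sketched under assumptions: the server advertises a list, and the client picks the first entry of its own preference list that the server offers. A server that wants to force a method can advertise exactly one. The numeric method values and the function names below are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed numeric values, for illustration only. */
enum { AUTH_UNKNOWN = 0, AUTH_NONE = 1, AUTH_CEPHX = 2 };

/* Decode a little-endian 32-bit value regardless of host byte order. */
static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* payload = __le32 num_methods, then num_methods __le32 method ids.
 * Returns the first preferred method the server offers, or AUTH_UNKNOWN
 * if the payload is truncated or there is no overlap. */
static uint32_t pick_auth_method(const uint8_t *payload, size_t len,
                                 const uint32_t *prefs, size_t nprefs)
{
    assert(payload && prefs);
    if (len < 4)
        return AUTH_UNKNOWN;
    uint32_t n = get_le32(payload);
    if (n > (len - 4) / 4)
        return AUTH_UNKNOWN; /* count exceeds the bytes we received */
    for (size_t i = 0; i < nprefs; i++)
        for (uint32_t j = 0; j < n; j++)
            if (get_le32(payload + 4 + 4 * (size_t)j) == prefs[i])
                return prefs[i];
    return AUTH_UNKNOWN;
}
```

The chosen value would then go back to the server in TAG_AUTH_SET_METHOD.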
>>
>> 3. The identification of peers during connect is moved to the TAG_IDENT
>> stage.  This way it could happen after authentication and/or encryption,
>> if we like.  (Not sure it matters.)
>
> C or S: TAG_ENCRYPT_BEGIN    # signal that all subsequent traffic will
> be encrypted
>
> __le32 len
>
> <method specific payload>
>
> Do we also need an encryption info handshake, e.g. for key/algorithm?
>
>>
>> 4. Signatures are a separate message now that follows the previous
>> message.  If a message doesn't have a signature that follows, it is
>> dropped.  Once authenticated we can sign all the other handshake exchanges
>> (TAG_IDENT, etc.) as well as the messages themselves.
>>
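The "signature follows the message" rule in point 4 could be handled on the receive side roughly as below. This is only a sketch: the tag values, the state struct, and the idea of passing in a precomputed `sig_valid` flag are assumptions, and real signature verification is out of scope.

```c
#include <assert.h>
#include <stdbool.h>

/* Tag names come from the proposal; the numeric values are made up. */
enum frame_tag { TAG_MSG = 1, TAG_SIGNATURE = 2 };

struct rx_state {
    bool pending;       /* a frame is held, waiting for its signature */
    unsigned delivered; /* frames passed up after a valid signature */
    unsigned dropped;   /* frames discarded (unsigned or bad signature) */
};

/* Feed one incoming frame into the receiver. sig_valid is only
 * meaningful when tag == TAG_SIGNATURE. */
static void rx_frame(struct rx_state *s, enum frame_tag tag, bool sig_valid)
{
    assert(s);
    if (tag == TAG_SIGNATURE) {
        if (s->pending) {
            if (sig_valid)
                s->delivered++;
            else
                s->dropped++;
            s->pending = false;
        }
        return; /* a stray signature with nothing pending is ignored */
    }
    /* A new frame arrived while the previous one was never signed. */
    if (s->pending)
        s->dropped++;
    s->pending = true;
}
```

Holding exactly one pending frame keeps the receiver simple, at the cost of delaying delivery until the signature frame arrives.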
>> 5. The reconnect behavior for stateful connections is a separate
>> exchange. This keeps the stateless connections free of clutter.
>
> It will be a big task ......
>
>>
>> 6. A few changes in the auth_none and cephx integration will be needed.
>> For example, all the current stubs assume that authentication happens over
>> MAuth message and authorization happens in an authorizer blob in
>> ceph_msg_connect.  Now both are part of TAG_AUTH_REQUEST, so we'll need to
>> multiplex the cephx message blobs. Also, because the IDENT exchange
>> happens later, we may need to pass additional info in the auth handshake
>> messages (like the peer type, or whatever else is needed).
>
> Hmm, do we only need the peer type? If the address is needed, the IDENT
> stage must happen before auth.
>
>>
>> 7. Lots of messages can go either way, and I tried to avoid a strict
>> request/response model so that things could be pipelined, and we'd spend a
>> minimal amount of time waiting for a response from the other end.  For
>> example,
>>
>> C:
>>  initiates connection
>> S:
>>  accepts connection
>>  -> banner
>>  -> TAG_AUTH_METHODS
>> C:
>>  -> banner
>>  -> TAG_AUTH_SET_METHOD
>>  -> TAG_AUTH_REQUEST
>> S:
>>  -> TAG_AUTH_REPLY
>> C:
>>  -> TAG_ENCRYPT_BEGIN
>>  -> TAG_IDENT
>>  -> TAG_SIGNATURE
>> S:
>>  -> TAG_ENCRYPT_BEGIN
>>  -> TAG_IDENT
>>  -> TAG_SIGNATURE
>> C:
>>  -> TAG_START
>>  -> TAG_SIGNATURE
>>  -> TAG_MSG
>>  -> TAG_SIGNATURE
>>     ...
>> S:
>>  -> TAG_MSG
>>  -> TAG_SIGNATURE
>>     ...
>>
>> Comments, please!  The exchange is a bit less structured as far as who
>> sends what message, with the idea that we could pipeline a lot of it, but
>> it may end up being too ambiguous.  Let me know what you think...

We could also add ack_seq to ceph_msg_header to avoid the extra ack
tag (1+8 bytes). For heavy client IO or repop, ack aggregation could help
reduce kernel CPU utilization a lot.
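A rough sketch of the piggybacked ack idea: every outgoing header carries the highest sequence number received so far, and a standalone ack is only needed when nothing is being sent back. The struct and function names are hypothetical and only the two relevant header fields are shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Only the fields relevant to acking; not the real ceph_msg_header. */
struct hdr_sketch {
    uint64_t seq;     /* sequence number of this message */
    uint64_t ack_seq; /* highest peer sequence number we have received */
};

struct session_sketch {
    uint64_t out_seq;   /* last sequence number we sent */
    uint64_t in_seq;    /* last sequence number we received */
    uint64_t acked_seq; /* last in_seq we told the peer about */
};

/* Stamp an outgoing header; the ack rides along at no extra cost. */
static void fill_header(struct session_sketch *s, struct hdr_sketch *h)
{
    assert(s && h);
    h->seq = ++s->out_seq;
    h->ack_seq = s->in_seq;
    s->acked_seq = s->in_seq;
}

/* True if received messages remain unacknowledged, e.g. when an idle
 * timer fires and there is no outgoing message to carry the ack. */
static bool need_standalone_ack(const struct session_sketch *s)
{
    assert(s);
    return s->in_seq > s->acked_seq;
}
```

Under bidirectional load, such as client IO with replies, nearly every ack would be absorbed into a header that is being written anyway.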

>
> We may also change ceph_msg_header/ceph_msg_footer:
>
> struct ceph_msg_header {
>         __le64 seq;       /* message seq# for this session */
>         __le64 tid;       /* transaction id */
>         __le16 type;      /* message type */
>         __le16 priority;  /* priority.  higher value == higher priority */
>         __le16 version;   /* version of message encoding */
>
>         __le32 front_len; /* bytes in main payload */
>         __le32 middle_len;/* bytes in middle payload */
>         __le32 data_len;  /* bytes of data payload */
>         __le16 data_off;  /* sender: include full offset;
>                              receiver: mask against ~PAGE_MASK */
>
>         struct ceph_entity_name src;
>
>         /* oldest code we think can decode this.  unknown if zero. */
>         __le16 compat_version;
>         __le16 reserved;
>         __le32 crc;       /* header crc32c */
> } __attribute__ ((packed));
>
> We may drop middle_len and src.
>
> And could we drop the footer and move the crc to the header? For each
> message, we always add a system call for the footer since it can't be
> prefetched into userspace memory. Most RPC implementations only add a
> header to the actual data.
>
>>
>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html