On Fri, May 27, 2016 at 12:41 PM, Haomai Wang <haomai@xxxxxxxx> wrote:
> On Fri, May 27, 2016 at 2:17 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> I wrote up a basic proposal for the new msgr2 protocol:
>>
>>  http://pad.ceph.com/p/msgr2
>>
>> It is pretty similar to the current protocol, with a few key changes:
>>
>> 1. The initial banner has a version number for protocol features supported
>> and required.  This will allow optional behavior later.  The current
>> protocol doesn't allow this (the banner string is fixed and has to match
>> verbatim).
>
> Does msgr v2 need to talk with a v1 peer? Or do we just reject this handshake?
>
> If we reject v1, is it possible to give us a chance to reset the message
> version?
>
>>
>> 2. The auth handshake is a low-level msgr exchange now.  This more or less
>> matches the MAuth and MAuthReply exchange with the mon.  Also, the
>> authenticator/ticket presentation for established clients can be sent here
>> as part of this exchange, instead of as part of the msg_connect and
>> msg_connect_reply exchange.
>
> S: TAG_AUTH_METHODS  # list methods
>   __le32 num_methods;
>   __le32 methods[num_methods];   // CEPH_AUTH_{NONE, CEPHX}
>
> From my view, it looks like we need to force a method instead of letting
> the peer side select one? What's the use case where we allow the client
> side to decide the method?
>
>>
>> 3. The identification of peers during connect is moved to the TAG_IDENT
>> stage.  This way it could happen after authentication and/or encryption,
>> if we like.  (Not sure it matters.)
>
> C or S: TAG_ENCRYPT_BEGIN  # signal that all subsequent traffic will
> be encrypted
>
>   __le32 len
>
>   <method specific payload>
>
> Do we also need an encryption info handshake, like key/algorithm?
>
>>
>> 4. Signatures are a separate message now that follows the previous
>> message.  If a message doesn't have a signature that follows, it is
>> dropped.  Once authenticated we can sign all the other handshake exchanges
>> (TAG_IDENT, etc.) as well as the messages themselves.
>>
>> 5. The reconnect behavior for stateful connections is a separate
>> exchange.  This keeps the stateless connections free of clutter.
>
> It will be a big task ......
>
>>
>> 6. A few changes in the auth_none and cephx integration will be needed.
>> For example, all the current stubs assume that authentication happens over
>> MAuth message and authorization happens in an authorizer blob in
>> ceph_msg_connect.  Now both are part of TAG_AUTH_REQUEST, so we'll need to
>> multiplex the cephx message blobs.  Also, because the IDENT exchange
>> happens later, we may need to pass additional info in the auth handshake
>> messages (like the peer type, or whatever else is needed).
>
> Hmm, do we only need the peer type? If the address is needed, the IDENT
> stage must happen before auth.
>
>>
>> 7. Lots of messages can go either way, and I tried to avoid a strict
>> request/response model so that things could be pipelined, and we'd spend a
>> minimal amount of time waiting for a response from the other end.  For
>> example,
>>
>>  C:
>>   initiates connection
>>  S:
>>   accepts connection
>>   -> banner
>>   -> TAG_AUTH_METHODS
>>  C:
>>   -> banner
>>   -> TAG_AUTH_SET_METHOD
>>   -> TAG_AUTH_AUTH_REQUEST
>>  S:
>>   -> TAG_AUTH_REPLY
>>  C:
>>   -> TAG_ENCRYPT_BEGIN
>>   -> TAG_IDENT
>>   -> TAG_SIGNATURE
>>  S:
>>   -> TAG_ENCRYPT_BEGIN
>>   -> TAG_IDENT
>>   -> TAG_SIGNATURE
>>  C:
>>   -> TAG_START
>>   -> TAG_SIGNATURE
>>   -> TAG_MSG
>>   -> TAG_SIGNATURE
>>   ...
>>  S:
>>   -> TAG_MSG
>>   -> TAG_SIGNATURE
>>   ...
>>
>> Comments, please!  The exchange is a bit less structured as far as who
>> sends what message, with the idea that we could pipeline a lot of it, but
>> it may end up being too ambiguous.  Let me know what you think...

We could also add ack_seq to ceph_msg_header to avoid the extra ack
tag (1+8 bytes). For heavy client IO or repop, ack aggregation could help
reduce kernel CPU utilization a lot.
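
A rough sketch of the ack_seq idea above, not the actual Ceph code. The
header is trimmed to two fields, and sketch_msg_header, sketch_conn and
the sketch_* helpers are names made up for illustration. It assumes
in-order delivery on a single session. Each outgoing message carries the
highest seq received so far, so a separate ack tag (1 byte tag + 8 byte
seq) is only needed when there is no outbound traffic to piggyback on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical trimmed header: only the fields needed for this sketch. */
struct sketch_msg_header {
        uint64_t seq;      /* sender's sequence number for this message */
        uint64_t ack_seq;  /* highest peer seq received so far (piggybacked ack) */
};

/* Hypothetical per-connection state. */
struct sketch_conn {
        uint64_t out_seq;       /* last seq assigned to an outgoing message */
        uint64_t in_seq;        /* highest seq received from the peer */
        uint64_t in_seq_acked;  /* highest in_seq already reported to the peer */
        uint64_t peer_acked;    /* highest of our seqs the peer has acknowledged */
};

/* Record an incoming message and learn what the peer has acked. */
void sketch_on_receive(struct sketch_conn *c, const struct sketch_msg_header *h)
{
        assert(h->seq > c->in_seq);   /* assumption: in-order delivery */
        c->in_seq = h->seq;
        if (h->ack_seq > c->peer_acked)
                c->peer_acked = h->ack_seq;
}

/* Fill the header for an outgoing message; the ack rides along with it. */
void sketch_prepare_send(struct sketch_conn *c, struct sketch_msg_header *h)
{
        h->seq = ++c->out_seq;
        h->ack_seq = c->in_seq;
        c->in_seq_acked = c->in_seq;
}

/* A standalone ack is only needed if something was received since the
 * last message we sent. */
bool sketch_needs_standalone_ack(const struct sketch_conn *c)
{
        return c->in_seq > c->in_seq_acked;
}
```

With steady two-way traffic, sketch_needs_standalone_ack() returns false
after every send, so no separate small ack write would be issued in that
case.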
>
> We may also change ceph_msg_header/ceph_msg_footer:
>
> struct ceph_msg_header {
>         __le64 seq;        /* message seq# for this session */
>         __le64 tid;        /* transaction id */
>         __le16 type;       /* message type */
>         __le16 priority;   /* priority.  higher value == higher priority */
>         __le16 version;    /* version of message encoding */
>
>         __le32 front_len;  /* bytes in main payload */
>         __le32 middle_len; /* bytes in middle payload */
>         __le32 data_len;   /* bytes of data payload */
>         __le16 data_off;   /* sender: include full offset;
>                               receiver: mask against ~PAGE_MASK */
>
>         struct ceph_entity_name src;
>
>         /* oldest code we think can decode this.  unknown if zero. */
>         __le16 compat_version;
>         __le16 reserved;
>         __le32 crc;        /* header crc32c */
> } __attribute__ ((packed));
>
> We may drop the middle_len and src fields.
>
> And could we drop the footer and move its crc to the header? For each
> message we always add a system call for the footer, since it can't be
> prefetched into userspace memory. Most RPC implementations only add a
> header to the actual data.
>
>>
>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html