On Saturday, August 08/10/19, 2019 at 02:05:00 +0530, Tom Talpey wrote: > On 8/9/2019 9:52 AM, Bernard Metzler wrote: > > -----"Tom Talpey" <tom@xxxxxxxxxx> wrote: ----- > > > >> To: "Bernard Metzler" <BMT@xxxxxxxxxxxxxx>, "Krishnamraju Eraparaju" > >> <krishna2@xxxxxxxxxxx> > >> From: "Tom Talpey" <tom@xxxxxxxxxx> > >> Date: 08/09/2019 02:27PM > >> Cc: "jgg@xxxxxxxx" <jgg@xxxxxxxx>, "linux-rdma@xxxxxxxxxxxxxxx" > >> <linux-rdma@xxxxxxxxxxxxxxx>, "Potnuri Bharat Teja" > >> <bharat@xxxxxxxxxxx>, "Nirranjan Kirubaharan" <nirranjan@xxxxxxxxxxx> > >> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler tries > >> to read beyond MPA message > >> > >> On 8/9/2019 6:29 AM, Bernard Metzler wrote: > >>> -----"Krishnamraju Eraparaju" <krishna2@xxxxxxxxxxx> wrote: ----- > >>> > >>>> To: "Bernard Metzler" <BMT@xxxxxxxxxxxxxx> > >>>> From: "Krishnamraju Eraparaju" <krishna2@xxxxxxxxxxx> > >>>> Date: 08/08/2019 06:47PM > >>>> Cc: "Tom Talpey" <tom@xxxxxxxxxx>, "jgg@xxxxxxxx" <jgg@xxxxxxxx>, > >>>> "linux-rdma@xxxxxxxxxxxxxxx" <linux-rdma@xxxxxxxxxxxxxxx>, > >> "Potnuri > >>>> Bharat Teja" <bharat@xxxxxxxxxxx>, "Nirranjan Kirubaharan" > >>>> <nirranjan@xxxxxxxxxxx> > >>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler > >> tries > >>>> to read beyond MPA message > >>>> > >>>> On Thursday, August 08/08/19, 2019 at 15:05:12 +0000, Bernard > >> Metzler > >>>> wrote: > >>>>> -----"Krishnamraju Eraparaju" <krishna2@xxxxxxxxxxx> wrote: ----- > >>>>> > >>>>>> To: "Tom Talpey" <tom@xxxxxxxxxx>, <BMT@xxxxxxxxxxxxxx> > >>>>>> From: "Krishnamraju Eraparaju" <krishna2@xxxxxxxxxxx> > >>>>>> Date: 08/05/2019 07:26PM > >>>>>> Cc: "jgg@xxxxxxxx" <jgg@xxxxxxxx>, "linux-rdma@xxxxxxxxxxxxxxx" > >>>>>> <linux-rdma@xxxxxxxxxxxxxxx>, "Potnuri Bharat Teja" > >>>>>> <bharat@xxxxxxxxxxx>, "Nirranjan Kirubaharan" > >>>> <nirranjan@xxxxxxxxxxx> > >>>>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler > >>>> tries > >>>>>> to read beyond MPA message > >>>>>> > >>>>>> On Friday, August 08/02/19, 2019 at 18:17:37 +0530, Tom Talpey > >>>> wrote: > >>>>>>> On 8/2/2019 7:18 AM, Bernard Metzler wrote: > >>>>>>>> -----"Tom Talpey" <tom@xxxxxxxxxx> wrote: ----- > >>>>>>>> > >>>>>>>>> To: "Bernard Metzler" <BMT@xxxxxxxxxxxxxx>, "Krishnamraju > >>>>>> Eraparaju" > >>>>>>>>> <krishna2@xxxxxxxxxxx> > >>>>>>>>> From: "Tom Talpey" <tom@xxxxxxxxxx> > >>>>>>>>> Date: 08/01/2019 08:54PM > >>>>>>>>> Cc: jgg@xxxxxxxx, linux-rdma@xxxxxxxxxxxxxxx, > >>>>>> bharat@xxxxxxxxxxx, > >>>>>>>>> nirranjan@xxxxxxxxxxx, krishn2@xxxxxxxxxxx > >>>>>>>>> Subject: [EXTERNAL] Re: [PATCH for-rc] siw: MPA Reply handler > >>>>>> tries > >>>>>>>>> to read beyond MPA message > >>>>>>>>> > >>>>>>>>> On 8/1/2019 6:56 AM, Bernard Metzler wrote: > >>>>>>>>>> -----"Krishnamraju Eraparaju" <krishna2@xxxxxxxxxxx> wrote: > >>>>>> ----- > >>>>>>>>>> > >>>>>>>>>>> To: jgg@xxxxxxxx, bmt@xxxxxxxxxxxxxx > >>>>>>>>>>> From: "Krishnamraju Eraparaju" <krishna2@xxxxxxxxxxx> > >>>>>>>>>>> Date: 07/31/2019 12:34PM > >>>>>>>>>>> Cc: linux-rdma@xxxxxxxxxxxxxxx, bharat@xxxxxxxxxxx, > >>>>>>>>>>> nirranjan@xxxxxxxxxxx, krishn2@xxxxxxxxxxx, "Krishnamraju > >>>>>>>>> Eraparaju" > >>>>>>>>>>> <krishna2@xxxxxxxxxxx> > >>>>>>>>>>> Subject: [EXTERNAL] [PATCH for-rc] siw: MPA Reply handler > >>>>>> tries to > >>>>>>>>>>> read beyond MPA message > >>>>>>>>>>> > >>>>>>>>>>> while processing MPA Reply, SIW driver is trying to read > >>>> extra > >>>>>> 4 > >>>>>>>>>>> bytes > >>>>>>>>>>> than what peer has advertised as private data length. > >>>>>>>>>>> > >>>>>>>>>>> If a FPDU data is received before even siw_recv_mpa_rr() > >>>>>> completed > >>>>>>>>>>> reading MPA reply, then ksock_recv() in siw_recv_mpa_rr() > >>>>>> could > >>>>>>>>> also > >>>>>>>>>>> read FPDU, if "size" is larger than advertised MPA reply > >>>>>> length. > >>>>>>>>>>> > >>>>>>>>>>> 501 static int siw_recv_mpa_rr(struct siw_cep *cep) > >>>>>>>>>>> 502 { > >>>>>>>>>>> ............. > >>>>>>>>>>> 572 > >>>>>>>>>>> 573 if (rcvd > to_rcv) > >>>>>>>>>>> 574 return -EPROTO; <----- Failure here > >>>>>>>>>>> > >>>>>>>>>>> Looks like the intention here is to throw an ERROR if the > >>>>>> received > >>>>>>>>>>> data > >>>>>>>>>>> is more than the total private data length advertised by > >>>> the > >>>>>> peer. > >>>>>>>>>>> But > >>>>>>>>>>> reading beyond MPA message causes siw_cm to generate > >>>>>>>>>>> RDMA_CM_EVENT_CONNECT_ERROR event when TCP socket recv > >>>> buffer > >>>>>> is > >>>>>>>>>>> already > >>>>>>>>>>> queued with FPDU messages. > >>>>>>>>>>> > >>>>>>>>>>> Hence, this function should only read upto private data > >>>>>> length. > >>>>>>>>>>> > >>>>>>>>>>> Signed-off-by: Krishnamraju Eraparaju > >>>> <krishna2@xxxxxxxxxxx> > >>>>>>>>>>> --- > >>>>>>>>>>> drivers/infiniband/sw/siw/siw_cm.c | 4 ++-- > >>>>>>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-) > >>>>>>>>>>> > >>>>>>>>>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c > >>>>>>>>>>> b/drivers/infiniband/sw/siw/siw_cm.c > >>>>>>>>>>> index a7cde98e73e8..8dc8cea2566c 100644 > >>>>>>>>>>> --- a/drivers/infiniband/sw/siw/siw_cm.c > >>>>>>>>>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c > >>>>>>>>>>> @@ -559,13 +559,13 @@ static int siw_recv_mpa_rr(struct > >>>>>> siw_cep > >>>>>>>>> *cep) > >>>>>>>>>>> * A private data buffer gets allocated if > >>>> hdr->params.pd_len > >>>>>> != > >>>>>>>>> 0. > >>>>>>>>>>> */ > >>>>>>>>>>> if (!cep->mpa.pdata) { > >>>>>>>>>>> - cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL); > >>>>>>>>>>> + cep->mpa.pdata = kmalloc(pd_len, GFP_KERNEL); > >>>>>>>>>>> if (!cep->mpa.pdata) > >>>>>>>>>>> return -ENOMEM; > >>>>>>>>>>> } > >>>>>>>>>>> rcvd = ksock_recv( > >>>>>>>>>>> s, cep->mpa.pdata + cep->mpa.bytes_rcvd - sizeof(struct > >>>>>> mpa_rr), > >>>>>>>>>>> - to_rcv + 4, MSG_DONTWAIT); > >>>>>>>>>>> + to_rcv, MSG_DONTWAIT); > >>>>>>>>>>> > >>>>>>>>>>> if (rcvd < 0) > >>>>>>>>>>> return rcvd; > >>>>>>>>>>> -- > >>>>>>>>>>> 2.23.0.rc0 > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> The intention of this code is to make sure the > >>>>>>>>>> peer does not violates the MPA handshake rules. > >>>>>>>>>> The initiator MUST NOT send extra data after its > >>>>>>>>>> MPA request and before receiving the MPA reply. > >>>>>>>>> > >>>>>>>>> I think this is true only for MPA v2. With MPA v1, the > >>>>>>>>> initiator can begin sending immediately (before receiving > >>>>>>>>> the MPA reply), because there is no actual negotiation at > >>>>>>>>> the MPA layer. > >>>>>>>>> > >>>>>>>>> With MPA v2, the negotiation exchange is required because > >>>>>>>>> the type of the following message is predicated by the > >>>>>>>>> "Enhanced mode" a|b|c|d flags present in the first 32 bits > >>>>>>>>> of the private data buffer. > >>>>>>>>> > >>>>>>>>> So, it seems to me that additional logic is needed here to > >>>>>>>>> determine the MPA version, before sniffing additional octets > >>>>>>>> >from the incoming stream? > >>>>>>>>> > >>>>>>>>> Tom. > >>>>>>>>> > >>>>>>>> > >>>>>>>> There still is the marker negotiation taking place. > >>>>>>>> RFC 5044 says in section 7.1.2: > >>>>>>>> > >>>>>>>> "Note: Since the receiver's ability to deal with Markers is > >>>>>>>> unknown until the Request and Reply Frames have been > >>>>>>>> received, sending FPDUs before this occurs is not > >> possible." > >>>>>>>> > >>>>>>>> This section further discusses the responder's behavior, > >>>>>>>> where it MUST receive a first FPDU from the initiator > >>>>>>>> before sending its first FPDU: > >>>>>>>> > >>>>>>>> "4. MPA Responder mode implementations MUST receive and > >>>> validate > >>>>>> at > >>>>>>>> least one FPDU before sending any FPDUs or Markers." > >>>>>>>> > >>>>>>>> So it appears with MPA version 1, the current siw > >>>>>>>> code is correct. The initiator is entering FPDU mode > >>>>>>>> first, and only after receiving the MPA reply frame. > >>>>>>>> Only after the initiator sent a first FPDU, the responder > >>>>>>>> can start using the connection in FPDU mode. > >>>>>>>> Because of this somehow broken connection establishment > >>>>>>>> procedure (only the initiator can start sending data), a > >>>>>>>> later MPA version makes this first FPDU exchange explicit > >>>>>>>> and selectable (zero length READ/WRITE/Send). > >>>>>>> > >>>>>>> Yeah, I guess so. Because nobody ever actually implemented > >>>> markers, > >>>>>>> I think that they may more or less passively ignore this. But > >>>>>> you're > >>>>>>> currect that it's invalid protocol behavior. > >>>>>>> > >>>>>>> If your testing didn't uncover any issues with existing > >>>>>> implementations > >>>>>>> failing to connect with your strict checking, I'm ok with it. > >>>>>> Tom & Bernard, > >>>>>> Thanks for the insight on MPA negotiation. > >>>>>> > >>>>>> Could the below patch be considered as a proper fix? > >>>>>> > >>>>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c > >>>>>> b/drivers/infiniband/sw/siw/siw_cm.c > >>>>>> index 9ce8a1b925d2..0aec1b5212f9 100644 > >>>>>> --- a/drivers/infiniband/sw/siw/siw_cm.c > >>>>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c > >>>>>> @@ -503,6 +503,7 @@ static int siw_recv_mpa_rr(struct siw_cep > >>>> *cep) > >>>>>> struct socket *s = cep->sock; > >>>>>> u16 pd_len; > >>>>>> int rcvd, to_rcv; > >>>>>> + int extra_data_check = 4; /* 4Bytes, for MPA rules > >>>> violation > >>>>>> checking */ > >>>>>> > >>>>>> if (cep->mpa.bytes_rcvd < sizeof(struct mpa_rr)) { > >>>>>> rcvd = ksock_recv(s, (char *)hdr + > >>>>>> cep->mpa.bytes_rcvd, > >>>>>> @@ -553,23 +554,37 @@ static int siw_recv_mpa_rr(struct siw_cep > >>>> *cep) > >>>>>> return -EPROTO; > >>>>>> } > >>>>>> > >>>>>> + /* > >>>>>> + * Peer must not send any extra data other than MPA > >>>> messages > >>>>>> until MPA > >>>>>> + * negotiation is completed, an exception is MPA V2 > >>>>>> client-server Mode, > >>>>>> + * IE, in this mode the peer after sending MPA Reply can > >>>>>> immediately > >>>>>> + * start sending data in RDMA mode. > >>>>>> + */ > >>>>> > >>>>> This is unfortunately not true. The responder NEVER sends > >>>>> an FPDU without having seen an FPDU from the initiator. > >>>>> I just checked RFC 6581 again. The RTR (ready-to-receive) > >>>>> indication from the initiator is still needed, but now > >>>>> provided by the protocol and not the application: w/o > >>>>> 'enhanced MPA setup', the initiator sends the first > >>>>> FPDU as an application message. With 'enhanced MPA setup', > >>>>> the initiator protocol entity sends (w/o application > >>>>> interaction) a zero length READ/WRITE/Send as a first FPDU, > >>>>> as previously negotiated with the responder. Again: only > >>>>> after the responder has seen the first FPDU, it can > >>>>> start sending in FPDU mode. > >>>> If I understand your statements correctly: in MPA v2 clint-server > >>>> mode, > >>>> the responder application should have some logic to wait(after > >>>> ESTABLISH > >>>> event) until the first FPDU message is received from the > >> initiator? > >>>> > >>>> Thanks, > >>>> Krishna. > >>> > >>> The responder may either delay the established event > >>> until it sees the first peer FPDU, or it delays sending > >>> the first (now already posted) FPDU, until it sees the > >>> first initiator FPDU. This, since (theoretically), the > >>> responder does not know yet the result of CRC and > >>> Marker negotiation. siw delays the established event, > >>> which seems the more clear path to me. > >>> > >>> This MPA handshake is cumbersome. siw is permissive > >>> for violations of these rules to the extend, that the > >>> responder shall not send back to back the first FPDU > >>> with the MPA reply. If a (Chelsio ?) iWarp adapter > >>> would send the first responder FPDU only after it > >>> got TCP acked its MPA reply, the chances are high the > >>> (siw) iWarp initiator has already transitioned to RTS > >>> and it would accept that FPU, even if not having sent > >>> it's first FPDU. > >>> > >>> siw sticks to this rule to that extend, since all receive > >>> processing is triggered by socket callbacks (from > >>> sk->sk_data_ready()). As long as the QP is not > >>> in RTS, all TCP bytestream processing is done by the > >>> MPA protocol (code in siw_cm.c). If the QP reaches > >>> RTS, RX processing is done bt the full iWarp > >>> protocol (code in siw_qp_rx.c for RX). > >>> Now, if the higher layer (application) protocol has > >>> a semantic where the responder sends the first message, > >>> we could end up in an infinite wait for that packet > >>> at initiator application side. > >>> This, since the complete first FPDU was already > >>> received at the TCP socket, while the QP was not in > >>> RTS. It will not trigger any additional sk->sk_data_ready() > >>> and we are stuck with a FPDU in the socket rx buffer. > >>> > >>> I implemented that 4 bytes extra data testing only to > >>> keep the siw protocol state machine as simple as > >>> possible (it is already unfortunately complex, if > >>> you count the lines in siw_cm.c), while following > >>> the protocol rules. > >>> > >>> I suggest to correctly implement peer2peer mode and > >>> delay the established event at the responder side until > >>> it got the negotiated first zero length FPDU. Out > >>> of the possible options, siw supports both zero length > >>> WRITE and READ, but no SEND, since this would consume > >>> an extra receive WQE. > >>> > >>> As a last resort, I might consider extending the > >>> siw state machine to handle those - non-complaint - > >>> cases gracefully. But, that would result in different > >>> code than just avoiding checking for peer protocol > >>> violation. > >> > >> Bernard, everything you describe in siw seems perfectly valid > >> to me, and in fact that's why the MPA enhanced connection mode > >> was designed the way it is. I disagree that it's "cumbersome", > >> but that's a nit. > >> > >> The issue is that the responder reaches RTS at a different time > >> than the initiator reaches RTR. The original iWARP connection > >> model did not make any requirement, and races were possible. > >> MPAv2 fixes those. > >> > > Tom, thanks, exactly. MPAv2 fixes it. And an implementation Thanks Bernard & Tom, Could you also please consider making MPA ehanced connection mode(with RTR handshake) as default for SIW, as MPA V2 peer- to-peer mode seems to be more promising that MPA V2 client-server Mode, wrt race conditions. Currently SIW has MPA V2 client-server Mode as default. In siw_main.c: const bool peer_to_peer = 1; Between, as per my observations, the chances of hitting this 'connect error' issue is higher with: SIW(initator)<--->(responder)SIW than: SIW(initator) <---> (responder)hard-iwarp > > of that fix MUST adhere to the defined rules. If one > > negotiates an extra handshake to synchronize, it > > MUST adhere to that handshake rules. There is no point in > > negotiating an extra handshake, and right away ignoring it. > > If the responder wants a zero length WRITE, it MUST use > > its reception to synchronize its transition to RTS. > > > > Shall we extend the siw state machine to support silly > > peer behavior? As said, if I read there is no way of > > Of course the right answer to this is "no", but I'd reserve > drawing that conclusion until I am certain it won't cause > trouble. If existing iWARP implementations don't enforce > this, and SIW becomes the odd duck failing connections, then > I'm afraid I would have to say "yes" instead. > > > making the peer implementation RFC compliant (it would > > be unexpected), I'll look into the applicability of > > Postel's Rule. But let's make those things explicit. > > Explicit is best, agreed. > > Tom. > > > > >> SIW is certainly within its rights to prevent peers from > >> sending prior to responder RTR. I'd suggest that if possible, > >> you try to detect the deadlock rather than flatly rejecting any > >> incoming bytes. The Internet Principle (be liberal in what you > >> accept...), after all. > >> > >> This same dance happens in IB/RoCE, btw. Just via different > >> messages. > >> > > Interesting! > > > > Thanks > > Bernard. > > > >> Tom. > >> > >> > >>> > >>> Thanks, > >>> Bernard. > >>> > >>> > >>>>> > >>>>> Sorry for the confusion. But the current > >>>>> siw code appears to be just correct. > >>>>> > >>>>> Thanks > >>>>> Bernard > >>>>> > >>>>> > >>>>> > >>>>>> + if ((__mpa_rr_revision(cep->mpa.hdr.params.bits) == > >>>>>> MPA_REVISION_2) && > >>>>>> + (cep->state == SIW_EPSTATE_AWAIT_MPAREP)) { > >>>>>> + int mpa_p2p_mode = cep->mpa.v2_ctrl_req.ord & > >>>>>> + (MPA_V2_RDMA_WRITE_RTR | > >>>>>> MPA_V2_RDMA_READ_RTR); > >>>>>> + if (!mpa_p2p_mode) > >>>>>> + extra_data_check = 0; > >>>>>> + } > >>>>>> + > >>>>>> /* > >>>>>> * At this point, we must have hdr->params.pd_len != 0. > >>>>>> * A private data buffer gets allocated if > >>>> hdr->params.pd_len > >>>>>> != > >>>>>> * 0. > >>>>>> */ > >>>>>> if (!cep->mpa.pdata) { > >>>>>> - cep->mpa.pdata = kmalloc(pd_len + 4, > >> GFP_KERNEL); > >>>>>> + cep->mpa.pdata = kmalloc(pd_len + > >>>> extra_data_check, > >>>>>> GFP_KERNEL); > >>>>>> if (!cep->mpa.pdata) > >>>>>> return -ENOMEM; > >>>>>> } > >>>>>> rcvd = ksock_recv( > >>>>>> s, cep->mpa.pdata + cep->mpa.bytes_rcvd - > >>>>>> sizeof(struct > >>>>>> mpa_rr), > >>>>>> - to_rcv + 4, MSG_DONTWAIT); > >>>>>> + to_rcv + extra_data_check, MSG_DONTWAIT); > >>>>>> > >>>>>> if (rcvd < 0) > >>>>>> return rcvd; > >>>>>> > >>>>>> - if (rcvd > to_rcv) > >>>>>> + if (extra_data_check && (rcvd > to_rcv)) > >>>>>> return -EPROTO; > >>>>>> > >>>>>> cep->mpa.bytes_rcvd += rcvd; > >>>>>> > >>>>>> -Krishna. > >>>>>>> > >>>>>>> Tom. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>>> > >>>>>>>> Bernard. > >>>>>>>> > >>>>>>>>> > >>>>>>>>>> So, for the MPA request case, this code is needed > >>>>>>>>>> to check for protocol correctness. > >>>>>>>>>> You are right for the MPA reply case - if we are > >>>>>>>>>> _not_ in peer2peer mode, the peer can immediately > >>>>>>>>>> start sending data in RDMA mode after its MPA Reply. > >>>>>>>>>> So we shall add appropriate code to be more specific > >>>>>>>>>> For an error, we are (1) processing an MPA Request, > >>>>>>>>>> OR (2) processing an MPA Reply AND we are not expected > >>>>>>>>>> to send an initial READ/WRITE/Send as negotiated with > >>>>>>>>>> the peer (peer2peer mode MPA handshake). > >>>>>>>>>> > >>>>>>>>>> Just removing this check would make siw more permissive, > >>>>>>>>>> but to a point where peer MPA protocol errors are > >>>>>>>>>> tolerated. I am not in favor of that level of > >>>>>>>>>> forgiveness. > >>>>>>>>>> > >>>>>>>>>> If possible, please provide an appropriate patch > >>>>>>>>>> or (if it causes current issues with another peer > >>>>>>>>>> iWarp implementation) just run in MPA peer2peer mode, > >>>>>>>>>> where the current check is appropriate. > >>>>>>>>>> Otherwise, I would provide an appropriate fix by Monday > >>>>>>>>>> (I am still out of office this week). > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Many thanks and best regards, > >>>>>>>>>> Bernard. > >>>>>>>>>> > >> > >> > > > > > >