Re: [PATCH rdma-next 0/3] Support out of order data placement

Tom Talpey <tom@xxxxxxxxxx> · Mon, 12 Jun 2017 21:30:05 -0400

On 6/12/2017 8:36 PM, Parav Pandit wrote:
-----Original Message-----
From: Tom Talpey [mailto:tom@xxxxxxxxxx]
Sent: Monday, June 12, 2017 7:12 PM
To: Parav Pandit <parav@xxxxxxxxxxxx>; Jason Gunthorpe
<jgunthorpe@xxxxxxxxxxxxxxxxxxxx>
Cc: Bart Van Assche <Bart.VanAssche@xxxxxxxxxxx>; leon@xxxxxxxxxx;
dledford@xxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx; Idan Burstein
<idanb@xxxxxxxxxxxx>
Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement

On 6/12/2017 7:59 PM, Parav Pandit wrote:
-----Original Message-----
From: Tom Talpey [mailto:tom@xxxxxxxxxx]
Sent: Monday, June 12, 2017 6:44 PM
To: Parav Pandit <parav@xxxxxxxxxxxx>; Jason Gunthorpe
<jgunthorpe@xxxxxxxxxxxxxxxxxxxx>
Cc: Bart Van Assche <Bart.VanAssche@xxxxxxxxxxx>; leon@xxxxxxxxxx;
dledford@xxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx; Idan Burstein
<idanb@xxxxxxxxxxxx>
Subject: Re: [PATCH rdma-next 0/3] Support out of order data
placement

On 6/12/2017 6:54 PM, Parav Pandit wrote:
Hi Tom,

-----Original Message-----
From: Tom Talpey [mailto:tom@xxxxxxxxxx]
Sent: Monday, June 12, 2017 5:20 PM
To: Parav Pandit <parav@xxxxxxxxxxxx>; Jason Gunthorpe
<jgunthorpe@xxxxxxxxxxxxxxxxxxxx>
Cc: Bart Van Assche <Bart.VanAssche@xxxxxxxxxxx>;
leon@xxxxxxxxxx;
dledford@xxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx; Idan Burstein
<idanb@xxxxxxxxxxxx>
Subject: Re: [PATCH rdma-next 0/3] Support out of order data
placement

On 6/12/2017 5:32 PM, Parav Pandit wrote:
Hi Tom,
...

I agree with Jason, the bit should be 1 by default, if defined as you
propose.
Out-of-order is the norm, not the exception, for ULPs.
Honestly, I think you should perhaps consider making it the
default on your devices, and allowing only MLX-aware ULPs to turn
it off.

There can be cases in deployment where responder has support for
receiving out-of-order, but requester doesn't.

Yuck! So this needs to be negotiated end-to-end, and by the upper
layer?
Talk about barriers to adoption, and opportunities for disaster.

Jason confirmed that all Linux kernel consumers are coded to comply
with the o9-20 requirement, so I think kernel-based rdma-cm
consumers can be enabled transparently end-to-end, without the ULP's
involvement, via rdma_accept() and rdma_connect().

I have two thoughts here.

1) You seem to assume all consumers are Linux, and do not need to
negotiate. This is a dangerous assumption.
Certainly not. I didn't assume that. I just gave one example showing that
known consumers can be enabled without modifying the ULP.
This is explained further under the 3rd question.
Other consumers can work with this solution too.
For example, take a Linux rdmacm-based client and a server on another OS.
The client is ooo capable.
The server is not ooo capable.
Once you follow the rdmacm-based sequence below, it will be clear how this
works.

Oh, so there's a MAD protocol change under the hood.
No. There is no change under the hood.
Your question was how we can avoid ULP changes while still letting ULPs benefit from this feature.
So I said that the rdmacm-based Linux kernel consumers we know of, which comply with o9-20, can benefit once rdmacm is extended as in the example below.

Well, that's a wider question. And I still don't understand how existing,
non-strict-requiring protocols can take advantage of this.
Nor how this works for non-Mellanox, non-IB/RoCE implementations.

The device capability indicates which devices support this. It is explained in the usage section of Documentation/out_of_order.txt.
So whichever vendor or protocol supports it can set this optional device capability.

Again, I'd be a lot less concerned if non-strict were the default, and strict
mode was negotiated. It's all just so upside-down.

In IB spec, in-order delivery is default.

I don't agree. Requests are sent in-order, and the responder
processes them in-order, but the bytes themselves are not
guaranteed to appear in-order. Additionally, if retries occur,
this is most definitely not the case.

Section 9.5 Transaction Ordering, I believe, covers these
requirements. Can you tell me where I misunderstand them?
In fact, c9-28 explicitly warns:

  • An application shall not depend upon the order of data writes to
  memory within a message. For example, if an application sets up
  data buffers that overlap, for separate data segments within a
  message, it is not guaranteed that the last sent data will always
  overwrite the earlier.

My guess is that this bit overrides the MLX behavior of
never pipelining RDMA Write requests, allowing more packets
to be queued at the responder and making better use of the
network. This is not at all prohibited by the spec, nor is
it unexpected by properly-coded upper layers, which all the
kernel consumers are.

I have one other question on Documentation/out_of_order.txt.
It states the fence bit can be used to force ordering on a
non-strict connection. But fence doesn't apply to RDMA Write?
It only applies to operations which produce a reply, such as
RDMA Read or Atomic. Have you changed the semantic?

Tom.

So can you suggest how we can change the default IB behavior without
breaking anything?
Adding an optional attribute seems the right way to ensure compatibility.

Tom.

2) I assume that there is some performance benefit to toggling this
setting to non-strict. So, how do existing consumers get this
advantage, especially since they don't need strict semantics? Bearing
in mind that they do have to negotiate this end-to-end, meaning they
require a protocol extension.
I don't have a completely transparent upstream solution for existing
consumers yet.

Actually. I have a third thought. Since this is an attribute to qp
creation, performed even before establishing a connection, how does
the upper layer know when to set it?
This is not at QP creation time. I have described it in usage section 3
of Documentation/out_of_order.txt.
It happens at the QP state transition from INIT to RTR.
Here is the flow. It's just not coded enough for posting patches.

1. When rdmacm active side creates the QP, it is in INIT state.
2. Send MAD_Req msg (indicating ooo_requested=1).
3. When rdmacm passive side receives the message, it looks up device_cap
   attribute and matches it against ooo_requested flag.
4. When device supports it, MAD_Rsp msg sets ooo_enabled=1; if it
   doesn't support it, ooo_enabled=0.
5. rdmacm passive side creates the QP and moves to RTR state (with QP
   ooo enabled bit set).
6. Active side receives the message and puts the QP to RTR, RTS state
   based on received bit setting from passive side.

The flow is no different from how the rest of the connection-specific
parameters, such as IRD/ORD, PSN, timeouts and MTU, are shared.
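
The six steps above can be sketched as a small stand-alone C model. This is
only an illustration of the proposed exchange, not code from the patch set:
every type and function name below (struct device, struct cm_req,
active_send_req() and so on) is hypothetical and is not a real rdma-cm or
verbs symbol. It also assumes the active side requests ooo only when its own
device is capable, which the flow above does not state explicitly.

```c
#include <stdbool.h>

/* Hypothetical types: none of these are real rdma-cm or verbs symbols. */
enum qp_state { QP_INIT, QP_RTR, QP_RTS };

struct device { bool ooo_capable; };           /* stands in for device_cap   */
struct qp     { enum qp_state state; bool ooo; };
struct cm_req { bool ooo_requested; };         /* flag carried in MAD_Req    */
struct cm_rsp { bool ooo_enabled; };           /* flag carried in MAD_Rsp    */

/* Steps 1-2: the active QP starts in INIT and the request carries the flag.
 * Assumption: the active side asks for ooo only if its device supports it. */
static struct cm_req active_send_req(const struct device *dev, struct qp *qp)
{
    qp->state = QP_INIT;
    qp->ooo = false;
    return (struct cm_req){ .ooo_requested = dev->ooo_capable };
}

/* Steps 3-5: the passive side matches the request against its device
 * capability, answers, and moves its QP to RTR with the agreed bit. */
static struct cm_rsp passive_handle_req(const struct device *dev,
                                        const struct cm_req *req,
                                        struct qp *qp)
{
    bool enabled = req->ooo_requested && dev->ooo_capable;

    qp->ooo = enabled;
    qp->state = QP_RTR;
    return (struct cm_rsp){ .ooo_enabled = enabled };
}

/* Step 6: the active side applies the received bit, then goes RTR -> RTS. */
static void active_handle_rsp(const struct cm_rsp *rsp, struct qp *qp)
{
    qp->ooo = rsp->ooo_enabled;
    qp->state = QP_RTR;
    qp->state = QP_RTS;
}

/* Runs the whole exchange; returns true if both QPs end up ooo-enabled. */
static bool negotiate_ooo(const struct device *active_dev,
                          const struct device *passive_dev,
                          struct qp *active_qp, struct qp *passive_qp)
{
    struct cm_req req = active_send_req(active_dev, active_qp);
    struct cm_rsp rsp = passive_handle_req(passive_dev, &req, passive_qp);

    active_handle_rsp(&rsp, active_qp);
    return active_qp->ooo && passive_qp->ooo;
}
```

In this model the ooo-capable client talking to a non-capable server, the
example given earlier in the thread, ends with ooo disabled on both QPs, and
neither side's ULP is involved in the decision.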

Tom.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html