RE: [PATCH rdma-next 0/3] Support out of order data placement

Parav Pandit <parav@xxxxxxxxxxxx> · Sat, 22 Jul 2017 04:50:06 +0000

Hi Tom,

> -----Original Message-----
> From: Tom Talpey [mailto:tom@xxxxxxxxxx]
> Sent: Friday, July 21, 2017 9:29 PM
> To: Parav Pandit <parav@xxxxxxxxxxxx>; Jason Gunthorpe
> <jgunthorpe@xxxxxxxxxxxxxxxxxxxx>
> Cc: Bart Van Assche <Bart.VanAssche@xxxxxxxxxxx>; leon@xxxxxxxxxx;
> dledford@xxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx; Idan Burstein
> <idanb@xxxxxxxxxxxx>
> Subject: Re: [PATCH rdma-next 0/3] Support out of order data placement
> 
> On 7/18/2017 10:33 PM, Parav Pandit wrote:
> > Hi Tom, Jason,
> >
> > Sorry for the late response.
> > Please find the response inline below.
> >
> >> -----Original Message-----
> >> From: Tom Talpey [mailto:tom@xxxxxxxxxx]
> >> Sent: Monday, June 12, 2017 8:30 PM
> >> To: Parav Pandit <parav@xxxxxxxxxxxx>; Jason Gunthorpe
> >> <jgunthorpe@xxxxxxxxxxxxxxxxxxxx>
> >> Cc: Bart Van Assche <Bart.VanAssche@xxxxxxxxxxx>; leon@xxxxxxxxxx;
> >> dledford@xxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx; Idan Burstein
> >> <idanb@xxxxxxxxxxxx>
> >> Subject: Re: [PATCH rdma-next 0/3] Support out of order data
> >> placement
> >>
> >>>
> >>> In IB spec, in-order delivery is default.
> >>
> >> I don't agree. Requests are sent in-order, and the responder
> >> processes them in-order, but the bytes themselves are not guaranteed
> >> to appear in-order.
> >> Additionally, if retries occur, this is most definitely not the case.
> >>
> >> Section 9.5 Transaction Ordering, I believe, covers these
> >> requirements. Can you tell me where I misunderstand them?
> >> In fact, c9-28 explicitly warns:
> >>
> >>     • An application shall not depend upon the order of data writes to
> >>     memory within a message. For example, if an application sets up
> >>     data buffers that overlap, for separate data segments within a
> >>     message, it is not guaranteed that the last sent data will always
> >>     overwrite the earlier.
> >>
> > The IB spec indeed does not imply any ordering in the placement of data
> > into memory within a single message.
> >
> > It does guarantee that writes don't bypass writes and reads don't bypass
> > reads (Table 76), and transport operations are executed in their
> > *message* order (C9-28):
> > "A responder shall execute SEND requests, RDMA WRITE requests and
> > ATOMIC Operation requests in the message order in which they are
> > received."
> >
> > Thus, ordering between messages is guaranteed - changes to remote memory
> > of an RDMA-W will be observed strictly after any changes done by a
> > previous RDMA-W; changes to local memory of an RDMA-R response will be
> > observed strictly after any changes done by a previous RDMA-R response.
> >
> > The proposed feature in this patch set is to relax the memory placement
> > ordering *across* messages and not within a single message (which is not
> > mandated by the spec as you noted), such that multiple consecutive
> > RDMA-Ws may be committed to memory in any order, and similarly for
> > RDMA-R responses.
> > This changes application semantics whenever multiple-inflight RDMA
> > operations write to overlapping locations, or when one operation
> > indicates the completion of the other.
> > A simple example to clarify: a requestor posted the following work
> > elements in the written order:
> > 1. RDMA-W(VA=0x1000, value=0x1)
> > 2. RDMA-W(VA=0x1000, value=0x2)
> > 3. Send()
> > On responder side, following the Send() operation completion, and
> > according to spec (C9-28), reading from VA=0x1000 will produce the
> > value 2. With the proposed feature enabled, the read value is not
> > deterministic and dependent on the order in which the RDMA-W operations
> > were received.
> >
> > The proposed QP flag allows applications to knowingly indicate this
> > relaxed data placement, thereby enabling the HCA to place OOO RDMA
> > messages into memory without buffering them.
> 
> You didn't answer my question what is the actual benefit of relaxing the
> ordering. Is it performance?

Yes, the benefit is performance: the HCA can place out-of-order RDMA messages into memory without buffering them.

> And, specifically what applications *can't* use it?
Applications that poll on RDMA-W data on the responder side, or on RDMA-R data on the read requester side, cannot use this.
As explained in the example above, the 2nd RDMA-W message can be executed first on the responder side.
We cannot break such user space applications already deployed in the field by enabling this by default and without negotiation with the peer.
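
To make the hazard concrete, here is a toy Python model of the two-write example quoted above. It is only a sketch: it is not verbs code, `final_values` is a made-up helper, and it assumes that relaxed placement may commit the in-flight writes in any order.

```python
from itertools import permutations

def final_values(writes, relaxed):
    """Return every memory state a poller may observe once all writes landed.

    writes  -- list of (va, value) tuples in posting order
    relaxed -- True models the proposed relaxed placement (assumed here to
               allow any commit order); False models today's message order
    """
    orders = permutations(writes) if relaxed else [tuple(writes)]
    outcomes = set()
    for order in orders:
        mem = {}
        for va, value in order:
            mem[va] = value  # a later commit overwrites an earlier one
        outcomes.add(tuple(sorted(mem.items())))
    return outcomes

# The example from this thread: two RDMA-Ws to the same VA.
writes = [(0x1000, 0x1), (0x1000, 0x2)]
strict = final_values(writes, relaxed=False)
relaxed = final_values(writes, relaxed=True)
```

With message ordering the only outcome is value 0x2 at VA 0x1000; with relaxed placement the model yields either 0x1 or 0x2, which is exactly what breaks an application that polls on the written data.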
> 
> To me, it appears that most storage upper layers can already use the extension.

Yes. Since they don't poll on data and instead depend on the incoming RDMA Send, they can make use of it.

> If it performs better, I expect they will definitely want to enable it. In that case I
> believe it should be the *default*, not an opt-in that these upper layers are
> newly responsible for.

The verbs layer is unaware of the calling ULPs. At most it can easily tell whether the caller is a kernel or a user ULP, which is good enough.
The verbs layer also doesn't know whether the remote side supports it or not.
Once the rdmacm extension is done, all kernel ULPs that use rdmacm can have it enabled by default.
This patchset lets user space applications benefit from it immediately, without depending on rdmacm.

> 
> >> I have one other question on the Documentation out-of-order.txt.
> >> It states the fence bit can be used to force ordering on a non-strict
> >> connection.
> >> But fence doesn't apply to RDMA Write?
> >> It only applies to operations which produce a reply, such as RDMA
> >> Read or Atomic. Have you changed the semantic?
> >>
> > RDMA-R followed by RDMA-R semantic is changed when proposed QP flag is
> > set.
> 
> Can you explain that statement in more detail please? Also, please clarify on
> what operation(s) the fence bit now applies.
Sure.
Let's take an example.
A requestor posted the following work elements in the written order:
1. RDMA-R(VA=0x1000, len=0x400)
2. RDMA-R(VA=0x1400, len=0x4)
Currently, as per Table 76, the read response data of the 1st RDMA-R is placed first.
With the relaxed data placement attribute, the 4 bytes of data of the 2nd RDMA-R can be placed first.
If the user needs the current Table 76 ordering, it needs to set the fence on the 2nd RDMA-R.
This ensures that the 1st RDMA-R is executed before the 2nd RDMA-R.

In terms of Table 76, the entry for
RDMA-R (row) and RDMA-R (column) changes from '#' to 'F'.
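
The same read example can be sketched as a toy Python model. Again this is not verbs code: `allowed_placement_orders` and the `fenced` flag are hypothetical names, and the model assumes that a fenced RDMA-R is placed only after every RDMA-R posted before it.

```python
from itertools import permutations

def allowed_placement_orders(reads, relaxed):
    """List the orders in which read response data may be placed.

    reads   -- list of (name, fenced) tuples in posting order
    relaxed -- True models the proposed QP flag; False models the
               current Table 76 behaviour (posting order only)
    """
    names = [name for name, _ in reads]
    if not relaxed:
        return [tuple(names)]
    allowed = []
    for order in permutations(names):
        pos = {name: i for i, name in enumerate(order)}
        ok = True
        for i, (name, fenced) in enumerate(reads):
            # Assumption: a fenced read must follow all earlier reads.
            if fenced and any(pos[prev] > pos[name] for prev in names[:i]):
                ok = False
        if ok:
            allowed.append(order)
    return allowed

# RDMA-R #1 (len=0x400) followed by RDMA-R #2 (len=0x4).
no_fence = allowed_placement_orders([("R1", False), ("R2", False)], relaxed=True)
with_fence = allowed_placement_orders([("R1", False), ("R2", True)], relaxed=True)
```

Without the fence the model allows both (R1, R2) and (R2, R1); with the fence on the 2nd RDMA-R only (R1, R2) remains, which matches the '#' to 'F' change described above.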