RE: [PATCH rdma-next 0/3] Support out of order data placement


 




> -----Original Message-----
> From: Steve Wise [mailto:swise@xxxxxxxxxxxxxxxxxxxxx]
> Sent: Monday, June 12, 2017 4:19 PM
> To: 'Jason Gunthorpe' <jgunthorpe@xxxxxxxxxxxxxxxxxxxx>
> Cc: 'Hefty, Sean' <sean.hefty@xxxxxxxxx>; 'Dalessandro, Dennis'
> <dennis.dalessandro@xxxxxxxxx>; Parav Pandit <parav@xxxxxxxxxxxx>;
> 'Leon Romanovsky' <leon@xxxxxxxxxx>; 'Doug Ledford'
> <dledford@xxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx; Idan Burstein
> <idanb@xxxxxxxxxxxx>
> Subject: RE: [PATCH rdma-next 0/3] Support out of order data placement
> 
> > On Mon, Jun 12, 2017 at 03:57:29PM -0500, Steve Wise wrote:
> > > > > > When transmitter and receiver are enabled to do so, as I
> > > > > > described in the overview section of the Documentation, it helps
> > > > > > (a) to avoid retransmission - improves network utilization
> > > > > > (b) to reduce latency, since timers do not kick in.
> > > > >
> > > > > Yes those benefits are clear. I see no reason why it shouldn't
> > > > > always be done is my point. Application shouldn't have to care
> > > > > and there is no need to make this an additional flag.
> > > >
> > > > The app cares when data from write 2 can be written at the target
> > > > before data from write 1, especially if the writes target the same
> > > > memory buffers.  (At least I think this is the intent of exposing
> > > > this to the app.)
> > > >
> > > > Note that the provider can always provide stronger ordering than
> > > > what the app needs.
> > >
> > > My understanding is that IB or IW apps should never assume ingress
> > > write or read response data is _placed_ into local memory in the
> > > order it was transmitted from the peer.  The only guarantee is that
> > > the _indication_ of the arrived data preserve the sender's ordering.
> > > However, I'm thinking that there are applications out there that
> > > spin polling local memory that is the target of a write or read
> > > response and assume the last bit of that memory will get written
> > > last...
> >
> > That is with respect to the CPU, but IB requires strong ordering
> > between messages within the same QP, eg if I do
> >
> > RDMA WRITE addr=0 data=1
> > RDMA WRITE addr=0 data=2
> > RDMA WRITE addr=0 data=3
> > RDMA READ  addr=0
> >
> > I must always get 3, not something else.
> 
> Correct, but the peer, ie the remote end that is the target of those writes,
> can spin looking at local address 0 and it might see 1, 2, or 3.  Eventually it
> will see 3.  But there is no guarantee that it will see 1 before 2 or even see 1
> or 2
> at all depending on timing.
> 
> But what I was getting at is this:  Say you tell your peer to RDMA WRITE
> 16KB into your local buffer.  And let us say the last bit of that 16KB data will
> be a 1, and that the current value of that bit location in the local buffer is 0.
> It is incorrect for the app to spin reading that bit until it reads 1,
> assuming that the data prior to that bit has been placed at that point.  At
> least with the
> iWARP spec, out of order placement is allowed.  So if the 16KB was broken
> into X iWARP DDP segments, the last segment could have been placed
> before the other segments.
> A correct application will require the peer to post a SEND after the WRITE or
> WRITE_WITH_IMMEDIATE, and only know the data has been placed into the
> local buffer when it polls the recv completion for the SEND or
> WRITE_W_IMMD.  An iWARP RNIC _must_ guarantee in-order delivery of
> data (but not actual placement).  Am I making sense?
> 
> I'm guessing no HCAs nor RNICs actually place data out of order.  cxgb*
> does not.  So applications _might_ be doing the spin technique I described.
> I recall a long time ago that MVAPICH2 did this.  Not sure if it still does.
> 
> >
> > It would be notable if this 'out of order' feature violated that
> > invariant, but many ULPs would probably still be OK.
> >
I certainly don't see a point in breaking users who are polling on data, even though they should have followed optional requirement o9-20.
Also, read responses can arrive out of order; if applications poll on that data as well, they would break too, not just the writes.
Refer to Table 79 for the case of two read operations.

> > Frankly, Parav's original message doesn't seem to describe at all what
> > this is about, so maybe we should all wait until v2, and maybe more
> > people from Mellanox could contribute to sensibly describing it if
> > they want it in ibverbs.

I will add the following details to the Documentation:
1. Mention PCIe relaxed ordering (Bart's point).
2. Include a responder-side table like Table 79 to crisply describe all cases and their ordering with respect to sends and other messages.
3. Indicate that C9-28 is relaxed when OOO is enabled on a QP, as part of the description of the new responder-side table. (This was an offline comment that I received.)
4. Provide the examples that Steve and Jason highlighted, with multiple writes to the same memory location.
5. Reiterate Table 79 to make it clear what changes and what does not.
Let me know if you want to see any more details.

> >
> > Jason

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



