RE: [PATCH rdma-next 0/3] Support out of order data placement

"Steve Wise" <swise@xxxxxxxxxxxxxxxxxxxxx> · Mon, 12 Jun 2017 16:18:34 -0500

> On Mon, Jun 12, 2017 at 03:57:29PM -0500, Steve Wise wrote:
> > > > > When transmitter and receiver are enabled to do so, as I described in
> > > > > the overview section of Documentation, it helps
> > > > > (a) to avoid retransmission - improves network utilization
> > > > > (b) reduces latency due to timers not kicking in.
> > > >
> > > > Yes those benefits are clear. I see no reason why it shouldn't always
> > > > be done is my point. Application shouldn't have to care and there is
> > > > no need to make this an additional flag.
> > >
> > > The app cares when data from write 2 can be written at the target before
> > > data from write 1, especially if the writes target the same memory
> > > buffers. (At least I think this is the intent of exposing this to the app.)
> > >
> > > Note that the provider can always provide stronger ordering than what the
> > > app needs.
> >
> > My understanding is that IB or IW apps should never assume ingress
> > write or read response data is _placed_ into local memory in the
> > order it was transmitted from the peer.  The only guarantee is that
> > the _indication_ of the arrived data preserve the sender's ordering.
> > However, I'm thinking that there are applications out there that
> > spin polling local memory that is the target of a write or read
> > response and assume the last bit of that memory will get written
> > last...
> 
> That is with respect to the CPU, but IB requires strong ordering
> between messages within the same QP, eg if I do
> 
> RDMA WRITE addr=0 data=1
> RDMA WRITE addr=0 data=2
> RDMA WRITE addr=0 data=3
> RDMA READ  addr=0
> 
> I must always get 3, not something else.

Correct, but the peer, i.e. the remote end that is the target of those writes,
can spin looking at local address 0 and it might see 1, 2, or 3.  Eventually it
will see 3.  But there is no guarantee that it will see 1 before 2, or even see
1 or 2 at all, depending on timing.
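
To make the polling point concrete, here is a small Python model (not verbs
code).  It assumes the three writes are placed in order at a single address and
that the poller samples memory at arbitrary moments, with one sample taken after
the last write.  The function name and the initial value 0 are made up for the
sketch.

```python
from itertools import combinations

def possible_observations(writes, initial=0):
    """Enumerate the value sequences a local poller could observe.

    Assumption for this sketch: the writes land in order at one address,
    and the poller samples memory at an arbitrary subset of moments,
    always including one sample after the final write.
    """
    # Memory contents over time: the initial value, then each written value.
    states = [initial] + list(writes)
    last = len(states) - 1
    seen = set()
    for k in range(last + 1):
        # Pick any k earlier moments (in time order) at which the poller looked.
        for picks in combinations(range(last), k):
            seen.add(tuple(states[i] for i in picks) + (states[last],))
    return seen

obs = possible_observations([1, 2, 3])
for sequence in sorted(obs):
    print(sequence)
```

Every sequence ends in 3, but the intermediate values 1 and 2 may be missing
entirely, which is all the poller can rely on in this model.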

But what I was getting at is this:  Say you tell your peer to RDMA WRITE 16KB
into your local buffer.  And let us say the last bit of that 16KB of data will
be a 1, and that the current value of that bit location in the local buffer is
0.  It is incorrect for the app to spin reading that bit until it reads 1, and
then assume the data prior to that bit has been placed at that point.  At least
with the iWARP spec, out of order placement is allowed.  So if the 16KB was
broken into X iWARP DDP segments, the last segment could have been placed before
the other segments.  A correct application will require the peer to either post
a SEND after the WRITE or use WRITE_WITH_IMMEDIATE, and will only know the data
has been placed into the local buffer when it polls the recv completion for the
SEND or WRITE_WITH_IMMEDIATE.  An iWARP RNIC _must_ guarantee in-order delivery
of data (but not actual placement).  Am I making sense?
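
A rough Python model of the scenario above, in case it helps.  It is not verbs
code: the 4096-byte segment size, the function name, and the use of the last
byte (rather than the last bit) as the flag are all assumptions made for the
sketch.  The "completion" is simply modelled as the point after every segment
has been placed.

```python
SEG = 4096  # hypothetical segment size, chosen only for this model

def flag_vs_completion(payload, order):
    """Place SEG-sized segments of payload into a zeroed buffer in `order`.

    Returns a pair of booleans:
      - was the buffer complete when the trailing flag byte first read 1?
      - was the buffer complete at the modelled completion, i.e. after all
        segments were placed?
    """
    buf = bytearray(len(payload))
    complete_at_flag = None
    for idx in order:
        off = idx * SEG
        buf[off:off + SEG] = payload[off:off + SEG]
        # What a spinning reader of the last byte would conclude at this point.
        if complete_at_flag is None and buf[-1] == 1:
            complete_at_flag = bytes(buf) == payload
    return complete_at_flag, bytes(buf) == payload

# 16KB payload whose final byte is the flag value 1.
payload = b"\xab" * (16 * 1024 - 1) + b"\x01"

print(flag_vs_completion(payload, [0, 1, 2, 3]))  # in-order placement
print(flag_vs_completion(payload, [3, 0, 1, 2]))  # last segment placed first
```

With in-order placement the spin happens to work.  With the last segment placed
first, the flag reads 1 while three quarters of the buffer is still stale, and
only the completion reflects fully placed data.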

I'm guessing no HCAs or RNICs actually place data out of order.  cxgb* does
not.  So applications _might_ be doing the spin technique I described.  I recall
that a long time ago MVAPICH2 did this.  Not sure if it still does.

> 
> It would be notable if this 'out of order' feature violated that
> invariant, but many ULPs would probably still be OK.
> 
> Frankly, Parav's original message doesn't seem to describe at all what
> this is about, so maybe we should all wait until v2, and maybe more
> people from Mellanox could contribute to sensibly describing it if
> they want it in ibverbs.
> 
> Jason
