Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

On 4/12/2021 2:32 PM, Haakon Bugge wrote:


On 10 Apr 2021, at 15:30, David Laight <David.Laight@xxxxxxxxxx> wrote:

From: Tom Talpey
Sent: 09 April 2021 18:49
On 4/9/2021 12:27 PM, Haakon Bugge wrote:


On 9 Apr 2021, at 17:32, Tom Talpey <tom@xxxxxxxxxx> wrote:

On 4/9/2021 10:45 AM, Chuck Lever III wrote:
On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@xxxxxxxxxx> wrote:

On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:

We need to get a better idea what correctness testing has been done,
and whether positive correctness testing results can be replicated
on a variety of platforms.

RO has been rolling out slowly on mlx5 over a few years and storage
ULPs are the last to change. eg the mlx5 ethernet driver has had RO
turned on for a long time, userspace HPC applications have been using
it for a while now too.

I'd love to see RO be used more, it was always something the RDMA
specs supported and carefully architected for. My only concern is
that it's difficult to get right, especially when the platforms
have been running strictly-ordered for so long. The ULPs need
testing, and a lot of it.

We know there are platforms with broken RO implementations (like
Haswell) but the kernel is supposed to globally turn off RO on all
those cases. I'd be a bit surprised if we discover any more from this
series.
On the other hand there are platforms that get huge speed ups from
turning this on, AMD is one example, there are a bunch in the ARM
world too.

My belief is that the biggest risk is from situations where completions
are batched, and therefore polling is used to detect them without
interrupts (which explicitly). The RO pipeline will completely reorder
DMA writes, and consumers which infer ordering from memory contents may
break. This can even apply within the provider code, which may attempt
to poll WR and CQ structures, and be tripped up.

You are referring specifically to RPC/RDMA depending on Receive
completions to guarantee that previous RDMA Writes have been
retired? Or is there a particular implementation practice in
the Linux RPC/RDMA code that worries you?

Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
is hopefully unfounded, is that the RO pipeline might not have flushed
when a completion is posted *after* posting an interrupt.

Something like this...

RDMA Write arrives
	PCIe RO Write for data
	PCIe RO Write for data
	...
RDMA Write arrives
	PCIe RO Write for data
	...
RDMA Send arrives
	PCIe RO Write for receive data
	PCIe RO Write for receive descriptor

Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then
it will ensure that previously written RO data has global visibility when the CQE can be observed.

I wasn't aware that a strongly-ordered PCIe Write will ensure that
prior relaxed-ordered writes went first. If that's the case, I'm
fine with it - as long as the providers are correctly coded!!

The PCIe spec (table "Ordering Rules Summary") is quite clear here (a Posted Request is a Memory Write Request in this context):

	A Posted Request must not pass another Posted Request unless A2b applies.

	A2b: A Posted Request with RO Set is permitted to pass another Posted Request.


Thxs, Håkon

Ok, good - a non-RO write (for example, to a CQE), or an interrupt
(which would be similarly non-RO), will "get behind" all prior writes.
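
As a rough sketch of the rule quoted above, the check below models only A2b: a later Posted Request may pass an earlier one only if the later request has RO set. The struct and function names are hypothetical, and the model deliberately ignores everything else in the ordering table (ID-based ordering, non-posted requests, completions).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a posted write; only the RO attribute matters here. */
struct posted_write {
    bool relaxed_ordering;   /* RO attribute set on this request */
};

/*
 * May `later` pass `earlier`? Per the quoted rule, a Posted Request must not
 * pass another Posted Request unless A2b applies, and A2b depends only on the
 * RO attribute of the request doing the passing.
 */
static bool may_pass(const struct posted_write *later,
                     const struct posted_write *earlier)
{
    (void)earlier;   /* the earlier request's RO bit does not enter the rule */
    return later->relaxed_ordering;
}

/*
 * True if q[idx] is not permitted to pass any request issued before it,
 * i.e. it "gets behind" all prior writes in the queue q[0..n).
 */
static bool ordered_behind_all_prior(const struct posted_write *q,
                                     size_t n, size_t idx)
{
    assert(idx < n);
    for (size_t i = 0; i < idx; i++) {
        if (may_pass(&q[idx], &q[i]))
            return false;
    }
    return true;
}
```

With a queue of two RO data writes followed by a non-RO CQE write, the CQE write is ordered behind both data writes, while the second data write is free to pass the first.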

So the issue is only in testing all the providers and platforms,
to be sure this new behavior isn't tickling anything that went
unnoticed all along, because no RDMA provider ever issued RO.

Honestly, the Haswell sounds like a great first candidate, because
if it has a known-broken RO behavior, verifying that it works with
this change is highly important. I'd have greater confidence in newer
platforms, in other words. They *all* have to work, provably.

Tom.

I remember trying to read the relevant section of the PCIe spec.
(Possibly in a book that was trying to make it easier to understand!)
It is about as clear as mud.

I presume this is all about allowing PCIe targets (eg ethernet cards)
to use relaxed ordering on write requests to host memory.
And that such writes can be completed out of order?

It isn't entirely clear that you aren't talking of letting the
cpu do 'relaxed order' writes to PCIe targets!

For a typical ethernet driver the receive interrupt just means
'go and look at the receive descriptor ring'.
So there is an absolute requirement that the writes for data
buffer complete before the write to the receive descriptor.
There is no requirement for the interrupt (requested after the
descriptor write) to have been seen by the cpu.

Quite often the driver will find the 'receive complete'
descriptor when processing frames from an earlier interrupt
(and nothing to do in response to the interrupt itself).

So the write to the receive descriptor would have to have RO clear
to ensure that all the buffer writes complete first.
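
A minimal userspace sketch of that dependency, assuming a made-up descriptor layout (struct rx_desc, RX_DESC_DONE and both functions are hypothetical, not from any real driver). C11 release/acquire stands in for the device writing the descriptor with RO clear and the driver reading the descriptor before the buffer; it is only an analogy for the PCIe behaviour, and a real Linux driver would use the kernel's own primitives such as dma_rmb() instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RX_DESC_DONE 0x1u   /* hypothetical "receive complete" flag */
#define RX_BUF_SIZE  64

struct rx_desc {
    _Atomic uint32_t status;    /* written last by the "device" (RO clear) */
    uint32_t len;
    uint8_t buf[RX_BUF_SIZE];   /* payload; these writes could have RO set */
};

/* Stand-in for the device: payload first, then the descriptor status. */
static void device_receive(struct rx_desc *d, const void *data, uint32_t len)
{
    assert(len <= RX_BUF_SIZE);
    memcpy(d->buf, data, len);
    d->len = len;
    /* Release store: everything written above is visible before the flag. */
    atomic_store_explicit(&d->status, RX_DESC_DONE, memory_order_release);
}

/*
 * Driver side: look at the descriptor, with or without an interrupt.
 * Returns true and copies the payload only if the descriptor is complete.
 */
static bool rx_poll(struct rx_desc *d, void *out, uint32_t *len)
{
    /* Acquire load: the buffer is only read after the flag has been seen. */
    uint32_t st = atomic_load_explicit(&d->status, memory_order_acquire);

    if (!(st & RX_DESC_DONE))
        return false;

    memcpy(out, d->buf, d->len);
    *len = d->len;
    /* Hand the descriptor back to the "device". */
    atomic_store_explicit(&d->status, 0, memory_order_relaxed);
    return true;
}
```

Note that rx_poll() never consults an interrupt: it relies purely on the descriptor contents, which is why the descriptor write must not be allowed to pass the data writes.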

(The furthest I've got into PCIe internals was fixing the bug
in some vendor-supplied FPGA logic that failed to correctly
handle multiple data TLP responses to a single read TLP.
Fortunately it wasn't in the hard-IP bit.)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



