> On 10 Apr 2021, at 15:30, David Laight <David.Laight@xxxxxxxxxx> wrote:
> 
> From: Tom Talpey
>> Sent: 09 April 2021 18:49
>> On 4/9/2021 12:27 PM, Haakon Bugge wrote:
>>>
>>>
>>>> On 9 Apr 2021, at 17:32, Tom Talpey <tom@xxxxxxxxxx> wrote:
>>>>
>>>> On 4/9/2021 10:45 AM, Chuck Lever III wrote:
>>>>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>>>>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>>>>>>
>>>>>>>> We need to get a better idea what correctness testing has been done,
>>>>>>>> and whether positive correctness testing results can be replicated
>>>>>>>> on a variety of platforms.
>>>>>>> RO has been rolling out slowly on mlx5 over a few years and storage
>>>>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>>>>>>> turned on for a long time, userspace HPC applications have been using
>>>>>>> it for a while now too.
>>>>>>
>>>>>> I'd love to see RO be used more, it was always something the RDMA
>>>>>> specs supported and carefully architected for. My only concern is
>>>>>> that it's difficult to get right, especially when the platforms
>>>>>> have been running strictly-ordered for so long. The ULPs need
>>>>>> testing, and a lot of it.
>>>>>>
>>>>>>> We know there are platforms with broken RO implementations (like
>>>>>>> Haswell) but the kernel is supposed to globally turn off RO on all
>>>>>>> those cases. I'd be a bit surprised if we discover any more from this
>>>>>>> series.
>>>>>>> On the other hand there are platforms that get huge speed ups from
>>>>>>> turning this on, AMD is one example, there are a bunch in the ARM
>>>>>>> world too.
>>>>>>
>>>>>> My belief is that the biggest risk is from situations where completions
>>>>>> are batched, and therefore polling is used to detect them without
>>>>>> interrupts (which explicitly). The RO pipeline will completely reorder
>>>>>> DMA writes, and consumers which infer ordering from memory contents may
>>>>>> break. This can even apply within the provider code, which may attempt
>>>>>> to poll WR and CQ structures, and be tripped up.
>>>>> You are referring specifically to RPC/RDMA depending on Receive
>>>>> completions to guarantee that previous RDMA Writes have been
>>>>> retired? Or is there a particular implementation practice in
>>>>> the Linux RPC/RDMA code that worries you?
>>>>
>>>> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
>>>> is hopefully unfounded, is that the RO pipeline might not have flushed
>>>> when a completion is posted *after* posting an interrupt.
>>>>
>>>> Something like this...
>>>>
>>>> RDMA Write arrives
>>>> PCIe RO Write for data
>>>> PCIe RO Write for data
>>>> ...
>>>> RDMA Write arrives
>>>> PCIe RO Write for data
>>>> ...
>>>> RDMA Send arrives
>>>> PCIe RO Write for receive data
>>>> PCIe RO Write for receive descriptor
>>>
>>> Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then
>>> it will ensure prior written RO data has global visibility when the CQE can be observed.
>>
>> I wasn't aware that a strongly-ordered PCIe Write will ensure that
>> prior relaxed-ordered writes went first. If that's the case, I'm
>> fine with it - as long as the providers are correctly coded!!

The PCIe spec (Table Ordering Rules Summary) is quite clear here (a Posted Request is a Memory Write Request in this context):

A Posted Request must not pass another Posted Request unless A2b applies.

A2b: A Posted Request with RO Set is permitted to pass another Posted Request.

Thxs, Håkon
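To make that concrete, here is a minimal sketch of the polling pattern under discussion. The CQE layout, field names and demo_process_payload() are invented purely for illustration (real providers such as mlx5 have their own formats and ownership encoding); the point is only that the consumer infers "data is present" from the CQE, which is safe because the CQE write has RO clear and so, per rule A2 above, cannot pass the earlier relaxed-ordered payload writes. dma_rmb() covers the CPU side so payload reads are not speculated ahead of the ownership check.

	#include <linux/types.h>
	#include <linux/compiler.h>
	#include <asm/barrier.h>

	/* Illustrative CQE layout only; not any real provider's format. */
	struct demo_cqe {
		u32	byte_count;
		u32	wr_id;
		u8	flags;
		u8	owner;		/* written last by HW, with RO clear */
	};

	/* Hypothetical upper-layer hook, declared only for the sketch. */
	void demo_process_payload(void *buf, u32 len);

	static bool demo_poll_one(struct demo_cqe *cqe, u8 sw_owner_val,
				  void *rx_buf)
	{
		/* CPU sees the CQE only once the non-RO write has landed. */
		if (READ_ONCE(cqe->owner) != sw_owner_val)
			return false;		/* nothing completed yet */

		/*
		 * CPU-side ordering: keep payload reads after the ownership
		 * check.  The device side is already ordered because the
		 * strongly ordered CQE write cannot pass the earlier
		 * (possibly relaxed-ordered) payload writes (PCIe rule A2/A2b).
		 */
		dma_rmb();

		demo_process_payload(rx_buf, READ_ONCE(cqe->byte_count));
		return true;
	}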
> 
> I remember trying to read the relevant section of the PCIe spec.
> (Possibly in a book that was trying to make it easier to understand!)
> It is about as clear as mud.
> 
> I presume this is all about allowing PCIe targets (eg ethernet cards)
> to use relaxed ordering on write requests to host memory.
> And that such writes can be completed out of order?
> 
> It isn't entirely clear that you aren't talking of letting the
> cpu do 'relaxed order' writes to PCIe targets!
> 
> For a typical ethernet driver the receive interrupt just means
> 'go and look at the receive descriptor ring'.
> So there is an absolute requirement that the writes for data
> buffer complete before the write to the receive descriptor.
> There is no requirement for the interrupt (requested after the
> descriptor write) to have been seen by the cpu.
> 
> Quite often the driver will find the 'receive complete'
> descriptor when processing frames from an earlier interrupt
> (and nothing to do in response to the interrupt itself).
> 
> So the write to the receive descriptor would have to have RO clear
> to ensure that all the buffer writes complete first.
> 
> (The furthest I've got into PCIe internals was fixing the bug
> in some vendor-supplied FPGA logic that failed to correctly
> handle multiple data TLP responses to a single read TLP.
> Fortunately it wasn't in the hard-IP bit.)
> 
> David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
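David's receive-path description maps to roughly the following poll loop. This is a rough sketch only: the descriptor and ring layout, the DEMO_RX_DD flag and demo_deliver() are invented for illustration rather than taken from any real driver.

	#include <linux/types.h>
	#include <linux/compiler.h>
	#include <asm/barrier.h>

	#define DEMO_RX_DD	0x0001	/* "descriptor done", HW writes it last, RO clear */

	/* Invented descriptor/ring layout; real NICs differ. */
	struct demo_rx_desc {
		u16	len;
		u16	status;
	};

	struct demo_rx_ring {
		struct demo_rx_desc	*desc;
		void			**buf;
		unsigned int		size;		/* power of two */
		unsigned int		next_to_clean;
	};

	void demo_deliver(void *buf, unsigned int len);	/* hypothetical upper layer */

	/* Drain everything the NIC has completed so far, interrupt or not. */
	static int demo_rx_poll(struct demo_rx_ring *ring, int budget)
	{
		int done = 0;

		while (done < budget) {
			unsigned int i = ring->next_to_clean;
			struct demo_rx_desc *rxd = &ring->desc[i];

			if (!(READ_ONCE(rxd->status) & DEMO_RX_DD))
				break;		/* HW has not written this one yet */

			/* Don't read len/payload before the DD check is settled. */
			dma_rmb();

			/*
			 * Safe only because the DD write had RO clear: it
			 * could not pass the buffer-data writes before it.
			 */
			demo_deliver(ring->buf[i], READ_ONCE(rxd->len));

			ring->next_to_clean = (i + 1) & (ring->size - 1);
			done++;
		}
		return done;
	}

The loop happily consumes descriptors that completed since an earlier interrupt, so correctness rests entirely on the ordering of the descriptor write, not on when (or whether) the interrupt is seen by the CPU, which is exactly the requirement David states.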