On Fri, 2023-11-24 at 10:55 -0400, Jason Gunthorpe wrote: > On Fri, Nov 24, 2023 at 03:48:22PM +0100, Niklas Schnelle wrote: > > On Fri, 2023-11-24 at 10:20 -0400, Jason Gunthorpe wrote: > > > On Fri, Nov 24, 2023 at 03:10:29PM +0100, Niklas Schnelle wrote: > > > > > > > What's the reasoning behind not using the existing memcpy_toio() > > > > here? > > > > > > Going forward CPUs are implementing an instruction to do a 64 byte > > > aligned store, this is a wrapper for exactly that operation. > > > > > > memcpy_toio() is much more general, it allows unaligned buffers and > > > non-multiples of 64. Adapting the general version to generate the > > > optimized version in the cases it can is complex and has a codegen > > > penalty.. > > > > I think you misunderstood me. I understand why you want a separate > > memcpy_toio_64(). I just wonder if its generic implementation shouldn't > > just be a define or inline wrapper for memcpy_toio(addr, buffer, 64). > > Oh, yes, I totally did. > > I'm worried that x86 will less reliably generate write combining with > it's memcpy_toio implemention. It codegens byte copies for that > function :( Oh ok I see what you mean. > > > Also seeing the second patch of course that would no longer really test > > for write combining for us which we can also do but I think that's okay > > and you're probably going to use memcpy_toio_64() in more places and > > there we really want the PCI store block. > > Right now we don't have in-kernel performance use cases for write > combining for mlx5. Is the code in patch 2 performance critical? > > Userspace uses the WC and we already have the special 390 instructions > for batching in rdma-core already, IIRC. Yes, I added that support to rdma-core :-) > > So it would be appropriate for s390 to use a consistent path. > > Jason This should be as easy as adding #define memcpy_toio_64(to, from) zpci_memcpy_toio(to, from, 64) to arch/s390/include/asm/io.h. I'm wondering if we should do that as part of this series. It's not as good as a special case but probably better than the existing loop. I don't think we have any existing in-kernel users of memcpy_toio() on s390 so far though so I'd like to give this some extra testing. Could you share instructions on how to exercise the code path of patch 2 on a ConnectX-5 or 6? Is this exercised e.g. when using NVMe-oF RDMA? Thanks, Niklas