On Fri, Nov 24, 2023 at 03:48:22PM +0100, Niklas Schnelle wrote: > On Fri, 2023-11-24 at 10:20 -0400, Jason Gunthorpe wrote: > > On Fri, Nov 24, 2023 at 03:10:29PM +0100, Niklas Schnelle wrote: > > > > > What's the reasoning behind not using the existing memcpy_toio() > > > here? > > > > Going forward CPUs are implementing an instruction to do a 64 byte > > aligned store, this is a wrapper for exactly that operation. > > > > memcpy_toio() is much more general, it allows unaligned buffers and > > non-multiples of 64. Adapting the general version to generate the > > optimized version in the cases it can is complex and has a codegen > > penalty.. > > I think you misunderstood me. I understand why you want a separate > memcpy_toio_64(). I just wonder if its generic implementation shouldn't > just be a define or inline wrapper for memcpy_toio(addr, buffer, 64). Oh, yes, I totally did. I'm worried that x86 will less reliably generate write combining with it's memcpy_toio implemention. It codegens byte copies for that function :( > Also seeing the second patch of course that would no longer really test > for write combining for us which we can also do but I think that's okay > and you're probably going to use memcpy_toio_64() in more places and > there we really want the PCI store block. Right now we don't have in-kernel performance use cases for write combining for mlx5. Userspace uses the WC and we already have the special 390 instructions for batching in rdma-core already, IIRC. So it would be appropriate for s390 to use a consistent path. Jason