On Fri, 2023-11-24 at 10:20 -0400, Jason Gunthorpe wrote:
> On Fri, Nov 24, 2023 at 03:10:29PM +0100, Niklas Schnelle wrote:
>
> > What's the reasoning behind not using the existing memcpy_toio()
> > here?
>
> Going forward CPUs are implementing an instruction to do a 64 byte
> aligned store, this is a wrapper for exactly that operation.
>
> memcpy_toio() is much more general, it allows unaligned buffers and
> non-multiples of 64. Adapting the general version to generate the
> optimized version in the cases it can is complex and has a codegen
> penalty..

I think you misunderstood me. I understand why you want a separate
memcpy_toio_64(). I just wonder if its generic implementation shouldn't
just be a define or inline wrapper for memcpy_toio(addr, buffer, 64).
For s390 that would already result in a single PCI store block, which
for us is much much better than 8 consecutive __raw_writeq().

Our zpci_memcpy_toio() still has some extra code to ensure alignment
and break it up in supported sizes that we could get rid of with our
own memcpy_toio_64() of course. I suspect though that since it's all
inline functions, the compiler seeing the constant 64 might already
eliminate some of the extra code.

Also, seeing the second patch, of course that would no longer really
test for write combining for us, which we can also do, but I think
that's okay and you're probably going to use memcpy_toio_64() in more
places and there we really want the PCI store block.

Thanks,
Niklas
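
P.S. To make the suggestion concrete, here is a rough userspace sketch of what I mean by a generic wrapper. The memcpy_toio() below is only a stand-in built on plain memcpy() so the snippet compiles outside the kernel; the real one takes a `volatile void __iomem *` destination and is provided per architecture. The memcpy_toio_64() signature is my assumption, not necessarily what the patch uses.

```c
#include <stddef.h>
#include <string.h>

/*
 * Userspace stand-in for the kernel's memcpy_toio(). This is an
 * assumption for illustration only: the real function writes to
 * MMIO space and is implemented by each architecture.
 */
static inline void memcpy_toio(volatile void *dst, const void *src,
			       size_t count)
{
	memcpy((void *)dst, src, count);
}

/*
 * Hypothetical generic fallback: forward to memcpy_toio() with a
 * constant length of 64. An architecture with a dedicated 64 byte
 * store could override this with its own implementation.
 */
static inline void memcpy_toio_64(volatile void *dst, const void *src)
{
	memcpy_toio(dst, src, 64);
}
```

Because the length is a compile-time constant and everything is inline, the compiler at least has the chance to drop size handling that can never trigger for 64 bytes; whether it actually does would need checking in the generated code.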