Re: [PATCH rdma-next 1/2] arm64/io: add memcpy_toio_64

Jason Gunthorpe <jgg@xxxxxxxxxx> · Fri, 24 Nov 2023 10:55:29 -0400

On Fri, Nov 24, 2023 at 03:48:22PM +0100, Niklas Schnelle wrote:
> On Fri, 2023-11-24 at 10:20 -0400, Jason Gunthorpe wrote:
> > On Fri, Nov 24, 2023 at 03:10:29PM +0100, Niklas Schnelle wrote:
> >  
> > > What's the reasoning behind not using the existing memcpy_toio()
> > > here?
> > 
> > Going forward CPUs are implementing an instruction to do a 64 byte
> > aligned store, this is a wrapper for exactly that operation.
> > 
> > memcpy_toio() is much more general, it allows unaligned buffers and
> > non-multiples of 64. Adapting the general version to generate the
> > optimized version in the cases it can is complex and has a codegen
> > penalty.
> 
> I think you misunderstood me. I understand why you want a separate
> memcpy_toio_64(). I just wonder if its generic implementation shouldn't
> just be a define or inline wrapper for memcpy_toio(addr, buffer, 64).

Oh, yes, I totally did.

I'm worried that x86 will less reliably generate write combining with
its memcpy_toio implementation. It codegens byte copies for that
function :(
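
For concreteness, the generic fallback being discussed could look roughly
like the sketch below. It is a userspace mock, not the patch: the mock_
names are hypothetical, memcpy() stands in for the arch-specific
memcpy_toio(), and the 64-byte alignment check is an assumption made
only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Userspace stand-in for the kernel's memcpy_toio(). The real function
 * takes an __iomem destination and is implemented per architecture;
 * here it is just memcpy() so the sketch compiles outside the kernel.
 */
static inline void mock_memcpy_toio(volatile void *to, const void *from,
				    size_t count)
{
	memcpy((void *)to, from, count);
}

/*
 * Hypothetical generic fallback: copy exactly 64 bytes by deferring to
 * the general routine. An architecture with a real 64-byte store would
 * override this with its own implementation.
 */
static inline void mock_memcpy_toio_64(volatile void *to, const void *from)
{
	/* Assumption for this sketch: destination is 64-byte aligned. */
	assert(((uintptr_t)to & 63) == 0);
	mock_memcpy_toio(to, from, 64);
}
```

Whether such a wrapper actually produces a single write-combined burst
depends entirely on how the underlying memcpy_toio() is code generated,
which is the concern above.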

> Also, seeing the second patch, of course that would no longer really
> test for write combining for us, which we can also do, but I think
> that's okay. You're probably going to use memcpy_toio_64() in more
> places, and there we really want the PCI store block.

Right now we don't have in-kernel performance use cases for write
combining for mlx5.

Userspace uses WC, and we already have the special s390 instructions
for batching in rdma-core, IIRC.

So it would be appropriate for s390 to use a consistent path.

Jason