On Mon, Dec 04, 2023 at 05:31:47PM +0000, Catalin Marinas wrote:
> On Mon, Nov 27, 2023 at 09:45:05AM -0400, Jason Gunthorpe wrote:
> > On Mon, Nov 27, 2023 at 12:42:41PM +0000, Catalin Marinas wrote:
> > > > > What's the actual requirement here? Is this just for performance?
> > > >
> > > > Yes, just performance.
> > >
> > > Do you have any rough numbers (percentage)? It's highly
> > > microarchitecture-dependent until we get the ST64B instruction.
> >
> > The current C code is an open coded store loop. The kernel does 250
> > tries and measures if any one of them succeeds to combine.
> >
> > On x86, and older ARM cores, we see that 100% of the time at least 1 in
> > 250 tries succeeds.
> >
> > With the new CPU cores we see more like 9 out of 10 times there are 0
> > in 250 tries that succeed. Ie, we can go thousands of times without
> > seeing any successful WC combine.
> >
> > The STP block brings it back to 100% of the time 1 in 250 succeeds.
>
> That's a bit confusing to me: 1 in 250 succeeding is still pretty rare.
> But I guess what your benchmark says is that at least 1 succeeded to
> write-combine and it might as well be all 250 tries. It's more
> interesting to see if there's actual performance gain in real-world
> traffic, not just some artificial benchmark (I may have misunderstood
> your numbers above).

Yes, I just don't have better data available to say that 250/250
succeeded, but we expect that is the case.

We now have something like 20 years of experience with write-combining
performance on x86 systems. It brings real-world gains in real-world HPC
applications. Indeed, the reason this even came up was because one of
our existing applications was performing unexpectedly badly on these
ARM64 servers.

We even have data showing that having the CPU do all the write-combining
steps and then fail to get write-combining is notably slower than just
assuming no write combining. That is why we go through the trouble to
test the physical HW.
> > However, in userspace we have long been using ST4 to create a
> > single-instruction 64 byte store on ARM64. As far as I know this is
> > highly reliable. I don't have direct data on the STP configuration.
>
> Personally I'd optimise the memcpy_toio() arm64 implementation to do
> STPs if the alignment is right (like we do for classic memcpy()).
> There's a slight overhead for alignment checking but I suspect it would
> be lost as long as you can get the write-combining. Not sure whether the
> interspersed reads in memcpy_toio() would somehow prevent the
> write-combining.

I understand that on these new CPUs anything other than a block of
contiguous STPs is risky to break the WC. I was told we should not have
any loads between them.

So we can't just update memcpy_toio() to optimize a 128 bit store
variant like memcpy() might. We actually need a special case just for 64
bytes. IMHO it does not look good, as the chance any existing callers
can use this optimized 64B path is probably small, but everyone has to
pay the cost of checking for it.

I also would not do this on x86 - Pathscale apparently decided they
needed special __iowrite*_copy() helpers to actually make this work on
some x86 systems - and I'm very leery of changing x86 away from the 64
bit copy loop we already know works there.

IMHO encoding the alignment expectation in the API is best, especially
since this is typically a performance path.

> A memcpy_toio_64() can use the new ST64B instruction if available or
> fall back to memcpy_toio() on arm64. It should also have the DGH
> instruction (io_stop_wc()) but only if falling back to classic
> memcpy_toio(). We don't need DGH with ST64B.

I'm told ST64B is problematic, something about it not working with
NORMAL_NC. We could fold the DGH into the helper though. IMHO I'd like
to see how ST64B actually gets implemented before doing that. If the
note about NORMAL_NC is true then we need a lot more infrastructure to
actually use it.
Also, in a future ST64B world we are going to see HW start relying on
large TLPs as a functional requirement, not just an optional performance
win. To my mind it makes more sense to have an API that guarantees a
large TLP or oopses. We really don't want an automatic fallback to
memcpy.

Jason