RE: [PATCH 4/6] arm64/io: Provide a WC friendly __iowriteXX_copy()

David Laight <David.Laight@xxxxxxxxxx> · Fri, 23 Feb 2024 13:52:37 +0000

From: Jason Gunthorpe
> Sent: 23 February 2024 13:03
> 
> On Fri, Feb 23, 2024 at 12:19:24PM +0000, David Laight wrote:
> 
> > Since writes get 'posted' all over the place.
> > How many writes do you need to do before write-combining makes a
> > difference?
> 
> The issue is that the HW can optimize if the entire transaction is
> presented in one TLP, if it has to reassemble the transaction it takes
> a big slow path hit.

Ah, so you aren't optimising to reduce the number of TLPs for
(effectively) a write to a memory buffer, but have a PCIe slave
that really wants to see (for example) the writes for a ring buffer
entry in a single TLP?

So you really want something that (should) generate a 16 (or 32)
byte TLP? Rather than abusing the function that is expected to
generate multiple 8 byte TLPs in order to generate a larger TLP.

I'm guessing that on arm64 the ldp/stp instructions will generate
a single 16 byte TLP regardless of write combining?
They would definitely help memcpy_fromio().
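
To make the stp idea concrete, here is a minimal userspace sketch,
assuming GCC/Clang inline asm. store16() is a hypothetical helper, not
an existing kernel API. Whether the store-pair actually reaches the
device as a single 16 byte TLP depends on the CPU and interconnect,
and this sketch cannot show that. On non-arm64 builds it falls back to
two 8 byte stores.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): store two 64-bit values to a
 * 16-byte aligned destination, using one instruction where possible.
 */
static inline void store16(volatile void *dst, uint64_t lo, uint64_t hi)
{
#if defined(__aarch64__)
	/* One store-pair instruction covering all 16 bytes. */
	asm volatile("stp %1, %2, [%0]"
		     : : "r" (dst), "r" (lo), "r" (hi) : "memory");
#else
	/* Fallback for other architectures: two separate 8-byte stores. */
	volatile uint64_t *p = dst;

	p[0] = lo;
	p[1] = hi;
#endif
}
```

For real MMIO the destination would be an __iomem pointer and the
helper would sit alongside the other arch io accessors.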

Are they enough for arm64?
Getting big TLPs on x86 is probably harder.
(Unless you use AVX512 registers and aligned accesses.)
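
As a rough userspace illustration of the AVX512 option, assuming the
compiler intrinsics from <immintrin.h>: store64() is a hypothetical
helper, and nothing here proves that the 64 byte store ends up as one
TLP on any given system. Without AVX512F enabled at build time it just
falls back to memcpy(). In the kernel this would also need
kernel_fpu_begin()/kernel_fpu_end() around it.

```c
#include <stdint.h>
#include <string.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/*
 * Hypothetical helper: copy 64 bytes to a 64-byte aligned destination.
 * The source may be unaligned.
 */
static inline void store64(void *dst, const void *src)
{
#if defined(__AVX512F__)
	/* One unaligned 64-byte load, then one aligned 64-byte store. */
	__m512i v = _mm512_loadu_si512(src);

	_mm512_store_si512(dst, v);
#else
	/* Fallback: the compiler chooses the store sizes. */
	memcpy(dst, src, 64);
#endif
}
```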

It is rather a shame that there isn't an efficient way to get
access to a couple of large SIMD registers.
(eg save them on the stack and have the fpu code know where they are
for a lazy fpu switch.)
There is quite a bit of code that would benefit, but kernel_fpu_begin()
is just too expensive.

	David
