From: Jason Gunthorpe
> Sent: 23 February 2024 13:03
>
> On Fri, Feb 23, 2024 at 12:19:24PM +0000, David Laight wrote:
> > > Since writes get 'posted' all over the place.
> > How many writes do you need to do before write-combining makes a
> > difference?
>
> The issue is that the HW can optimize if the entire transaction is
> presented in one TLP, if it has to reassemble the transaction it takes
> a big slow path hit.

Ah, so you aren't optimising to reduce the number of TLPs for
(effectively) a write to a memory buffer, but have a PCIe slave that
really wants to see (for example) the writes for a ring buffer entry
in a single TLP?

So you really want something that (should) generate a 16 (or 32)
byte TLP, rather than abusing the function that is expected to
generate multiple 8 byte TLPs in order to get larger TLPs?

I'm guessing that on arm64 the ldp/stp instructions will generate
a single 16 byte TLP regardless of write combining?
They would definitely help memcpy_fromio().
Are they enough for arm64?

Getting big TLPs on x86 is probably harder.
(Unless you use AVX512 registers and aligned accesses.)

It is rather a shame that there isn't an efficient way to get
access to a couple of large SIMD registers.
(eg save them on the stack and have the fpu code know where they are
for a lazy fpu switch.)
There is quite a bit of code that would benefit, but kernel_fpu_begin()
is just too expensive.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
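
As a rough illustration of the ldp/stp idea above, a 16 byte ring entry
could be stored with a single stp on arm64. This is only a sketch:
write16_entry() is a hypothetical helper, not an existing kernel function,
and whether the store actually reaches the device as one 16 byte TLP
depends on the CPU and interconnect, which is assumed here rather than
guaranteed. On other architectures it falls back to two 8 byte stores.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): store one 16-byte entry.
 * dst must be 16-byte aligned.
 */
static inline void write16_entry(volatile void *dst, const uint64_t src[2])
{
	assert(((uintptr_t)dst & 15) == 0);

#if defined(__aarch64__)
	/*
	 * A single stp writes both 64-bit halves in one instruction.
	 * Assumption: the interconnect forwards this as one 16-byte write.
	 */
	__asm__ volatile("stp %1, %2, [%0]"
			 :
			 : "r"(dst), "r"(src[0]), "r"(src[1])
			 : "memory");
#else
	/* Fallback: two separate 64-bit stores, possibly two TLPs. */
	volatile uint64_t *d = dst;

	d[0] = src[0];
	d[1] = src[1];
#endif
}
```

A real MMIO version would take an __iomem pointer and follow the kernel's
I/O accessor ordering rules; those details are left out of this sketch.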