On Fri, Feb 23, 2024 at 01:52:37PM +0000, David Laight wrote:

> > > Since writes get 'posted' all over the place.
> > > How many writes do you need to do before write-combining makes a
> > > difference?
> >
> > The issue is that the HW can optimize if the entire transaction is
> > presented in one TLP, if it has to reassemble the transaction it takes
> > a big slow path hit.
>
> Ah, so you aren't optimising to reduce the number of TLP for
> (effectively) a write to a memory buffer, but have a pcie slave
> that really want to see (for example) the writes for a ring buffer
> entry in a single TLP?
>
> So you really want something that (should) generate a 16 (or 32)
> byte TLP? Rather than abusing the function that is expected to
> generate multiple 8 byte TLP to generate larger TLP.

__iowriteXX_copy() was originally created by Pathscale (an RDMA device
company) to support RDMA drivers doing exactly this workload. It is
not an abuse.

> It is rather a shame that there isn't an efficient way to get
> access to a couple of large SIMD registers.

Yes, userspace uses SIMD to make this work a lot better and run faster.

Jason
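
For readers unfamiliar with the helper discussed above, here is a rough
userspace sketch of the store pattern behind __iowrite64_copy(): one
64-bit store per word, with the count given in 64-bit units. The name
sketch_iowrite64_copy() is hypothetical and is not a kernel API. A real
driver would target an __iomem mapping and use the kernel's raw MMIO
accessors. Whether consecutive stores actually merge into a single TLP
depends on the CPU and on the mapping being write-combining, so the
sketch only illustrates the access pattern and guarantees nothing about
what appears on the bus.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace stand-in, not the kernel implementation.
 * Copies `count` 64-bit words from `from` to `to`, issuing exactly one
 * 64-bit store per word in ascending address order. The volatile
 * qualifier keeps the compiler from merging, splitting, or dropping the
 * stores.
 */
static void sketch_iowrite64_copy(volatile uint64_t *to,
                                  const uint64_t *from, size_t count)
{
    for (size_t i = 0; i < count; i++)
        to[i] = from[i];
}
```

With count = 8 this issues eight back-to-back stores covering 64
contiguous bytes, which is the kind of burst a write-combining mapping
may be able to present to the device as one transaction.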