On Fri, Nov 24, 2023 at 10:16:15AM +0000, Mark Rutland wrote:
> On Thu, Nov 23, 2023 at 09:04:31PM +0200, Leon Romanovsky wrote:
> > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> >
> > The kernel supports write combining IO memory which is commonly used to
> > generate 64 byte TLPs in a PCIe environment. On many CPUs this mechanism
> > is pretty tolerant and a simple C loop will suffice to generate a 64 byte
> > TLP.
> >
> > However modern ARM64 CPUs are quite sensitive and a compiler generated
> > loop is not enough to reliably generate a 64 byte TLP. Especially given
> > the ARM64 issue that writel() does not codegen anything other than "[xN]"
> > as the address calculation.
> >
> > These newer CPUs require an orderly consecutive block of stores to work
> > reliably. This is best done with four STP integer instructions (perhaps
> > ST64B in future), or a single ST4 vector instruction.
> >
> > Provide a new generic function memcpy_toio_64() which should reliably
> > generate the needed instructions for the architecture, assuming address
> > alignment. As the usual need for this operation is performance sensitive a
> > fast inline implementation is preferred.
>
> There is *no* architectural sequence that is guaranteed to reliably generate a
> 64-byte TLP, and this sequence won't guarantee that (e.g. even if the CPU
> *always* merged adjacent stores, we can take an interrupt mid-sequence that
> would prevent that).

WC is not guaranteed on any arch, that is well known. The HW has means
to handle fragmented TLPs, it just hurts performance when it happens.

"reliable" here means we'd like to see something like a > 90% chance
of the large TLP instead of the < 1% chance with the C loop.

Future ARM CPUs have the ST64B instruction which does provide the
architectural guarantee, and x86 has a similar guaranteed instruction
now too.

> What's the actual requirement here? Is this just for performance?

Yes, just performance.

Jason
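
For illustration only, here is a rough userspace sketch of the "four STP"
idea from the commit message. It is not the code from the patch:
copy64_sketch() and copy64_selftest() are hypothetical names, the
destination is ordinary memory rather than a WC mapping, and both
pointers are assumed to be suitably aligned. On arm64 it emits four
consecutive STPs with simple base-plus-immediate addressing; elsewhere it
falls back to a plain loop of 64-bit stores. As discussed above, neither
path guarantees a single 64 byte TLP, it only improves the odds.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not the kernel's memcpy_toio_64()): copy one
 * 64-byte block to dst. Assumes dst and src are 8-byte aligned; real WC
 * users would normally want dst 64-byte aligned.
 */
static inline void copy64_sketch(volatile void *dst, const void *src)
{
	const uint64_t *s = src;

#if defined(__aarch64__)
	/* Four back-to-back STPs so the stores stay consecutive. */
	__asm__ volatile("stp %x0, %x1, [%8]\n\t"
			 "stp %x2, %x3, [%8, #16]\n\t"
			 "stp %x4, %x5, [%8, #32]\n\t"
			 "stp %x6, %x7, [%8, #48]"
			 :
			 : "r"(s[0]), "r"(s[1]), "r"(s[2]), "r"(s[3]),
			   "r"(s[4]), "r"(s[5]), "r"(s[6]), "r"(s[7]),
			   "r"(dst)
			 : "memory");
#else
	/* Fallback: the simple C loop that more tolerant CPUs cope with. */
	volatile uint64_t *d = dst;

	for (int i = 0; i < 8; i++)
		d[i] = s[i];
#endif
}

/* Returns 1 if all 64 bytes arrived intact, 0 otherwise. */
static int copy64_selftest(void)
{
	_Alignas(64) uint64_t src[8];
	_Alignas(64) uint64_t dst[8];

	for (int i = 0; i < 8; i++) {
		src[i] = 0x0101010101010101ULL * (uint64_t)(i + 1);
		dst[i] = 0;
	}

	copy64_sketch(dst, src);
	return memcmp(dst, src, sizeof(src)) == 0;
}
```

The self-test only checks that the bytes are copied correctly. Whether
the stores actually merge into one TLP can only be observed on real
hardware with a WC mapping.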