On Tue, Feb 20, 2024 at 09:17:08PM -0400, Jason Gunthorpe wrote:
> The kernel provides driver support for using write combining IO memory
> through the __iowriteXX_copy() API which is commonly used as an optional
> optimization to generate 16/32/64 byte MemWr TLPs in a PCIe environment.
>
> iomap_copy.c provides a generic implementation as a simple 4/8 byte at a
> time copy loop that has worked well with past ARM64 CPUs, giving a high
> frequency of large TLPs being successfully formed.
>
> However modern ARM64 CPUs are quite sensitive to how the write combining
> CPU HW is operated and a compiler generated loop with intermixed
> load/store is not sufficient to frequently generate a large TLP. The CPUs
> would like to see the entire TLP generated by consecutive store
> instructions from registers. Compilers like gcc tend to intermix loads and
> stores and have poor code generation, in part, due to the ARM64 situation
> that writeq() does not codegen anything other than "[xN]". However even
> with that resolved compilers like clang still do not have good code
> generation.
>
> This means on modern ARM64 CPUs the rate at which __iowriteXX_copy()
> successfully generates large TLPs is very small (less than 1 in 10,000)
> tries), to the point that the use of WC is pointless.
>
> Implement __iowrite32/64_copy() specifically for ARM64 and use inline
> assembly to build consecutive blocks of STR instructions. Provide direct
> support for 64/32/16 large TLP generation in this manner. Optimize for
> common constant lengths so that the compiler can directly inline the store
> blocks.
>
> This brings the frequency of large TLP generation up to a high level that
> is comparable with older CPU generations.
>
> As the __iowriteXX_copy() family of APIs is intended for use with WC
> incorporate the DGH hint directly into the function.
>
> Cc: Arnd Bergmann <arnd@xxxxxxxx>
> Cc: Catalin Marinas <catalin.marinas@xxxxxxx>
> Cc: Will Deacon <will@xxxxxxxxxx>
> Cc: Mark Rutland <mark.rutland@xxxxxxx>
> Cc: linux-arch@xxxxxxxxxxxxxxx
> Cc: linux-arm-kernel@xxxxxxxxxxxxxxxxxxx
> Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxx>

Apart from the slightly more complicated code, I don't expect it to make
things worse on any of the existing hardware. So, with the typo fix that
Will mentioned:

Reviewed-by: Catalin Marinas <catalin.marinas@xxxxxxx>
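
For readers following along without the diff, the approach described in the
commit message can be sketched roughly as below. This is a minimal userspace
illustration and not the code from the patch: the names sketch_dgh(),
sketch_store_block64() and sketch_iowrite64_copy() are hypothetical, the
destination is assumed to be suitably aligned, and the non-ARM64 branch is
there only so the sketch builds elsewhere. The idea is that all eight source
values are already in registers when the asm block starts, so the eight STRs
are emitted back to back with no loads in between, and the DGH hint (encoded
as "hint #6") follows the copy.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the DGH hint; compiles to nothing off ARM64. */
static inline void sketch_dgh(void)
{
#if defined(__aarch64__)
	asm volatile("hint #6" : : : "memory");
#endif
}

/*
 * Hypothetical helper: store one 64-byte block as eight consecutive
 * 64-bit stores. On ARM64 the inputs are register operands, so the
 * compiler must finish every load before the first STR is issued.
 */
static inline void sketch_store_block64(volatile uint64_t *to,
					const uint64_t *from)
{
#if defined(__aarch64__)
	asm volatile("str %x0, [%8, #0]\n"
		     "str %x1, [%8, #8]\n"
		     "str %x2, [%8, #16]\n"
		     "str %x3, [%8, #24]\n"
		     "str %x4, [%8, #32]\n"
		     "str %x5, [%8, #40]\n"
		     "str %x6, [%8, #48]\n"
		     "str %x7, [%8, #56]\n"
		     :
		     : "r"(from[0]), "r"(from[1]), "r"(from[2]), "r"(from[3]),
		       "r"(from[4]), "r"(from[5]), "r"(from[6]), "r"(from[7]),
		       "r"(to)
		     : "memory");
#else
	/* Portable fallback: same result, no guarantee on store ordering. */
	uint64_t v[8];
	size_t i;

	for (i = 0; i < 8; i++)
		v[i] = from[i];
	for (i = 0; i < 8; i++)
		to[i] = v[i];
#endif
}

/* Hypothetical copy routine: whole 64-byte blocks first, then the tail. */
static inline void sketch_iowrite64_copy(volatile uint64_t *to,
					 const uint64_t *from, size_t count)
{
	while (count >= 8) {
		sketch_store_block64(to, from);
		to += 8;
		from += 8;
		count -= 8;
	}
	while (count--)
		*to++ = *from++;
	sketch_dgh();
}
```

When count is a compile-time constant such as 8, the loop in a static inline
function like this can collapse to a single inlined store block, which is
roughly the effect the commit message describes for common constant lengths.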