On Wed, Jan 24, 2024 at 03:26:34PM -0400, Jason Gunthorpe wrote:

> The suggestion that it should not have any interleaving instructions
> and use STP came from our CPU architecture team.

I got some more details here. They point to the ARM publication about
write combining

https://community.arm.com/cfs-file/__key/telligent-evolution-components-attachments/13-150-00-00-00-00-10-12/Understanding_5F00_Write_5F00_Combining_5F00_on_5F00_Arm_5F00_V.1.0.pdf

specifically to the example code using 4x 128 bit NEON stores.

They point at the actual CPU design and say it is optimized for 128 bit
stores (STP and ST4 included, it seems). 64 bit stores trigger some
different behavior.

I have no way to know if it will be OK for other drivers that expect
this to be a performance path in the kernel.

Are you *sure* you want to do this str version? If it works for mlx5 I
will send the patch and the other companies can come later with
performance data.

Jason