On Fri, Jan 26, 2024 at 02:56:31PM +0000, Catalin Marinas wrote:
> On Thu, Jan 25, 2024 at 01:43:33PM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 24, 2024 at 03:26:34PM -0400, Jason Gunthorpe wrote:
> > > The suggestion that it should not have any interleaving instructions
> > > and use STP came from our CPU architecture team.
> >
> > I got some more details here.
> >
> > They point to the ARM publication about write combining
> >
> > https://community.arm.com/cfs-file/__key/telligent-evolution-components-attachments/13-150-00-00-00-00-10-12/Understanding_5F00_Write_5F00_Combining_5F00_on_5F00_Arm_5F00_V.1.0.pdf
> >
> > specifically to the example code using 4x 128 bit NEON stores.
>
> That's an example but this document doesn't make any statements about
> 64-bit writes.

ARM has consistently left this area informally specified, by
documents like this one.

This document arose specifically because a certain implementation
chose an architecturally compliant way to do write combining, but it
was informally decided that Linux upstream would not support it.

These gaps were discovered during DOE's pathfinding Astra
supercomputer program about 6 years ago, during testing with mlx5
devices. The document was specifically intended to guide HPC
implementations expecting to run inside the Linux ecosystem.

Based on all this, I'm not surprised that the ecosystem has decided to
focus primarily on consecutive 128 bit writes. Absent any other
guidance, designers are following what information they have.

Jason