On Mon, Nov 27, 2023 at 12:42:41PM +0000, Catalin Marinas wrote:

> > > What's the actual requirement here? Is this just for performance?
> >
> > Yes, just performance.
>
> Do you have any rough numbers (percentage)? It's highly
> microarchitecture-dependent until we get the ST64B instruction.

The current C code is an open coded store loop. The kernel does 250
tries and measures whether any one of them succeeds in combining.

On x86 and older ARM cores we see that 100% of the time at least 1 in
250 tries succeeds.

With the new CPU cores we see that about 9 times out of 10, 0 of the
250 tries succeed. I.e. we can go thousands of times without seeing any
successful WC combine.

The STP block brings it back to 100% of the time at least 1 in 250
succeeding. This is a statistical lower bound; based on what we see
performance-wise, it almost always works.

However, in userspace we have long been using ST4 to create a
single-instruction 64 byte store on ARM64. As far as I know this is
highly reliable. I don't have direct data on the STP configuration.

> More of a bike-shedding, I wonder whether the __iowrite*_copy()
> semantics are better suited for what you need in terms of ordering (not
> that memcpy_toio() to Normal NC memory gives us any ordering).

I have the same remark I gave to Niklas: this does not require
alignment or an exact 64 byte size. It was clearly made to support WC
stores since Pathscale did it, but I don't see this mapping nicely to
the future 64 byte store instructions we are getting.

We could name it __iowrite512_copy() if that makes more sense?

Thanks,
Jason
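
For illustration only, the "open coded store loop" shape discussed above can be sketched in portable userspace C roughly as follows. The name iowrite512_copy_sketch() and its signature are hypothetical, not an existing kernel interface, and the 8-byte alignment check is an assumption of this sketch. Plain C stores also give no guarantee that the CPU combines them into one 64 byte write; that still depends on the microarchitecture or on an explicit STP/ST4/ST64B sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch, not a kernel API: copy 'count' blocks of 64 bytes
 * (512 bits) from 'from' to 'to'. Each block is written as eight
 * back-to-back 64-bit stores in ascending address order, which is the
 * pattern a write-combining mapping would hopefully merge.
 */
static void iowrite512_copy_sketch(volatile uint64_t *to,
                                   const uint64_t *from, size_t count)
{
	/* Assumption of this sketch: both pointers are 8-byte aligned. */
	assert(((uintptr_t)to & 7) == 0);
	assert(((uintptr_t)from & 7) == 0);

	while (count--) {
		to[0] = from[0];
		to[1] = from[1];
		to[2] = from[2];
		to[3] = from[3];
		to[4] = from[4];
		to[5] = from[5];
		to[6] = from[6];
		to[7] = from[7];
		to += 8;   /* advance one 64 byte block */
		from += 8;
	}
}
```

A caller would pass the number of 64 byte blocks, e.g. iowrite512_copy_sketch(dst, src, 2) to copy 128 bytes.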