On Mon, Nov 27, 2023 at 09:45:05AM -0400, Jason Gunthorpe wrote: > On Mon, Nov 27, 2023 at 12:42:41PM +0000, Catalin Marinas wrote: > > > > What's the actual requirement here? Is this just for performance? > > > > > > Yes, just performance. > > > > Do you have any rough numbers (percentage)? It's highly > > microarchitecture-dependent until we get the ST64B instruction. > > The current C code is an open coded store loop. The kernel does 250 > tries and measures if any one of them succeeds to combine. > > On x86, and older ARM cores we see that 100% of the time at least 1 in > 250 tries succeeds. > > With the new CPU cores we see more like 9 out of 10 time there are 0 > in 250 tries that succeed. Ie we can go thousands of times without > seeing any successful WC combine. > > The STP block brings it back to 100% of the time 1 in 250 succeed. That's a bit confusing to me: 1 in 250 succeeding is still pretty rare. But I guess what your benchmark says is that at least 1 succeeded to write-combine and it might as well be all 250 tries. It's more interesting to see if there's actual performance gain in real-world traffic, not just some artificial benchmark (I may have misunderstood your numbers above). > However, in userspace we have long been using ST4 to create a > single-instruction 64 byte store on ARM64. As far as I know this is > highly reliable. I don't have direct data on the STP configuration. Personally I'd optimise the mempcy_toio() arm64 implementation to do STPs if the alignment is right (like we do for classic memcpy()). There's a slight overhead for alignment checking but I suspect it would be lost as long as you can get the write-combining. Not sure whether the interspersed reads in memcpy_toio() would somehow prevent the write-combining. A memcpy_toio_64() can use the new ST64B instruction if available or fall back to memcpy_toio() on arm64. It should also have the DGH instruction (io_stop_wc()) but only if falling back to classic memcpy_toio(). We don't need DGH with ST64B. > > More of a bike-shedding, I wonder whether the __iowrite*_copy() > > semantics are better suited for what you need in terms of ordering (not > > that mempcy_toio() to Normal NC memory gives us any ordering). > > I have the same remark I gave to Niklas, this does not require > alignment or an exact 64 byte size. It was clearly made to support WC > stores since Pathscale did it, but I don't see this mapping nicely to > the future 64 byte store instructions are we getting. As above, I'd suggest just using memcpy_toio() as a fallback if ST64B is not available. > We could name it __iowrite512_copy() if that makes more sense? I've been thinking at the __iowrite*_copy() and these also take a 'count' argument. I assume in this instance we don't really need one, so it's just additional overhead (more like API clutter, I doubt it makes much difference for performance). I'd say just stick to the mempcy_toio_64() but have the io_stop_wc() inside this function as we won't need it with ST64B. Well, unless someone has a better name for this function. -- Catalin