On Mon, Dec 04, 2023 at 05:31:47PM +0000, Catalin Marinas wrote:
> On Mon, Nov 27, 2023 at 09:45:05AM -0400, Jason Gunthorpe wrote:
> > On Mon, Nov 27, 2023 at 12:42:41PM +0000, Catalin Marinas wrote:
> > > > > What's the actual requirement here? Is this just for performance?
> > > >
> > > > Yes, just performance.
> > >
> > > Do you have any rough numbers (percentage)? It's highly
> > > microarchitecture-dependent until we get the ST64B instruction.
> >
> > The current C code is an open coded store loop. The kernel does 250
> > tries and measures if any one of them succeeds to combine.
> >
> > On x86, and older ARM cores, we see that 100% of the time at least 1 in
> > 250 tries succeeds.
> >
> > With the new CPU cores we see more like 9 out of 10 times there are 0
> > in 250 tries that succeed. Ie, we can go thousands of times without
> > seeing any successful WC combine.
> >
> > The STP block brings it back to 100% of the time 1 in 250 succeeds.
>
> That's a bit confusing to me: 1 in 250 succeeding is still pretty rare.
> But I guess what your benchmark says is that at least 1 succeeded to
> write-combine and it might as well be all 250 tries. It's more
> interesting to see if there's actual performance gain in real-world
> traffic, not just some artificial benchmark (I may have misunderstood
> your numbers above).

Yes, I just don't have better data available to say that 250/250
succeeded, but we expect that is the case.

We now have something like 20 years of experience with write-combining
performance on x86 systems. It brings real-world gains in real-world HPC
applications. Indeed, the reason this even came up was because one of
our existing applications was performing unexpectedly badly on these
ARM64 servers.

We even have data showing that having the CPU do all the write-combining
steps and then fail to get write-combining is notably slower than just
assuming no write combining. That is why we go through the trouble to
test the physical HW.
> > However, in userspace we have long been using ST4 to create a
> > single-instruction 64 byte store on ARM64. As far as I know this is
> > highly reliable. I don't have direct data on the STP configuration.
>
> Personally I'd optimise the memcpy_toio() arm64 implementation to do
> STPs if the alignment is right (like we do for classic memcpy()).
> There's a slight overhead for alignment checking but I suspect it would
> be lost as long as you can get the write-combining. Not sure whether the
> interspersed reads in memcpy_toio() would somehow prevent the
> write-combining.

I understand that on these new CPUs anything other than a block of
contiguous STPs is risky to break the WC. I was told we should not have
any loads between them.

So we can't just update memcpy_toio() to optimize a 128 bit store
variant like memcpy() might. We actually need a special case just for 64
bytes. IMHO it does not look good, as the chance any existing callers
can use this optimized 64B path is probably small, but everyone has to
pay the cost of checking for it.

I also would not do this on x86 - Pathscale apparently decided they
needed special __iowrite*_copy() helpers to actually make this work on
some x86 systems - and I'm very leery of changing x86 away from the 64
bit copy loop we already know works there.

IMHO encoding the alignment expectation in the API is best, especially
since this is typically a performance path.

> A memcpy_toio_64() can use the new ST64B instruction if available or
> fall back to memcpy_toio() on arm64. It should also have the DGH
> instruction (io_stop_wc()) but only if falling back to classic
> memcpy_toio(). We don't need DGH with ST64B.

I'm told ST64B is problematic, something about it not working with
NORMAL_NC. We could fold the DGH into the helper though. IMHO I'd like
to see how ST64B actually gets implemented before doing that. If the
note about NORMAL_NC is true then we need a lot more infrastructure to
actually use it.
Also, in a future ST64B world we are going to see HW start relying on
large TLPs as a functional requirement, not just an optional performance
win. To my mind it makes more sense to have an API that guarantees a
large TLP or oopses. We really don't want an automatic fallback to
memcpy.

Jason