On Tue, Mar 10, 2020 at 05:21:30PM -0700, Cannon Matthews wrote:
> On Mon, Mar 9, 2020 at 11:37 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >
> > On Mon, Mar 09, 2020 at 08:38:31AM -0700, Andi Kleen wrote:
> > > > Gigantic huge pages are a bit different. They are much less dynamic from
> > > > the usage POV in my experience. Micro-optimizations for the first access
> > > > tends to not matter at all as it is usually pre-allocation scenario. On
> > > > the other hand, speeding up the initialization sounds like a good thing
> > > > in general. It will be a single time benefit but if the additional code
> > > > is not hard to maintain then I would be inclined to take it even with
> > > > "artificial" numbers state above. There really shouldn't be other downsides
> > > > except for the code maintenance, right?
> > >
> > > There's a cautious tale of the old crappy RAID5 XOR assembler functions which
> > > were optimized a long time ago for the Pentium1, and stayed around,
> > > even though the compiler could actually do a better job.
> > >
> > > String instructions are constantly improving in performance (Broadwell is
> > > very old at this point) Most likely over time (and maybe even today
> > > on newer CPUs) you would need much more sophisticated unrolled MOVNTI variants
> > > (or maybe even AVX-*) to be competitive.
> >
> > Presumably you have access to current and maybe even some unreleased
> > CPUs ... I mean, he's posted the patches, so you can test this hypothesis.
>
> I don't have the data at hand, but could reproduce it if strongly
> desired, but I've also tested this on skylake and cascade lake, and
> we've had success running with this for a while now.
>
> When developing this originally, I tested all of this compared with
> AVX-* instructions as well as the string ops, they all seemed to be
> functionally equivalent, and all were beat out by this MOVNTI thing for
> large regions of 1G pages.
>
> There is probably room to further optimize the MOVNTI stuff with better
> loop unrolling or optimizations, if anyone has specific suggestions I'm
> happy to try to incorporate them, but this has shown to be effective as
> written so far, and I think I lack that assembly expertise to micro
> optimize further on my own.

Andi's point is that string instructions might be a better bet in the
long run. You may win something with MOVNTI on current CPUs, but it may
become a burden on newer microarchitectures when string instructions
improve. Realistically, nobody would re-validate whether the MOVNTI
micro-optimization still makes sense for every new microarchitecture.

>
> But just in general, while there are probably some ways this could be
> made better, it does a good job so far for the workloads that are more
> specific to 1G pages.
>
> Making it work for 2MiB in a convincing general purpose way is a harder
> problem and feels out of scope, and further optimizations can always be
> added later on for some other things.
>
> I'm working on a v2 of this patch addressing some of the nits mentioned
> by Andrew, should have that hopefully soon.

Have you got any data for a macrobenchmark?

-- 
 Kirill A. Shutemov
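
For readers following the thread, the two approaches being compared can be
sketched roughly in user space as below. This is only an illustration, not
the code from the patch (which is kernel assembly): clear_nt() and
clear_plain() are hypothetical names, the 4x unrolling factor is an
arbitrary assumption, and the non-temporal path uses the SSE2 intrinsic
_mm_stream_si64(), which typically compiles to MOVNTI on x86-64. Whether
memset() ends up using string instructions depends on the libc and CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__x86_64__)
#include <emmintrin.h>	/* _mm_stream_si64(), _mm_sfence() */
#endif

/*
 * Hypothetical helper: zero a buffer with non-temporal 8-byte stores,
 * unrolled 4x. Assumes dst is 8-byte aligned and len is a multiple of
 * 32 bytes. On non-x86-64 builds it falls back to ordinary stores.
 */
static void clear_nt(void *dst, size_t len)
{
	long long *p = dst;
	size_t n = len / sizeof(*p);
	size_t i;

	assert(((uintptr_t)dst % sizeof(*p)) == 0);
	assert(len % (4 * sizeof(*p)) == 0);

	for (i = 0; i < n; i += 4) {
#if defined(__x86_64__)
		/* Stores that bypass the cache hierarchy. */
		_mm_stream_si64(&p[i + 0], 0);
		_mm_stream_si64(&p[i + 1], 0);
		_mm_stream_si64(&p[i + 2], 0);
		_mm_stream_si64(&p[i + 3], 0);
#else
		p[i + 0] = 0;
		p[i + 1] = 0;
		p[i + 2] = 0;
		p[i + 3] = 0;
#endif
	}
#if defined(__x86_64__)
	/* Non-temporal stores are weakly ordered; fence before returning. */
	_mm_sfence();
#endif
}

/* Baseline: let the C library pick its own clearing strategy. */
static void clear_plain(void *dst, size_t len)
{
	memset(dst, 0, len);
}
```

Timing both helpers over a large buffer on each CPU generation of interest
would be the microbenchmark half of the comparison; it would not answer the
macrobenchmark question above.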