On Mon, Mar 09, 2020 at 08:38:31AM -0700, Andi Kleen wrote:
> > Gigantic huge pages are a bit different. They are much less dynamic from
> > the usage POV in my experience. Micro-optimizations for the first access
> > tends to not matter at all as it is usually pre-allocation scenario. On
> > the other hand, speeding up the initialization sounds like a good thing
> > in general. It will be a single time benefit but if the additional code
> > is not hard to maintain then I would be inclined to take it even with
> > "artificial" numbers state above. There really shouldn't be other downsides
> > except for the code maintenance, right?
>
> There's a cautious tale of the old crappy RAID5 XOR assembler functions which
> were optimized a long time ago for the Pentium1, and stayed around,
> even though the compiler could actually do a better job.
>
> String instructions are constantly improving in performance (Broadwell is
> very old at this point) Most likely over time (and maybe even today
> on newer CPUs) you would need much more sophisticated unrolled MOVNTI variants
> (or maybe even AVX-*) to be competitive.

Presumably you have access to current and maybe even some unreleased
CPUs ... I mean, he's posted the patches, so you can test this hypothesis.