On Mon 09-03-20 03:08:20, Kirill A. Shutemov wrote:
> On Fri, Mar 06, 2020 at 05:03:53PM -0800, Cannon Matthews wrote:
> > Reimplement clear_gigantic_page() to clear gigantic pages using the
> > non-temporal streaming store instructions that bypass the cache
> > (movnti), since an entire 1GiB region will not fit in the cache anyway.
> >
> > Doing an mlock() on a 512GiB 1G-hugetlb region previously would take on
> > average 134 seconds, about 260ms/GiB which is quite slow. Using `movnti`
> > and optimizing the control flow over the constituent small pages, this
> > can be improved roughly by a factor of 3-4x, with the 512GiB mlock()
> > taking only 34 seconds on average, or 67ms/GiB.
> >
> > The assembly code for the __clear_page_nt routine is more or less
> > taken directly from the output of gcc with -O3 for this function with
> > some tweaks to support arbitrary sizes and moving memory barriers:
> >
> > void clear_page_nt_64i (void *page)
> > {
> >   for (int i = 0; i < GiB / sizeof(long long int); ++i)
> >     {
> >       _mm_stream_si64 (((long long int*)page) + i, 0);
> >     }
> >   sfence();
> > }
> >
> > Tested:
> >   Time to `mlock()` a 512GiB region on broadwell CPU
> >                     AVG time (s)    % imp.    ms/page
> >   clear_page_erms     133.584         -        261
> >   clear_page_nt        34.154       74.43%      67
>
> Some macrobenchmark would be great too.
>
> > An earlier version of this code was sent as an RFC patch ~July 2018
> > https://patchwork.kernel.org/patch/10543193/ but never merged.
>
> Andi and I tried to use MOVNTI for large/gigantic page clearing back in
> 2012[1]. Maybe it can be useful.
>
> That patchset is somewhat more complex, trying to keep the memory around
> the fault address hot in cache. In theory it should help to reduce latency
> on the first access to the memory.
>
> I was not able to get convincing numbers back then for the hardware of the
> time. Maybe it's better now.
>
> [1] https://lore.kernel.org/r/1345470757-12005-1-git-send-email-kirill.shutemov@xxxxxxxxxxxxxxx

Thanks for the reminder.
I've had only a very vague recollection; your series had a much wider
scope indeed. Since then we have gained process_huge_page, which tries to
optimize normal huge pages.

Gigantic huge pages are a bit different. They are much less dynamic from
the usage POV in my experience, so micro-optimizations for the first
access tend not to matter at all, as it is usually a pre-allocation
scenario. On the other hand, speeding up the initialization sounds like a
good thing in general. It will be a one-time benefit, but if the
additional code is not hard to maintain then I would be inclined to take
it even with the "artificial" numbers stated above. There really
shouldn't be any other downsides except for the code maintenance, right?

-- 
Michal Hocko
SUSE Labs