On Mon 09-03-20 03:08:20, Kirill A. Shutemov wrote:
> On Fri, Mar 06, 2020 at 05:03:53PM -0800, Cannon Matthews wrote:
> > Reimplement clear_gigantic_page() to clear gigantic pages using the
> > non-temporal streaming store instructions that bypass the cache
> > (movnti), since an entire 1GiB region will not fit in the cache anyway.
> >
> > Doing an mlock() on a 512GiB 1G-hugetlb region previously would take on
> > average 134 seconds, about 260ms/GiB which is quite slow. Using `movnti`
> > and optimizing the control flow over the constituent small pages, this
> > can be improved roughly by a factor of 3-4x, with the 512GiB mlock()
> > taking only 34 seconds on average, or 67ms/GiB.
> >
> > The assembly code for the __clear_page_nt routine is more or less
> > taken directly from the output of gcc with -O3 for this function with
> > some tweaks to support arbitrary sizes and moving memory barriers:
> >
> > void clear_page_nt_64i (void *page)
> > {
> >   for (int i = 0; i < GiB / sizeof(long long int); ++i)
> >     {
> >       _mm_stream_si64 (((long long int*)page) + i, 0);
> >     }
> >   sfence();
> > }
> >
> > Tested:
> >   Time to `mlock()` a 512GiB region on broadwell CPU
> >                     AVG time (s)    % imp.    ms/page
> >   clear_page_erms     133.584         -        261
> >   clear_page_nt        34.154       74.43%      67
>
> Some macrobenchmark would be great too.
>
> > An earlier version of this code was sent as an RFC patch ~July 2018
> > https://patchwork.kernel.org/patch/10543193/ but never merged.
>
> Andi and I tried to use MOVNTI for large/gigantic page clearing back in
> 2012[1]. Maybe it can be useful.
>
> That patchset is somewhat more complex, trying to keep the memory around
> the fault address hot in cache. In theory it should help to reduce latency
> on the first access to the memory.
>
> I was not able to get convincing numbers back then for the hardware of the
> time. Maybe it's better now.
>
> [1] https://lore.kernel.org/r/1345470757-12005-1-git-send-email-kirill.shutemov@xxxxxxxxxxxxxxx

Thanks for the reminder.
I've had only a very vague recollection; your series had a much wider
scope indeed. Since then we have gained process_huge_page, which tries to
optimize normal huge pages.

Gigantic huge pages are a bit different. They are much less dynamic from
the usage POV in my experience, so micro-optimizations for the first
access tend not to matter at all, as it is usually a pre-allocation
scenario. On the other hand, speeding up the initialization sounds like a
good thing in general. It will be a one-time benefit, but if the
additional code is not hard to maintain then I would be inclined to take
it even with the "artificial" numbers stated above. There really
shouldn't be any other downsides except for the code maintenance, right?

-- 
Michal Hocko
SUSE Labs