On Fri, 6 Mar 2020 17:03:53 -0800 Cannon Matthews <cannonmatthews@xxxxxxxxxx> wrote: > Reimplement clear_gigantic_page() to clear gigabytes pages using the > non-temporal streaming store instructions that bypass the cache > (movnti), since an entire 1GiB region will not fit in the cache anyway. > > Doing an mlock() on a 512GiB 1G-hugetlb region previously would take on > average 134 seconds, about 260ms/GiB which is quite slow. Using `movnti` > and optimizing the control flow over the constituent small pages, this > can be improved roughly by a factor of 3-4x, with the 512GiB mlock() > taking only 34 seconds on average, or 67ms/GiB. > > The assembly code for the __clear_page_nt routine is more or less > taken directly from the output of gcc with -O3 for this function with > some tweaks to support arbitrary sizes and moving memory barriers: > > void clear_page_nt_64i (void *page) > { > for (int i = 0; i < GiB /sizeof(long long int); ++i) > { > _mm_stream_si64 (((long long int*)page) + i, 0); > } > sfence(); > } > > Tested: > Time to `mlock()` a 512GiB region on broadwell CPU > AVG time (s) % imp. ms/page > clear_page_erms 133.584 - 261 > clear_page_nt 34.154 74.43% 67 > > An earlier version of this code was sent as an RFC patch ~July 2018 > https://patchwork.kernel.org/patch/10543193/ but never merged. > > ... > > MAINTAINERS | 1 + > arch/x86/Kconfig | 4 ++++ > arch/x86/include/asm/page_64.h | 1 + > arch/x86/lib/Makefile | 2 +- > arch/x86/lib/clear_gigantic_page.c | 28 ++++++++++++++++++++++++++++ > arch/x86/lib/clear_page_64.S | 19 +++++++++++++++++++ > include/linux/mm.h | 2 ++ > mm/memory.c | 2 ++ Please cc the x86 maintainers on such things. > > ... > > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -70,6 +70,7 @@ config X86 > select ARCH_HAS_KCOV if X86_64 > select ARCH_HAS_MEM_ENCRYPT > select ARCH_HAS_MEMBARRIER_SYNC_CORE > + select ARCH_HAS_CLEAR_GIGANTIC_PAGE if X86_64 > select ARCH_HAS_PMEM_API if X86_64 > select ARCH_HAS_PTE_DEVMAP if X86_64 > select ARCH_HAS_PTE_SPECIAL > @@ -290,6 +291,9 @@ config ARCH_MAY_HAVE_PC_FDC > config GENERIC_CALIBRATE_DELAY > def_bool y > > +config ARCH_HAS_CLEAR_GIGANTIC_PAGE > + bool Opinions might differ, but I believe the Linus-approved way of doing this sort of thing is #ifndef clear_gigantic_page extern void clear_gigantic_page(...); #define clear_gigantic_page clear_gigantic_page #endif alternatively, __weak works nicely. > > ... > > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2856,6 +2856,8 @@ enum mf_action_page_type { > }; > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > +extern void clear_gigantic_page(struct page *page, unsigned long addr, > + unsigned int pages); Shouldn't this be inside #ifdef CONFIG_ARCH_HAS_CLEAR_GIGANTIC_PAGE? Also, the third argument here is called "pages" but here: > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4706,6 +4706,7 @@ static inline void process_huge_page( > } > } > > +#ifndef CONFIG_ARCH_HAS_CLEAR_GIGANTIC_PAGE > static void clear_gigantic_page(struct page *page, > unsigned long addr, > unsigned int pages_per_huge_page) it is called "pages_per_huge_page". > @@ -4720,6 +4721,7 @@ static void clear_gigantic_page(struct page *page, > clear_user_highpage(p, addr + i * PAGE_SIZE); > } > } > +#endif /* CONFIG_ARCH_HAS_CLEAR_GIGANTIC_PAGE */ > > ...