Re: [PATCH] mm: clear 1G pages with streaming stores on x86

Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> · Sat, 7 Mar 2020 14:06:16 -0800

On Fri,  6 Mar 2020 17:03:53 -0800 Cannon Matthews <cannonmatthews@xxxxxxxxxx> wrote:

> Reimplement clear_gigantic_page() to clear gigabytes pages using the
> non-temporal streaming store instructions that bypass the cache
> (movnti), since an entire 1GiB region will not fit in the cache anyway.
> 
> Doing an mlock() on a 512GiB 1G-hugetlb region previously would take on
> average 134 seconds, about 260ms/GiB which is quite slow. Using `movnti`
> and optimizing the control flow over the constituent small pages, this
> can be improved roughly by a factor of 3-4x, with the 512GiB mlock()
> taking only 34 seconds on average, or 67ms/GiB.
> 
> The assembly code for the __clear_page_nt routine is more or less
> taken directly from the output of gcc with -O3 for this function with
> some tweaks to support arbitrary sizes and moving memory barriers:
> 
> void clear_page_nt_64i (void *page)
> {
>   for (int i = 0; i < GiB /sizeof(long long int); ++i)
>     {
>       _mm_stream_si64 (((long long int*)page) + i, 0);
>     }
>   sfence();
> }
> 
> Tested:
> 	Time to `mlock()` a 512GiB region on broadwell CPU
> 				AVG time (s)	% imp.	ms/page
> 	clear_page_erms		133.584		-	261
> 	clear_page_nt		34.154		74.43%	67
> 
> An earlier version of this code was sent as an RFC patch ~July 2018
> https://patchwork.kernel.org/patch/10543193/ but never merged.
> 
> ...
>
>  MAINTAINERS                        |  1 +
>  arch/x86/Kconfig                   |  4 ++++
>  arch/x86/include/asm/page_64.h     |  1 +
>  arch/x86/lib/Makefile              |  2 +-
>  arch/x86/lib/clear_gigantic_page.c | 28 ++++++++++++++++++++++++++++
>  arch/x86/lib/clear_page_64.S       | 19 +++++++++++++++++++
>  include/linux/mm.h                 |  2 ++
>  mm/memory.c                        |  2 ++

Please cc the x86 maintainers on such things.

>
> ...
>
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -70,6 +70,7 @@ config X86
>  	select ARCH_HAS_KCOV			if X86_64
>  	select ARCH_HAS_MEM_ENCRYPT
>  	select ARCH_HAS_MEMBARRIER_SYNC_CORE
> +	select ARCH_HAS_CLEAR_GIGANTIC_PAGE	if X86_64
>  	select ARCH_HAS_PMEM_API		if X86_64
>  	select ARCH_HAS_PTE_DEVMAP		if X86_64
>  	select ARCH_HAS_PTE_SPECIAL
> @@ -290,6 +291,9 @@ config ARCH_MAY_HAVE_PC_FDC
>  config GENERIC_CALIBRATE_DELAY
>  	def_bool y
>  
> +config ARCH_HAS_CLEAR_GIGANTIC_PAGE
> +	bool

Opinions might differ, but I believe the Linus-approved way of doing
this sort of thing is

#ifndef clear_gigantic_page
extern void clear_gigantic_page(...);
#define clear_gigantic_page clear_gigantic_page
#endif

alternatively, __weak works nicely.

>
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2856,6 +2856,8 @@ enum mf_action_page_type {
>  };
>  
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> +extern void clear_gigantic_page(struct page *page, unsigned long addr,
> +				unsigned int pages);

Shouldn't this be inside #ifdef CONFIG_ARCH_HAS_CLEAR_GIGANTIC_PAGE?

Also, the third argument here is called "pages" but here:

> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4706,6 +4706,7 @@ static inline void process_huge_page(
>  	}
>  }
>  
> +#ifndef CONFIG_ARCH_HAS_CLEAR_GIGANTIC_PAGE
>  static void clear_gigantic_page(struct page *page,
>  				unsigned long addr,
>  				unsigned int pages_per_huge_page)

it is called "pages_per_huge_page".

> @@ -4720,6 +4721,7 @@ static void clear_gigantic_page(struct page *page,
>  		clear_user_highpage(p, addr + i * PAGE_SIZE);
>  	}
>  }
> +#endif  /* CONFIG_ARCH_HAS_CLEAR_GIGANTIC_PAGE */
>  
> ...