On 05/23/2018 05:58 PM, Huang, Ying wrote:
> From: Huang Ying <ying.huang@xxxxxxxxx>
>
> Huge pages help to reduce the TLB miss rate, but they have a larger
> cache footprint, which may sometimes cause problems.  For example,
> when copying a huge page on an x86_64 platform, the cache footprint
> is 4M.  But on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads,
> and only 45M LLC (last level cache).  That is, on average, there is
> 2.5M LLC per core and 1.25M LLC per thread.
>
> If cache contention is heavy while copying the huge page, and we copy
> the huge page from beginning to end, it is possible that the beginning
> of the huge page has already been evicted from the cache by the time
> we finish copying the end of the huge page.  And it is likely that the
> application will access the beginning of the huge page right after the
> copy.
>
> In commit c79b57e462b5d ("mm: hugetlb: clear target sub-page last when
> clearing huge page"), to keep the cache lines of the target subpage
> hot, the order in which clear_huge_page() clears the subpages of the
> huge page was changed so that the subpage furthest from the target
> subpage is cleared first and the target subpage last.  A similar
> ordering change helps huge page copying too; that is what this patch
> implements.  Because the ordering algorithm has been put into a
> separate function, the implementation is quite simple.
>
> The patch is a generic optimization which should benefit quite a few
> workloads, not just a specific use case.  To demonstrate the
> performance benefit of the patch, we tested it with vm-scalability
> running on transparent huge pages.
>
> With this patch, the throughput increases ~16.6% in the vm-scalability
> anon-cow-seq test case with 36 processes on a 2-socket Xeon E5 v3 2699
> system (36 cores, 72 threads).  The test case sets
> /sys/kernel/mm/transparent_hugepage/enabled to "always", mmap()s a big
> anonymous memory area and populates it, then forks 36 child processes,
> each of which writes to the anonymous memory area from beginning to
> end, causing copy-on-write.  For each child process, the other child
> processes can be seen as other workloads which generate heavy cache
> pressure.  At the same time, the IPC (instructions per cycle)
> increased from 0.63 to 0.78, and the time spent in user space was
> reduced by ~7.2%.
>
> Signed-off-by: "Huang, Ying" <ying.huang@xxxxxxxxx>

Reviewed-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
--
Mike Kravetz

> Cc: Andi Kleen <andi.kleen@xxxxxxxxx>
> Cc: Jan Kara <jack@xxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxx>
> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
> Cc: Matthew Wilcox <mawilcox@xxxxxxxxxxxxx>
> Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> Cc: Minchan Kim <minchan@xxxxxxxxxx>
> Cc: Shaohua Li <shli@xxxxxx>
> Cc: Christopher Lameter <cl@xxxxxxxxx>
> Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> ---
>  include/linux/mm.h |  3 ++-
>  mm/huge_memory.c   |  3 ++-
>  mm/memory.c        | 30 +++++++++++++++++++++++-------
>  3 files changed, 27 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7cdd8b7f62e5..d227aadaa964 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2734,7 +2734,8 @@ extern void clear_huge_page(struct page *page,
>  				unsigned long addr_hint,
>  				unsigned int pages_per_huge_page);
>  extern void copy_user_huge_page(struct page *dst, struct page *src,
> -				unsigned long addr, struct vm_area_struct *vma,
> +				unsigned long addr_hint,
> +				struct vm_area_struct *vma,
>  				unsigned int pages_per_huge_page);
>  extern long copy_huge_page_from_user(struct page *dst_page,
>  				const void __user *usr_src,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e9177363fe2e..1b7fd9bda1dc 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1328,7 +1328,8 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>  	if (!page)
>  		clear_huge_page(new_page, vmf->address, HPAGE_PMD_NR);
>  	else
> -		copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
> +		copy_user_huge_page(new_page, page, vmf->address,
> +				    vma, HPAGE_PMD_NR);
>  	__SetPageUptodate(new_page);
>
>  	mmun_start = haddr;
> diff --git a/mm/memory.c b/mm/memory.c
> index b9f573a81bbd..5d432f833d19 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4675,11 +4675,31 @@ static void copy_user_gigantic_page(struct page *dst, struct page *src,
>  	}
>  }
>
> +struct copy_subpage_arg {
> +	struct page *dst;
> +	struct page *src;
> +	struct vm_area_struct *vma;
> +};
> +
> +static void copy_subpage(unsigned long addr, int idx, void *arg)
> +{
> +	struct copy_subpage_arg *copy_arg = arg;
> +
> +	copy_user_highpage(copy_arg->dst + idx, copy_arg->src + idx,
> +			   addr, copy_arg->vma);
> +}
> +
>  void copy_user_huge_page(struct page *dst, struct page *src,
> -			 unsigned long addr, struct vm_area_struct *vma,
> +			 unsigned long addr_hint, struct vm_area_struct *vma,
>  			 unsigned int pages_per_huge_page)
>  {
> -	int i;
> +	unsigned long addr = addr_hint &
> +		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
> +	struct copy_subpage_arg arg = {
> +		.dst = dst,
> +		.src = src,
> +		.vma = vma,
> +	};
>
>  	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
>  		copy_user_gigantic_page(dst, src, addr, vma,
> @@ -4687,11 +4707,7 @@ void copy_user_huge_page(struct page *dst, struct page *src,
>  		return;
>  	}
>
> -	might_sleep();
> -	for (i = 0; i < pages_per_huge_page; i++) {
> -		cond_resched();
> -		copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, vma);
> -	}
> +	process_huge_page(addr_hint, pages_per_huge_page, copy_subpage, &arg);
>  }
>
>  long copy_huge_page_from_user(struct page *dst_page,
>