On 05/23/2018 05:58 PM, Huang, Ying wrote:
> From: Huang Ying <ying.huang@xxxxxxxxx>
>
> Huge pages help to reduce the TLB miss rate, but they have a larger
> cache footprint, which may sometimes cause problems.  For example,
> when copying a huge page on an x86_64 platform, the cache footprint
> is 4M.  But on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads,
> and only 45M LLC (last level cache).  That is, on average, there is
> 2.5M LLC per core and 1.25M LLC per thread.
>
> If cache contention is heavy while copying the huge page, and we copy
> the huge page from beginning to end, it is possible that the beginning
> of the huge page has already been evicted from the cache by the time
> we finish copying the end of the huge page.  And it is likely that the
> application will access the beginning of the huge page right after the
> copy.
>
> In commit c79b57e462b5d ("mm: hugetlb: clear target sub-page last when
> clearing huge page"), to keep the cache lines of the target subpage
> hot, the order in which clear_huge_page() clears the subpages of the
> huge page was changed so that the subpage furthest from the target
> subpage is cleared first and the target subpage last.  A similar
> ordering change helps huge page copying too; that is what this patch
> implements.  Because the ordering algorithm has been put into a
> separate function, the implementation is quite simple.
>
> The patch is a generic optimization which should benefit quite a few
> workloads, not just a specific use case.  To demonstrate the
> performance benefit of the patch, we tested it with vm-scalability
> running on transparent huge pages.
>
> With this patch, the throughput increases ~16.6% in the vm-scalability
> anon-cow-seq test case with 36 processes on a 2-socket Xeon E5 v3 2699
> system (36 cores, 72 threads).  The test case sets
> /sys/kernel/mm/transparent_hugepage/enabled to "always", mmap()s a big
> anonymous memory area and populates it, then forks 36 child processes,
> each of which writes to the anonymous memory area from beginning to
> end, causing copy-on-write.  For each child process, the other child
> processes can be seen as other workloads which generate heavy cache
> pressure.  At the same time, the IPC (instructions per cycle)
> increased from 0.63 to 0.78, and the time spent in user space was
> reduced by ~7.2%.
>
> Signed-off-by: "Huang, Ying" <ying.huang@xxxxxxxxx>

Reviewed-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
--
Mike Kravetz

> Cc: Andi Kleen <andi.kleen@xxxxxxxxx>
> Cc: Jan Kara <jack@xxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxx>
> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
> Cc: Matthew Wilcox <mawilcox@xxxxxxxxxxxxx>
> Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> Cc: Minchan Kim <minchan@xxxxxxxxxx>
> Cc: Shaohua Li <shli@xxxxxx>
> Cc: Christopher Lameter <cl@xxxxxxxxx>
> Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> ---
>  include/linux/mm.h |  3 ++-
>  mm/huge_memory.c   |  3 ++-
>  mm/memory.c        | 30 +++++++++++++++++++++++-------
>  3 files changed, 27 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7cdd8b7f62e5..d227aadaa964 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2734,7 +2734,8 @@ extern void clear_huge_page(struct page *page,
>  				unsigned long addr_hint,
>  				unsigned int pages_per_huge_page);
>  extern void copy_user_huge_page(struct page *dst, struct page *src,
> -				unsigned long addr, struct vm_area_struct *vma,
> +				unsigned long addr_hint,
> +				struct vm_area_struct *vma,
>  				unsigned int pages_per_huge_page);
>  extern long copy_huge_page_from_user(struct page *dst_page,
>  				const void __user *usr_src,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e9177363fe2e..1b7fd9bda1dc 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1328,7 +1328,8 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>  	if (!page)
>  		clear_huge_page(new_page, vmf->address, HPAGE_PMD_NR);
>  	else
> -		copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
> +		copy_user_huge_page(new_page, page, vmf->address,
> +				    vma, HPAGE_PMD_NR);
>  	__SetPageUptodate(new_page);
>
>  	mmun_start = haddr;
> diff --git a/mm/memory.c b/mm/memory.c
> index b9f573a81bbd..5d432f833d19 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4675,11 +4675,31 @@ static void copy_user_gigantic_page(struct page *dst, struct page *src,
>  	}
>  }
>
> +struct copy_subpage_arg {
> +	struct page *dst;
> +	struct page *src;
> +	struct vm_area_struct *vma;
> +};
> +
> +static void copy_subpage(unsigned long addr, int idx, void *arg)
> +{
> +	struct copy_subpage_arg *copy_arg = arg;
> +
> +	copy_user_highpage(copy_arg->dst + idx, copy_arg->src + idx,
> +			   addr, copy_arg->vma);
> +}
> +
>  void copy_user_huge_page(struct page *dst, struct page *src,
> -			 unsigned long addr, struct vm_area_struct *vma,
> +			 unsigned long addr_hint, struct vm_area_struct *vma,
>  			 unsigned int pages_per_huge_page)
>  {
> -	int i;
> +	unsigned long addr = addr_hint &
> +		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
> +	struct copy_subpage_arg arg = {
> +		.dst = dst,
> +		.src = src,
> +		.vma = vma,
> +	};
>
>  	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
>  		copy_user_gigantic_page(dst, src, addr, vma,
> @@ -4687,11 +4707,7 @@ void copy_user_huge_page(struct page *dst, struct page *src,
>  		return;
>  	}
>
> -	might_sleep();
> -	for (i = 0; i < pages_per_huge_page; i++) {
> -		cond_resched();
> -		copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, vma);
> -	}
> +	process_huge_page(addr_hint, pages_per_huge_page, copy_subpage, &arg);
>  }
>
>  long copy_huge_page_from_user(struct page *dst_page,
>