Mike Kravetz <mike.kravetz@xxxxxxxxxx> 于2019年9月4日周三 上午5:26写道: > > On 8/29/19 6:50 AM, Zhigang Lu wrote: > > From: Zhigang Lu <tonnylu@xxxxxxxxxxx> > > > > When mmapping an existing hugetlbfs file with MAP_POPULATE, we find > > it is very time consuming. For example, mmapping a 128GB file takes > > about 50 milliseconds. Sampling with perfevent shows it spends 99% > > time in the same_page loop in follow_hugetlb_page(). > > > > samples: 205 of event 'cycles', Event count (approx.): 136686374 > > - 99.04% test_mmap_huget [kernel.kallsyms] [k] follow_hugetlb_page > > follow_hugetlb_page > > __get_user_pages > > __mlock_vma_pages_range > > __mm_populate > > vm_mmap_pgoff > > sys_mmap_pgoff > > sys_mmap > > system_call_fastpath > > __mmap64 > > > > follow_hugetlb_page() is called with pages=NULL and vmas=NULL, so for > > each hugepage, we run into the same_page loop for pages_per_huge_page() > > times, but doing nothing. With this change, it takes less then 1 > > millisecond to mmap a 128GB file in hugetlbfs. > > Thanks for the analysis! > > Just curious, do you have an application that does this (mmap(MAP_POPULATE) > for an existing hugetlbfs file), or was this part of some test suite or > debug code? Yes, DPDK and SPDK actually do this in vhost_user.c. vhost_setup_mem_table() { ... mmap_size = RTE_ALIGN_CEIL(mmap_size, alignment); mmap_addr = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, 0); ... } > > > Signed-off-by: Zhigang Lu <tonnylu@xxxxxxxxxxx> > > Reviewed-by: Haozhong Zhang <hzhongzhang@xxxxxxxxxxx> > > Reviewed-by: Zongming Zhang <knightzhang@xxxxxxxxxxx> > > Acked-by: Matthew Wilcox <willy@xxxxxxxxxxxxx> > > --- > > mm/hugetlb.c | 11 +++++++++++ > > 1 file changed, 11 insertions(+) > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index 6d7296d..2df941a 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -4391,6 +4391,17 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, > > break; > > } > > } > > It might be helpful to add a comment here to help readers of the code. > Something like: > Thanks, I will add this comment and send a new version. > /* > * If subpage information not requested, update counters > * and skip the same_page loop below. > */ > > + > > + if (!pages && !vmas && !pfn_offset && > > + (vaddr + huge_page_size(h) < vma->vm_end) && > > + (remainder >= pages_per_huge_page(h))) { > > + vaddr += huge_page_size(h); > > + remainder -= pages_per_huge_page(h); > > + i += pages_per_huge_page(h); > > + spin_unlock(ptl); > > + continue; > > + } > > + > > same_page: > > if (pages) { > > pages[i] = mem_map_offset(page, pfn_offset); > > > > With a comment added to the code, > Reviewed-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx> > -- > Mike Kravetz