On 1/26/21 9:21 PM, Mike Kravetz wrote:
> On 1/26/21 11:21 AM, Joao Martins wrote:
>> On 1/26/21 6:08 PM, Mike Kravetz wrote:
>>> On 1/25/21 12:57 PM, Joao Martins wrote:
>>>>
>>>> +static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
>>>> +				 int refs, struct page **pages,
>>>> +				 struct vm_area_struct **vmas)
>>>> +{
>>>> +	int nr;
>>>> +
>>>> +	for (nr = 0; nr < refs; nr++) {
>>>> +		if (likely(pages))
>>>> +			pages[nr] = page++;
>>>> +		if (vmas)
>>>> +			vmas[nr] = vma;
>>>> +	}
>>>> +}
>>>> +
>>>>  long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>>>>  			 struct page **pages, struct vm_area_struct **vmas,
>>>>  			 unsigned long *position, unsigned long *nr_pages,
>>>> @@ -4918,28 +4932,16 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>>>>  			continue;
>>>>  		}
>>>>
>>>> -		refs = 0;
>>>> +		refs = min3(pages_per_huge_page(h) - pfn_offset,
>>>> +			    (vma->vm_end - vaddr) >> PAGE_SHIFT, remainder);
>>>>
>>>> -same_page:
>>>> -		if (pages)
>>>> -			pages[i] = mem_map_offset(page, pfn_offset);
>>>> +		if (pages || vmas)
>>>> +			record_subpages_vmas(mem_map_offset(page, pfn_offset),
>>>
>>> The assumption made here is that mem_map is contiguous for the range of
>>> pages in the hugetlb page.  I do not believe you can make this assumption
>>> for (gigantic) hugetlb pages which are > MAX_ORDER_NR_PAGES.  For example,
>>>
>
> Thinking about this a bit more ...
>
> mem_map can be accessed contiguously if we have a virtual memmap.  Correct?

Right.

> I suspect virtual memmap may be the most common configuration today.  However,
> it seems we do need to handle other configurations.
>
At the moment, for offsets >= MAX_ORDER_NR_PAGES, mem_map_offset(page, n) does:

	pfn_to_page(page_to_pfn(page) + n)

With CONFIG_SPARSEMEM_VMEMMAP or FLATMEM this resolves into something like:

	vmemmap + ((page - vmemmap) + n)

which isn't really different from just incrementing @page.  I can only think
of CONFIG_SPARSEMEM (without VMEMMAP) and CONFIG_DISCONTIGMEM as the offending
cases, which respectively look into the section info or the pgdat.

[CONFIG_DISCONTIGMEM isn't auto-selected by any arch at the moment.]

>> That would mean get_user_pages_fast() and pin_user_pages_fast() are broken for anything
>> handling PUDs or above? See record_subpages() in gup_huge_pud() or even gup_huge_pgd().
>> It's using the same page++.
>
> Yes, I believe those would also have the issue.
> Cc: John and Jason as they have spent a significant amount of time in gup
> code recently.  There may be something that makes that code safe?
>
Maybe.  Looking back, gup-fast has always relied on that page pointer
arithmetic, even before its refactors around __record_subpages() and the like.

>> This adjustment below probably is what you're trying to suggest.
>>
>> Also, nth_page() is slightly more expensive and so the numbers above change from ~4.4k
>> usecs to ~7.8k usecs.
>
> If my thoughts about virtual memmap are correct, then could we simply have
> a !vmemmap version of mem_map_offset (or similar routine) to avoid overhead?
>
In that case, we could ifdef out on SPARSEMEM || DISCONTIGMEM for
mem_map_offset(), either internally or within the helper I added for
follow_hugetlb_page().  Something along the lines of the sketch below.
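
Illustrative only and untested, just to make the idea concrete: the pfn
arithmetic in the first branch mirrors what nth_page() does for SPARSEMEM
without VMEMMAP, while the #else branch keeps the cheap pointer math for the
virtually contiguous memmap case.

static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
				 int refs, struct page **pages,
				 struct vm_area_struct **vmas)
{
	int nr;

	for (nr = 0; nr < refs; nr++) {
		if (likely(pages))
#if defined(CONFIG_DISCONTIGMEM) || \
    (defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP))
			/* memmap may not be virtually contiguous */
			pages[nr] = pfn_to_page(page_to_pfn(page) + nr);
#else
			/* virtually contiguous memmap: plain pointer arithmetic */
			pages[nr] = page + nr;
#endif
		if (vmas)
			vmas[nr] = vma;
	}
}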