On 1/26/21 4:07 PM, Jason Gunthorpe wrote: > On Tue, Jan 26, 2021 at 01:21:46PM -0800, Mike Kravetz wrote: >> On 1/26/21 11:21 AM, Joao Martins wrote: >>> On 1/26/21 6:08 PM, Mike Kravetz wrote: >>>> On 1/25/21 12:57 PM, Joao Martins wrote: >>>>> >>>>> +static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma, >>>>> + int refs, struct page **pages, >>>>> + struct vm_area_struct **vmas) >>>>> +{ >>>>> + int nr; >>>>> + >>>>> + for (nr = 0; nr < refs; nr++) { >>>>> + if (likely(pages)) >>>>> + pages[nr] = page++; >>>>> + if (vmas) >>>>> + vmas[nr] = vma; >>>>> + } >>>>> +} >>>>> + >>>>> long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, >>>>> struct page **pages, struct vm_area_struct **vmas, >>>>> unsigned long *position, unsigned long *nr_pages, >>>>> @@ -4918,28 +4932,16 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, >>>>> continue; >>>>> } >>>>> >>>>> - refs = 0; >>>>> + refs = min3(pages_per_huge_page(h) - pfn_offset, >>>>> + (vma->vm_end - vaddr) >> PAGE_SHIFT, remainder); >>>>> >>>>> -same_page: >>>>> - if (pages) >>>>> - pages[i] = mem_map_offset(page, pfn_offset); >>>>> + if (pages || vmas) >>>>> + record_subpages_vmas(mem_map_offset(page, pfn_offset), >>>> >>>> The assumption made here is that mem_map is contiguous for the range of >>>> pages in the hugetlb page. I do not believe you can make this assumption >>>> for (gigantic) hugetlb pages which are > MAX_ORDER_NR_PAGES. For example, >>>> >> >> Thinking about this a bit more ... >> >> mem_map can be accessed contiguously if we have a virtual memmap. Correct? >> I suspect virtual memmap may be the most common configuration today. However, >> it seems we do need to handle other configurations. >> >>> That would mean get_user_pages_fast() and put_user_pages_fast() are broken for anything >>> handling PUDs or above? See record_subpages() in gup_huge_pud() or even gup_huge_pgd(). >>> It's using the same page++. >> >> Yes, I believe those would also have the issue. >> Cc: John and Jason as they have spent a significant amount of time in gup >> code recently. There may be something that makes that code safe? > > I'm looking at Matt's folio patches and see: > > +static inline struct folio *next_folio(struct folio *folio) > +{ > + return folio + folio_nr_pages(folio); > +} > > And checking page_trans_huge_mapcount(): > > for (i = 0; i < thp_nr_pages(page); i++) { > mapcount = atomic_read(&page[i]._mapcount) + 1; > > And we have the same logic in hmm_vma_walk_pud(): > > if (pud_huge(pud) && pud_devmap(pud)) { > pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); > for (i = 0; i < npages; ++i, ++pfn) > hmm_pfns[i] = pfn | cpu_flags; > > So, if page[n] does not access the tail pages of a compound we have > many more people who are surprised by this than just GUP. > > Where are these special rules for hugetlb compound tails documented? > Why does it need to be like this? The commit where this was first addressed/pointed out is 69d177c2fc70 "hugetlbfs: handle pages higher order than MAX_ORDER" from back in 2008. I only know about this because I stumbled upon it a few times in the hugetlb code. Although, it does not appear to be hugetlb specific. As pointed out by Joao, you can also see the differences in pfn_to_page for CONFIG_SPARSE_VMEMMAP and CONFIG_SPARSEMEM. The only time we might have issues is with CONFIG_SPARSEMEM. I would bet CONFIG_SPARSE_VMEMMAP is far more common. Cc: Dave as he has sparsemem history. -- Mike Kravetz > Isn't it saner to forbid a compound and its tails from being > non-linear in the page array? That limits when compounds can be > created, but seems more likely to happen than a full mm audit to find > all the places that assume linearity.