On 8/30/21 2:07 PM, Jason Gunthorpe wrote: > On Fri, Aug 27, 2021 at 07:34:54PM +0100, Joao Martins wrote: >> On 8/27/21 5:25 PM, Jason Gunthorpe wrote: >>> On Fri, Aug 27, 2021 at 03:58:13PM +0100, Joao Martins wrote: >>> >>>> #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE) >>>> static int __gup_device_huge(unsigned long pfn, unsigned long addr, >>>> unsigned long end, unsigned int flags, >>>> struct page **pages, int *nr) >>>> { >>>> - int nr_start = *nr; >>>> + int refs, nr_start = *nr; >>>> struct dev_pagemap *pgmap = NULL; >>>> int ret = 1; >>>> >>>> do { >>>> - struct page *page = pfn_to_page(pfn); >>>> + struct page *head, *page = pfn_to_page(pfn); >>>> + unsigned long next = addr + PAGE_SIZE; >>>> >>>> pgmap = get_dev_pagemap(pfn, pgmap); >>>> if (unlikely(!pgmap)) { >>>> @@ -2252,16 +2265,25 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr, >>>> ret = 0; >>>> break; >>>> } >>>> - SetPageReferenced(page); >>>> - pages[*nr] = page; >>>> - if (unlikely(!try_grab_page(page, flags))) { >>>> - undo_dev_pagemap(nr, nr_start, flags, pages); >>>> + >>>> + head = compound_head(page); >>>> + /* @end is assumed to be limited at most one compound page */ >>>> + if (PageHead(head)) >>>> + next = end; >>>> + refs = record_subpages(page, addr, next, pages + *nr); >>>> + >>>> + SetPageReferenced(head); >>>> + if (unlikely(!try_grab_compound_head(head, refs, flags))) { >>>> + if (PageHead(head)) >>>> + ClearPageReferenced(head); >>>> + else >>>> + undo_dev_pagemap(nr, nr_start, flags, pages); >>>> ret = 0; >>>> break; >>> >>> Why is this special cased for devmap? >>> >>> Shouldn't everything processing pud/pmds/etc use the same basic loop >>> that is similar in idea to the 'for_each_compound_head' scheme in >>> unpin_user_pages_dirty_lock()? >>> >>> Doesn't that work for all the special page type cases here? >> >> We are iterating over PFNs to create an array of base pages (regardless of page table >> type), rather than iterating over an array of pages to work on. > > That is part of it, yes, but the slow bit here is to minimally find > the head pages and do the atomics on them, much like the > unpin_user_pages_dirty_lock() > > I would think this should be designed similar to how things work on > the unpin side. > I don't think it's the same thing. The bit you say 'minimally find the head pages' carries a visible overhead in unpin_user_pages() as we are checking each of the pages belongs to the same head page -- because you can pass an arbritary set of pages. This does have a cost which is not in gup-fast right now AIUI. Whereas in our gup-fast 'handler' you already know that you are processing a contiguous chunk of pages. If anything, we are closer to unpin_user_page_range*() than unpin_user_pages(). > Sweep the page tables to find a proper start/end - eg even if a > compound is spread across multiple pte/pmd/pud/etc we should find a > linear range of starting PFN (ie starting page*) and npages across as > much of the page tables as we can manage. This is the same as where > things end up in the unpin case where all the contiguous PFNs are > grouped togeher into a range. > > Then 'assign' that range to the output array which requires walking > over each compount_head in the range and pinning it, then writing out > the tail pages to the output struct page array. > > And this approach should apply universally no matter what is under the > pte's - ie huge pages, THPs and devmaps should all be treated the same > way. Currently each case is different, like above which is unique to > device_huge. > Only devmap gup-fast is different IIUC. Switching to similar iteration logic to unpin would look something like this (still untested): for_each_compound_range(index, &page, npages, head, refs) { pgmap = get_dev_pagemap(pfn + *nr, pgmap); if (unlikely(!pgmap)) { undo_dev_pagemap(nr, nr_start, flags, pages); ret = 0; break; } SetPageReferenced(head); if (unlikely(!try_grab_compound_head(head, refs, flags))) { if (PageHead(head)) ClearPageReferenced(head); else undo_dev_pagemap(nr, nr_start, flags, pages); ret = 0; break; } record_subpages(page + *nr, addr, addr + (refs << PAGE_SHIFT), pages + *nr); *(nr) += refs; addr += (refs << PAGE_SHIFT); } But it looks to be a tidbit more complex and not really aligning with the rest of gup-fast. All in all, I am dealing with the fact that 1) devmap pmds/puds may not be represented with compound pages and 2) we temporarily grab dev_pagemap reference prior to pinning the page. Those two items is what makes this different than THPs/HugeTLB (which do have the same logic). And thus it's what lead me to *slightly* improve gup_device_huge().