Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages

On 10/13/21 18:41, Jason Gunthorpe wrote:
> On Mon, Oct 11, 2021 at 04:53:29PM +0100, Joao Martins wrote:
>> On 10/8/21 12:54, Jason Gunthorpe wrote:
> 
>>> The only optimization that might work here is to grab the head, then
>>> compute the extent of tail pages and amalgamate them. Holding a ref on
>>> the head also secures the tails.
>>
>> How about pmd_page(orig) / pud_page(orig) like what the rest of hugetlb/thp
>> checks do? i.e. we would pass pmd_page(orig)/pud_page(orig) to __gup_device_huge()
>> as an added @head argument. While keeping the same structure of counting tail pages
>> between @addr .. @end if we have a head page.
> 
> The right logic is what everything else does:
> 
> 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> 	refs = record_subpages(page, addr, end, pages + *nr);
> 	head = try_grab_compound_head(pud_page(orig), refs, flags);
> 
> If you can use this, or not, depends entirely on answering the
> question of why does  __gup_device_huge() exist at all.
> 
So for device-dax it seems to be an unaddressed oversight[*], probably
inherited from when the fsdax devmap support was introduced. What I don't
know is the situation for the other devmap users :(

[*] it has all the same properties as hugetlbfs AFAIU (after this series)

Certainly, if every devmap PMD/PUD were represented by a single compound
page like THP/hugetlbfs, then this patch would just be a matter of removing
the pgmap ref grab (and nuking __gup_device_huge() entirely, as I
suggested earlier).

> This I don't fully know:
> 
> 1) As discussed quite a few times now, the entire get_dev_pagemap
>    stuff looks useless and should just be deleted. If you care about
>    optimizing this I would pursue doing that as it will give the
>    biggest single win.
> 
I am not questioning that removing it would be a worthwhile cleanup -- but
from a pure optimization perspective the get_dev_pagemap() cost is not
visible and is quite negligible. It is done once and only once, and
subsequent calls to get_dev_pagemap() with a non-NULL pgmap don't touch
the refcount and just return the pgmap object. And the xarray storing
the range -> pgmap mappings won't be that big ... perhaps a dozen pgmaps
on a large >1T pmem system, depending on your DIMM size.

The per-4K-page refcount update is what introduces a seriously
prohibitive cost: I am seeing 10x the cost with DRAM-located struct
pages (pmem-located struct pages are even more ludicrous).
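
To make the shape of that cost concrete, here is a rough sketch (not the
patch itself) using the existing try_grab_page()/try_grab_compound_head()
helpers from mm/gup.c -- one atomic RMW per 4K page on the devmap path
today versus a single atomic RMW on the compound head:

/*
 * Sketch only: roughly what __gup_device_huge() pays today,
 * one atomic RMW per 4K page.
 */
static int grab_each_base_page(struct page *page, int npages,
			       unsigned int flags)
{
	int i;

	for (i = 0; i < npages; i++)
		if (!try_grab_page(page + i, flags))
			return i;
	return npages;
}

/*
 * Sketch only: the gup_huge_pmd()/gup_huge_pud() way, a single atomic
 * RMW on the head; holding the head ref also secures the tails.
 */
static int grab_head_once(struct page *page, int npages,
			  unsigned int flags)
{
	return try_grab_compound_head(compound_head(page), npages,
				      flags) ? npages : 0;
}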

> 2) It breaks up the PUD/PMD into tail pages and scans them all
>    Why? Can devmap combine multiple compound_head's into the same PTE?


I am not aware of any usage of compound pages for devmap struct pages
other than this series. At least I haven't seen device-dax or fsdax using
them before it. Unless HMM does this stuff, or some sort of devmap page
migration? P2PDMA doesn't seem to be caught by any of the GUP paths (yet?),
at least before Logan's series lands. Or am I misunderstanding things here?

>    Does devmap guarantee that the PUD/PMD points to the head page? (I
>    assume no)
> 
For device-dax yes.


> But I'm looking at this some more and I see try_get_compound_head() is
> reading the compound_head with no locking, just READ_ONCE, so that
> must be OK under GUP.
> 
I suppose one other way to get around the double atomic op would be to
fail try_get_compound_head() by comparing the first tail page's
compound_head() against the head after grabbing the head with @refs,
instead of comparing the grabbed head page against the compound_head()
of the passed page argument.
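
Something along these lines (pure sketch of the idea as I described it,
reusing the page_cache_add_speculative()/put_page_refs() internals from
mm/gup.c, and assuming a higher-order compound page so head + 1 is a
tail):

/*
 * Sketch: take @refs on the head first, then look at the first tail's
 * compound_head() to detect a concurrent split, rather than re-reading
 * compound_head() of the passed page.
 */
static struct page *try_grab_head_check_tail(struct page *page, int refs)
{
	struct page *head = compound_head(page);

	if (!page_cache_add_speculative(head, refs))
		return NULL;
	if (unlikely(compound_head(head + 1) != head)) {
		/* a split raced with us, back out the refs */
		put_page_refs(head, refs);
		return NULL;
	}
	return head;
}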

> It still seems to me the same generic algorithm should work
> everywhere, once we get rid of the get_dev_pagemap
> 
>   start_pfn = pud/pmd_pfn() + pud/pmd_page_offset(addr)
>   end_pfn = start_pfn + (end - addr) // fixme
>   if (THP)
>      refs = end_pfn - start_pfn
>   if (devmap)
>      refs = 1
> 
>   do {
>      page = pfn_to_page(start_pfn)
>      head_page = try_grab_compound_head(page, refs, flags)
>      if (pud/pmd_val() != orig)
>         err
> 
>      npages = 1 << compound_order(head_page)
>      npages = min(npages, end_pfn - start_pfn)
>      for (i = 0, iter = page; i != npages) {
>      	 *pages++ = iter;
>          mem_map_next(iter, page, i)
>      }
> 
>      if (devmap && npages > 2)
>          grab_compound_head(head_page, npages - 1, flags)
>      start_pfn += npages;
>   } while (start_pfn != end_pfn)
> 
> Above needs to be cleaned up quite a bit, but is the general idea.
> 
> There is a further optimization we can put in where we know that
> 'page' is still in a currently grabbed compound (eg last_page+1 == page,
> not past compound_order) and defer the refcount work.
> 
I was already changing __gup_device_huge() into something similar to the
above, and yeah, it follows the same algorithm as your do { } while()
(thanks!). I can turn __gup_device_huge() into a renamed helper (something
like try_grab_pages()) and use it from the gup_huge_{pud,pmd} callsites so
THP/hugetlbfs get the equivalent handling.
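
Roughly along these lines (just a sketch of the helper's shape:
try_grab_pages() is a made-up name, the pXd_val(orig) re-check and the
undo-on-failure path are omitted, and the plain struct page pointer
arithmetic assumes a contiguous memmap):

static int try_grab_pages(unsigned long pfn, unsigned long addr,
			  unsigned long end, unsigned int flags,
			  struct page **pages, int *nr)
{
	unsigned long remaining = (end - addr) >> PAGE_SHIFT;

	do {
		struct page *page = pfn_to_page(pfn);
		struct page *head = compound_head(page);
		unsigned long refs;

		/* rest of this compound page, capped at the requested range */
		refs = min(compound_nr(head) - (page - head), remaining);

		/* one atomic op covering @refs subpages */
		if (!try_grab_compound_head(page, refs, flags))
			return 0;

		remaining -= refs;
		pfn += refs;
		while (refs--)
			pages[(*nr)++] = page++;
	} while (remaining);

	return 1;
}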

>> It's interesting how THP (in gup_huge_pmd()) unilaterally computes
>> tails assuming pmd_page(orig) is the head page.
> 
> I think this is an integral property of THP, probably not devmap/dax
> though?

I think the right answer is "it depends on the devmap type". device-dax
with PMDs/PUDs (i.e. 2M or 1G page size pmem) works with the same rules
as hugetlbfs. fsdax not so much (as you say above), but follow-up changes
could perhaps make it adhere to a similar scheme (not exactly sure how to
deal with holes). For HMM I am not sure what the rules are there.
P2PDMA seems not applicable?


