On 10/13/21 20:43, Jason Gunthorpe wrote:
> On Wed, Oct 13, 2021 at 08:18:08PM +0100, Joao Martins wrote:
>> On 10/13/21 18:41, Jason Gunthorpe wrote:
>>> On Mon, Oct 11, 2021 at 04:53:29PM +0100, Joao Martins wrote:
>>>> On 10/8/21 12:54, Jason Gunthorpe wrote:
>>>
>>>>> The only optimization that might work here is to grab the head, then
>>>>> compute the extent of tail pages and amalgamate them. Holding a ref on
>>>>> the head also secures the tails.
>>>>
>>>> How about pmd_page(orig) / pud_page(orig), like what the rest of the
>>>> hugetlb/THP checks do? i.e. we would pass pmd_page(orig)/pud_page(orig)
>>>> to __gup_device_huge() as an added @head argument, while keeping the
>>>> same structure of counting tail pages between @addr .. @end if we have
>>>> a head page.
>>>
>>> The right logic is what everything else does:
>>>
>>> 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
>>> 	refs = record_subpages(page, addr, end, pages + *nr);
>>> 	head = try_grab_compound_head(pud_page(orig), refs, flags);
>>>
>>> Whether you can use this or not depends entirely on answering the
>>> question of why __gup_device_huge() exists at all.
>>>
>> So for device-dax it seems to be an untackled oversight[*], probably
>> inherited from when fsdax devmap was introduced. What I don't know
>> is the other devmap users :(
>
> devmap generic infrastructure waits until all page refcounts go to
> zero, and it should wait until any fast GUP is serialized as part of
> the TLB shootdown - otherwise it is leaking access to the memory it
> controls beyond its shutdown.
>
> So, I don't think the different devmap users can break this?
>
Maybe fsdax does not honor that after removing the pgmap refcount (as we
talked about earlier in the thread) without the memory-failures stuff.

But my point earlier about the "oversight" was that we describe PMDs/PUDs
with base pages, and hence why we can't do this for devmap:

	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
	refs = record_subpages(page, addr, end, pages + *nr);
	head = try_grab_compound_head(pud_page(orig), refs, flags);

>>> This I don't fully know:
>>>
>>> 1) As discussed quite a few times now, the entire get_dev_pagemap
>>>    stuff looks useless and should just be deleted. If you care about
>>>    optimizing this I would pursue doing that, as it will give the
>>>    biggest single win.
>>
>> I am not questioning the well-deserved improvement - but from a pure
>> optimization perspective the get_dev_pagemap() cost is not visible
>> and is quite negligible.
>
> You are doing large enough GUPs that the expensive xarray search
> is small compared to the rest?
>
The tests I showed are on 16G and 128G (I usually test with >1TB). But we
are comparing 1 pgmap lookup against 512 atomic refcount updates for a
PMD-sized pin (2M). It depends on what you consider small: maybe for
4K-16K of memory pinned, the pgmap lookup is the more expensive part.

A couple of months ago I measured xarray lookups at around ~150ns per
lookup with few entries, excluding the ref update. I could be pedantic
and give you a more concrete measurement for get_dev_pagemap(), but quite
honestly I don't think it will be close when we pin a hugepage. This is
orthogonal to the pgmap ref's existence, though; I was merely commenting
on the performance aspect.

>>> 2) It breaks up the PUD/PMD into tail pages and scans them all.
>>>    Why? Can devmap combine multiple compound heads into the same PTE?
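To make the "scans them all" part concrete, this is (from memory, lightly
simplified) what __gup_device_huge() does today in mm/gup.c - one
get_dev_pagemap() guard and one atomic refcount update per 4K PFN,
because without a compound head there is nothing else to pin:

	static int __gup_device_huge(unsigned long pfn, unsigned long addr,
				     unsigned long end, unsigned int flags,
				     struct page **pages, int *nr)
	{
		int nr_start = *nr;
		struct dev_pagemap *pgmap = NULL;

		do {
			struct page *page = pfn_to_page(pfn);

			/* xarray lookup; cached across iterations via @pgmap */
			pgmap = get_dev_pagemap(pfn, pgmap);
			if (unlikely(!pgmap)) {
				undo_dev_pagemap(nr, nr_start, flags, pages);
				return 0;
			}
			SetPageReferenced(page);
			pages[*nr] = page;
			/* one atomic refcount update per base page */
			if (unlikely(!try_grab_page(page, flags))) {
				undo_dev_pagemap(nr, nr_start, flags, pages);
				return 0;
			}
			(*nr)++;
			pfn++;
		} while (addr += PAGE_SIZE, addr != end);

		if (pgmap)
			put_dev_pagemap(pgmap);
		return 1;
	}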
>>
>> I am not aware of any other usage of compound pages for devmap struct
>> pages than this series. At least I haven't seen device-dax or fsdax
>> using this.
>
> Let me ask this question differently: is this assertion OK?
>
I understood the question. I was more trying to learn from you about the
HMM/P2PDMA use-cases, as you sounded like compound pages exist elsewhere
in devmap :)

> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -808,8 +808,13 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	}
>
>  	entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
> -	if (pfn_t_devmap(pfn))
> +	if (pfn_t_devmap(pfn)) {
> +		struct page *page = pfn_t_to_page(pfn);
> +
> +		WARN_ON(compound_head(page) != page);
> +		WARN_ON(compound_order(page) != PMD_SHIFT - PAGE_SHIFT);
>  		entry = pmd_mkdevmap(entry);
> +	}
>  	if (write) {
>  		entry = pmd_mkyoung(pmd_mkdirty(entry));
>  		entry = maybe_pmd_mkwrite(entry, vma);
>
> (and the same for insert_pfn_pud)
>
> You said it is OK for device/dax/device.c?
>
Yes, correct. (I also verified with a snippet similar to the above, to be
extra pedantic.)

I would like to emphasize that it is only OK *after this series*. Without
this series there is neither a compound order nor a head, nor any
compound page whatsoever.

> And not for fs/dax.c?
>
IIUC, correct. The fs/dax.c code is a little fuzzy on the sector-to-PFN
translation, but still: there are no compound pages in fsdax (nor do I
add any). So compound_order() doesn't exist (hence order 0) and there is
no head, which also means that compound_head(@page) returns @page (a tad
misleading as a definition of "head", given that a PMD subpage would not
return the first page of the PMD).

>
>> Unless HMM does this stuff, or some sort of devmap page migration?
>> P2PDMA doesn't seem to be (yet?) caught by any of the GUP paths, at
>> least before Logan's series lands. Or am I misunderstanding things here?
>
> Of the places that call the insert_pfn_pmd/pud call chains I only see
> device/dax/device.c and fs/dax.c as being linked to devmap. So other
> devmap users don't use this stuff.
>
>> I was changing __gup_device_huge() to something similar to the above,
>> but yeah, it follows the algorithm inside your do { } while() (thanks!).
>> I can turn __gup_device_huge() into another helper (renamed to something
>> like try_grab_pages()) and replace the callsites of gup_huge_{pud,pmd}
>> for the THP/hugetlbfs equivalent handling.
>
> I suppose it should be some #define because the (pmd_val != orig) logic
> is not sharable
>
I wasn't going to have that pmd_val() in there, just this refcount part:

	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
	refs = record_subpages(page, addr, end, pages + *nr);
	head = try_grab_compound_head(pud_page(orig), refs, flags);

with __gup_device_huge() melded in.

> But, yes, a general call that the walker should make at any level to
> record a pfn -> npages range efficiently.
>
>> I think the right answer is "depends on the devmap type". device-dax
>> with PMDs/PUDs (i.e. 2M pagesize or 1G pagesize pmem) works with the
>> same rules as hugetlbfs. fsdax not so much (as you say above), but it
>> would take follow-up changes to perhaps adhere to a similar scheme (not
>> exactly sure how to deal with holes). For HMM I am not sure what the
>> rules are. P2PDMA seems not applicable?
>
> I would say, not "depends on the devmap", but what are the rules for
> calling insert_pfn_pmd in the first place.
>
I am going to translate that into: HMM and P2PDMA aren't affected so far,
as you also said earlier above. Even after Logan's P2PDMA NVMe series it
appears to always be PTEs.
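For the record, the melded helper I have in mind is roughly the following
(an untested sketch; try_grab_pages() is just a placeholder name, and the
callers would compute the first subpage with the level's mask, e.g.
page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT)):

	static int try_grab_pages(struct page *head, struct page *page,
				  unsigned long addr, unsigned long end,
				  unsigned int flags, struct page **pages,
				  int *nr)
	{
		int refs;

		/* record every subpage covering @addr .. @end */
		refs = record_subpages(page, addr, end, pages + *nr);

		/* a single atomic update on the head pins all the tails */
		head = try_grab_compound_head(head, refs, flags);
		if (!head)
			return 0;

		*nr += refs;
		return 1;
	}

gup_huge_pmd()/gup_huge_pud() would keep their (pmd_val != orig) rechecks
at the callsite, as you said.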
Coming back to the users: fsdax and device-dax are the only pmd_devmap
users we should care about, and for pud_devmap() only device-dax.

> If users are allowed to create PMDs that span many compound heads
> then we need logic as I showed. Otherwise we do not.
>
/me nods.

> And I would document this relationship in the GUP side "This do/while
> is required because insert_pfn_pmd/pud() is used with compound pages
> smaller than the PUD/PMD size" so it isn't so confused with just
> "devmap"
>
Also, it's not that PMDs span multiple compound heads; it's that PMDs/PUDs
use just base pages. Compound pages/heads in devmap are only introduced by
this series, and for device-dax only.
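To spell that out (an illustration only, assuming x86-64 with a 2M PMD,
i.e. 512 base pages, and @orig being the devmap PMD value seen by the
fast-GUP walker; these assertions describe what holds today, without this
series):

	struct page *head = pmd_page(orig);
	struct page *sub = head + 1;	/* any subpage within the PMD */

	/* fsdax, and device-dax before this series: 512 independent
	 * order-0 pages, so there is no head to amalgamate refs on */
	WARN_ON(compound_order(head) != 0);
	WARN_ON(compound_head(sub) != sub);	/* each base page is its own "head" */

After this series, for device-dax, compound_order(head) would instead be
PMD_SHIFT - PAGE_SHIFT and compound_head(sub) would resolve back to
@head, which is what lets the gup_huge_pmd()-style grab of the head work.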