On Wed, Oct 13, 2021 at 08:18:08PM +0100, Joao Martins wrote: > On 10/13/21 18:41, Jason Gunthorpe wrote: > > On Mon, Oct 11, 2021 at 04:53:29PM +0100, Joao Martins wrote: > >> On 10/8/21 12:54, Jason Gunthorpe wrote: > > > >>> The only optimization that might work here is to grab the head, then > >>> compute the extent of tail pages and amalgamate them. Holding a ref on > >>> the head also secures the tails. > >> > >> How about pmd_page(orig) / pud_page(orig) like what the rest of hugetlb/thp > >> checks do? i.e. we would pass pmd_page(orig)/pud_page(orig) to __gup_device_huge() > >> as an added @head argument. While keeping the same structure of counting tail pages > >> between @addr .. @end if we have a head page. > > > > The right logic is what everything else does: > > > > page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); > > refs = record_subpages(page, addr, end, pages + *nr); > > head = try_grab_compound_head(pud_page(orig), refs, flags); > > > > If you can use this, or not, depends entirely on answering the > > question of why does __gup_device_huge() exist at all. > > > So for device-dax it seems to be an untackled oversight[*], probably > inherited from when fsdax devmap was introduced. What I don't know > is the other devmap users :( devmap generic infrastructure waits until all page refcounts go to zero, and it should wait until any fast GUP is serialized as part of the TLB shootdown - otherwise it is leaking access to the memory it controls beyond it's shutdown So, I don't think the different devmap users can break this? > > This I don't fully know: > > > > 1) As discussed quite a few times now, the entire get_dev_pagemap > > stuff looks usless and should just be deleted. If you care about > > optimizing this I would persue doing that as it will give the > > biggest single win. > > I am not questioning the well-deserved improvement -- but from a pure > optimization perspective the get_dev_pagemap() cost is not > visible and quite negligeble. You are doing large enough GUPs then that the expensive xarray seach is small compared to the rest? > > 2) It breaks up the PUD/PMD into tail pages and scans them all > > Why? Can devmap combine multiple compound_head's into the same PTE? > > I am not aware of any other usage of compound pages for devmap struct pages > than this series. At least I haven't seen device-dax or fsdax using > this. Let me ask this question differently, is this assertion OK? --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -808,8 +808,13 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, } entry = pmd_mkhuge(pfn_t_pmd(pfn, prot)); - if (pfn_t_devmap(pfn)) + if (pfn_t_devmap(pfn)) { + struct page *pfn_to_page(pfn); + + WARN_ON(compound_head(page) != page); + WARN_ON(compound_order(page) != PMD_SHIFT); entry = pmd_mkdevmap(entry); + } if (write) { entry = pmd_mkyoung(pmd_mkdirty(entry)); entry = maybe_pmd_mkwrite(entry, vma); (and same for insert_pfn_pud) You said it is OK for device/dax/device.c? And not for fs/dax.c? > Unless HMM does this stuff, or some sort of devmap page migration? P2PDMA > doesn't seem to be (yet?) caught by any of the GUP path at least before > Logan's series lands. Or am I misunderstanding things here? Of the places that call the insert_pfn_pmd/pud call chains I only see device/dax/device.c and fs/dax.c as being linked to devmap. So other devmap users don't use this stuff. > I was changing __gup_device_huge() with similar to the above, but yeah > it follows that algorithm as inside your do { } while() (thanks!). I can > turn __gup_device_huge() into another (renamed to like try_grab_pages()) > helper and replace the callsites of gup_huge_{pud,pmd} for the THP/hugetlbfs > equivalent handling. I suppose it should be some #define because the (pmd_val != orig) logic is not sharable But, yes, a general call that the walker should make at any level to record a pfn -> npages range efficiently. > I think the right answer is "depends on the devmap" type. device-dax with > PMD/PUDs (i.e. 2m pagesize PMEM or 1G pagesize pmem) works with the same > rules as hugetlbfs. fsdax not so much (as you say above) but it would > follow up changes to perhaps adhere to similar scheme (not exactly sure > how do deal with holes). HMM I am not sure what the rules are there. > P2PDMA seems not applicable? I would say, not "depends on the devmap", but what are the rules for calling insert_pfn_pmd in the first place. If users are allowed the create pmds that span many compound_head's then we need logic as I showed. Otherwise we do not. And I would document this relationship in the GUP side "This do/while is required because insert_pfn_pmd/pud() is used with compound pages smaller than the PUD/PMD size" so it isn't so confused with just "devmap" Jason