Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

Peter Xu <peterx@xxxxxxxxxx> · Wed, 8 Feb 2023 11:16:02 -0500

On Tue, Feb 07, 2023 at 04:26:02PM -0800, James Houghton wrote:
> On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > James,
> >
> > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > > > Here is the result: [1] (sorry it took a little while heh). The
> >
> > Thanks.  From what I can tell, that number shows that it'll be great we
> > start with your rfcv1 mapcount approach, which mimics what's proposed by
> > Matthew for generic folio.
> 
> Do you think the RFC v1 way is better than doing the THP-like way
> *with the additional MMU notifier*?

What's the additional MMU notifier you're referring?

> 
> >
> > > > implementation of the "RFC v1" way is pretty horrible[2] (and this
> >
> > Any more information on why it's horrible? :)
> 
> I figured the code would speak for itself, heh. It's quite complicated.
> 
> I really didn't like:
> 1. The 'inc' business in copy_hugetlb_page_range.
> 2. How/where I call put_page()/folio_put() to keep the refcount and
> mapcount synced up.
> 3. Having to check the page cache in UFFDIO_CONTINUE.

I think the complexity is one thing which I'm fine with so far.  However
when I think again about the things behind that complexity, I noticed there
may be at least one flaw that may not be trivial to work around.

It's about truncation.  The problem is now we use the pgtable entry to
represent the mapcount, but the pgtable entry cannot be zapped easily,
unless vma unmapped or collapsed.

It means e.g. truncate_inode_folio() may stop working for hugetlb (of
course, with page lock held).  The mappings will be removed for real, but
not the mapcount for HGM anymore, because unmap_mapping_folio() only zaps
the pgtable leaves, not the ones that we used to account for mapcounts.

So the kernel may see weird things, like mapcount>0 after
truncate_inode_folio() being finished completely.

For HGM to do the right thing, we may want to also remove the non-leaf
entries when truncating or doing similar things like a rmap walk to drop
any mappings for a page/folio.  Though that's not doable for now because
the locks that truncate_inode_folio() is weaker than what we need to free
the pgtable non-leaf entries - we'll need mmap write lock for that, the
same as when we unmap or collapse.

Matthew's design doesn't have such issue if the ptes need to be populated,
because mapcount is still with the leaves; not the case for us here.

If that's the case, _maybe_ we still need to start with the stupid but
working approach of subpage mapcounts.

[...]

> > > > Matthew is trying to solve the same problem with THPs right now: [3].
> > > > I haven't figured out how we can apply Matthews's approach to HGM
> > > > right now, but there probably is a way. (If we left the mapcount
> > > > increment bits in the same place, we couldn't just check the
> > > > hstate-level PTE; it would have already been made present.)
> >
> > I'm just worried that (1) this may add yet another dependency to your work
> > which is still during discussion phase, and (2) whether the folio approach
> > is easily applicable here, e.g., we may not want to populate all the ptes
> > for hugetlb HGMs by default.
> 
> That's true. I definitely don't want to wait for this either. It seems
> like Matthew's approach won't work very well for us -- when doing a
> lot of high-granularity UFFDIO_CONTINUEs on a 1G page, checking all
> the PTEs to see if any of them are mapped would get really slow.

I think it'll be a common problem to userfaultfd when it comes, e.g.,
userfaultfd by design is PAGE_SIZE based so far.  It needs page size
granule on pgtable manipulations, unless we extend the userfaultfd protocol
to support folios, iiuc.

> 
> >
> > > >
> > > > We could:
> > > > - use the THP-like way and tolerate ~1 second collapses
> > >
> > > Another thought here. We don't necessarily *need* to collapse the page
> > > table mappings in between mmu_notifier_invalidate_range_start() and
> > > mmu_notifier_invalidate_range_end(), as the pfns aren't changing,
> > > we aren't punching any holes, and we aren't changing permission bits.
> > > If we had an MMU notifier that simply informed KVM that we collapsed
> > > the page tables *after* we finished collapsing, then it would be ok
> > > for hugetlb_collapse() to be slow.
> >
> > That's a great point!  It'll definitely apply to either approach.
> >
> > >
> > > If this MMU notifier is something that makes sense, it probably
> > > applies to MADV_COLLAPSE for THPs as well.
> >
> > THPs are definitely different, mmu notifiers should be required there,
> > afaict.  Isn't that what the current code does?
> >
> > See collapse_and_free_pmd() for shmem and collapse_huge_page() for anon.
> 
> Oh, yes, of course, MADV_COLLAPSE can actually move things around and
> properly make THPs. Thanks. But it would apply if we were only
> collapsing PTE-mapped THPs, I think?

Yes it applies I think.  And if I'm not wrong it's also doing so. :) See
collapse_pte_mapped_thp().

While for anon we always allocate a new page, hence not applicable.

-- 
Peter Xu