Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

Peter Xu <peterx@xxxxxxxxxx> · Tue, 7 Feb 2023 18:13:21 -0500

James,

On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > Here is the result: [1] (sorry it took a little while heh). The

Thanks.  From what I can tell, those numbers suggest it would be great to
start with your RFC v1 mapcount approach, which mimics what Matthew
proposed for generic folios.

> > implementation of the "RFC v1" way is pretty horrible[2] (and this

Any more information on why it's horrible? :)

One quick comment: I wonder whether the "should we boost the mapcount"
value could be hidden in hugetlb_pte, so that you don't need to pass a lot
of bool * parameters deep into the HGM walk routines.
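
To make that concrete, here is a minimal userspace sketch, not the
series' code.  Everything in it is hypothetical: struct hpte_sketch only
stands in for struct hugetlb_pte, and the boost_mapcount field and both
helpers are made up for illustration.  It assumes the walk is the one
place that still knows whether the hstate-level entry was none before it
got populated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct hugetlb_pte; the real layout differs. */
struct hpte_sketch {
	unsigned long *ptep;	/* placeholder for the pte pointer */
	unsigned int shift;	/* size of this mapping, as a shift */
	bool boost_mapcount;	/* hypothetical: filled in by the walk */
};

/*
 * Hypothetical walk step.  Assumption: the walk still knows whether the
 * hstate-level entry was none before it was populated, so it records
 * the answer in the hpte instead of writing through a bool * out-param.
 */
static void hgm_walk_sketch(struct hpte_sketch *hpte, bool hstate_pte_was_none)
{
	assert(hpte != NULL);
	hpte->boost_mapcount = hstate_pte_was_none;
}

/* Caller side: nothing extra is threaded through the walk; read the hpte. */
static int mapcount_delta_sketch(const struct hpte_sketch *hpte)
{
	return hpte->boost_mapcount ? 1 : 0;
}
```

The caller would then consult the hpte once the walk returns, and the
intermediate walk routines keep their current signatures.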

> > implementation probably has bugs anyway; it doesn't account for the
> > folio_referenced() problem).

I thought we had reached consensus on the resolution: the proposal to
remove folio_referenced_arg.mapcount.  Is that not working for some reason?

> >
> > Matthew is trying to solve the same problem with THPs right now: [3].
> > I haven't figured out how we can apply Matthew's approach to HGM
> > right now, but there probably is a way. (If we left the mapcount
> > increment bits in the same place, we couldn't just check the
> > hstate-level PTE; it would have already been made present.)

I'm just worried that (1) this may add yet another dependency, one that is
still in the discussion phase, to your work, and (2) the folio approach may
not be easily applicable here, e.g., we may not want to populate all the
ptes for hugetlb HGMs by default.

> >
> > We could:
> > - use the THP-like way and tolerate ~1 second collapses
> 
> Another thought here. We don't necessarily *need* to collapse the page
> table mappings in between mmu_notifier_invalidate_range_start() and
> mmu_notifier_invalidate_range_end(), as the pfns aren't changing,
> we aren't punching any holes, and we aren't changing permission bits.
> If we had an MMU notifier that simply informed KVM that we collapsed
> the page tables *after* we finished collapsing, then it would be ok
> for hugetlb_collapse() to be slow.

That's a great point!  It'll definitely apply to either approach.
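
Just to spell out the ordering, a rough userspace sketch follows.  None
of these names exist in the kernel: the sketch_* helpers, the event log
and the callback type are all hypothetical, and the only thing it shows
is "do the slow collapse first, tell the secondary MMU afterwards".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event log so the ordering can be inspected; not kernel code. */
enum sketch_event { EV_COLLAPSE_DONE = 1, EV_NOTIFIED = 2 };

struct sketch_log {
	enum sketch_event events[4];
	size_t count;
};

static void sketch_log_add(struct sketch_log *log, enum sketch_event ev)
{
	assert(log != NULL);
	if (log->count < sizeof(log->events) / sizeof(log->events[0]))
		log->events[log->count++] = ev;
}

/* Hypothetical hook a secondary MMU user (e.g. KVM) might register. */
typedef void (*sketch_collapsed_fn)(struct sketch_log *log,
				    unsigned long start, unsigned long end);

static void sketch_kvm_collapsed(struct sketch_log *log,
				 unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	sketch_log_add(log, EV_NOTIFIED);
}

/*
 * Hypothetical collapse.  Assumption from the quoted text: pfns and
 * permission bits do not change and no holes are punched, so the slow
 * part runs with no invalidate start/end pair around it.
 */
static void sketch_collapse(struct sketch_log *log, unsigned long start,
			    unsigned long end, sketch_collapsed_fn notify)
{
	/* Slow part: merge the small mappings into one large mapping. */
	sketch_log_add(log, EV_COLLAPSE_DONE);

	/* Only now tell the secondary MMU that larger mappings are possible. */
	if (notify)
		notify(log, start, end);
}
```

With that ordering the secondary MMU is never blocked while the collapse
runs, which is why a slow hugetlb_collapse() would be tolerable.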

> 
> If this MMU notifier is something that makes sense, it probably
> applies to MADV_COLLAPSE for THPs as well.

THPs are definitely different; mmu notifiers should be required there,
afaict.  Isn't that what the current code does?

See collapse_and_free_pmd() for shmem and collapse_huge_page() for anon.

> 
> 
> > - use the (non-RFC) v1 way and tolerate the migration/smaps differences
> > - use the RFC v1 way and tolerate the complicated mapcount accounting
> > - flesh out [3] and see if it can be applied to HGM nicely
> >
> > I'm happy to go with any of these approaches.
> >
> > [1]: https://pastebin.com/raw/hJzFJHiD
> > [2]: https://github.com/48ca/linux/commit/4495f16a09b660aff44b3edcc125aa3a3df85976
> > [3]: https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@xxxxxxxxxxxxxxxxxxxx/
> 
> - James
> 

-- 
Peter Xu