Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

Jason Gunthorpe <jgg@xxxxxxxxxx> · Tue, 26 Mar 2024 11:02:52 -0300

On Mon, Mar 25, 2024 at 02:58:48PM -0400, Peter Xu wrote:

> > This remark would be a little easier to understand if you refer to
> > hugetlb_walk() not huge_pte_offset() - the latter is only used to
> > implement hugetlb_walk() and isn't removed by this series, while a
> > single hugetlb_walk() was removed.
> 
> Right.  Here huge_pte_offset() is the arch API that I hope we can remove,
> the hugetlb_walk() is simply the wrapper.

But arguably hugetlb_walk is the thing that should go..

In the generic code we should really try to get away from the weird
hugetlb abuse of treating every level as a pte_t. That is not how the
generic page table API works and that weirdness is part of what
motivates the arch API to supply special functions for reading. Not
every arch can just cast every level to a pte_t and still work.

But that weirdness is also saving alot of code so something else needs
to be though up..

> > Regardless, I think the point is spot on, the direction should be to
> > make the page table reading generic with minimal/no interaction with
> > hugetlbfs in the generic code.
> 
> Yes, and I also like your terms on calling them "pgtable readers".  It's a
> better way to describe the difference in that regard between
> huge_pte_offset() v.s. huge_pte_alloc().  Exactly that's my goal, that we
> should start with the "readers".

Yeah, it makes alot of sense to tackle the readers first - we are
pretty close now to having enough done to have generic readers. I
would imagine tackling everything outside mm/huge*.c to use the normal
page table API for reading.

Then consider what to do with the reading in mm/huge*.c

> The writters might change semantics when merge, and can be more
> challenging, I'll need to look into details of each one, like page fault
> handlers.  Such work may need to be analyzed case by case, and this GUP
> part is definitely the low hanging fruit comparing to the rest.

The write side is tricky, I think if the read side is sorted out then
it will be easer to reason about the write side. Today the write side
is paired with the special read side and it is extra hard to
understand if there is something weird hidden in the arch.

> measurements too when getting there.  And btw, IIUC the major challenge of
> pagewalk.c is not the removal of walk_hugetlb_range() alone - that may not
> be that hard if that's the solo purpose.  The better way to go is to remove
> mm_walk_ops.hugetlb_entry() altogether, which will cause a change in all
> callers; that's "the challenge".. pretty much labor works, not a major
> technical challenge it seems.  Not sure if it's a good news or bad..

Ugh, that is a big pain. It is relying on that hugetlbfs trick of
passing in a pte that is not a pte to make the API generic..

The more I look at this the more I think we need to get to Matthew's
idea of having some kind of generic page table API that is not tightly
tied to level. Replacing the hugetlb trick of 'everything is a PTE'
with 5 special cases in every place seems just horrible.

   struct mm_walk_ops {
       int (*leaf_entry)(struct mm_walk_state *state, struct mm_walk *walk);
   }

And many cases really want something like:
   struct mm_walk_state state;

   if (!mm_walk_seek_leaf(state, mm, address))
          goto no_present
   if (mm_walk_is_write(state)) ..

And detailed walking:
   for_each_pt_leaf(state, mm, address) {
       if (mm_walk_is_write(state)) ..
   }

Replacing it with a mm_walk_state that retains the level or otherwise
to allow decoding any entry composes a lot better. Forced Loop
unrolling can get back to the current code gen in alot of places.

It also makes the power stuff a bit nicer as the mm_walk_state could
automatically retain back pointers to the higher levels in the state
struct too...

The puzzle is how to do it and still get reasonable efficient codegen,
many operations are going to end up switching on some state->level to
know how to decode the entry.

> One thing I'll soon look into is to allow huge mappings for PFNMAP; there's
> one request from VFIO side for MMIO. Dropping mm_walk_ops.hugetlb_entry()
> seems to be good for all such purposes; well, I may need to bite the bullet
> here.. for either of the purposes to move on.

That would be a nice feature!

> > If, or how much, the hugepd write side remains special is the open
> > question, I think.
> 
> It seems balls are rolling in that aspect, I haven't looked in depth there
> in Christophe's series but it's great to have started!

Yes!

Jason