Re: [PATCH 00/45] hugetlb pagewalk unification

David Hildenbrand <david@xxxxxxxxxx> · Wed, 10 Jul 2024 05:52:43 +0200

On 08.07.24 10:18, Oscar Salvador wrote:
On Thu, Jul 04, 2024 at 05:23:30PM +0200, David Hildenbrand wrote:
My thinking was if "remove hugetlb_entry" cannot wait for "remove
page_walk", because we found a reasonable way to do it better and convert
the individual users. Maybe it can't.

I've not given up hope that we can end up with something better and clearer
than the current page_walk API :)

Hi David,

Hi!

I agree that the current page_walk might be a bit convoluted, and that the
indirect functions approach is a bit of a hassle.
Having said that, let me clarify something.

Although this patchset touches the page_walk API wrt. getting rid of
hugetlb special casing all over the place, my goal was not as focused on
the page_walk as it was on the hugetlb code to gain hability to be
interpreted on PUD/PMD level.

I understand that. And it would all be easier+more straight forward if 
we wouldn't have that hugetlb CONT-PTE / CONT-PMD stuff in there that 
works similar, but different to "ordinary" cont-pte for thp.

I'm sure you stumbled over the set_huge_pte_at() on arm64 for example. 
If we, at one point *don't* use these hugetlb functions right now to 
modify hugetlb entries, we might be in trouble.

That's why I think we should maybe invest our time and effort in having 
a new pagewalker that will just batch such things naturally, and users 
that can operate on that naturally. For example: a hugetlb 
cont-pte-mapped folio will just naturally be reported as a "fully mapped 
folio", just like a THP would be if mapped in a compatible way.

Yes, this requires more work, but as raised in some patches here, 
working on individual PTEs/PMDs for hugetlb is problematic.

You have to batch every operation, to essentially teach ordinary code to 
do what the hugetlb_* special code would have done on cont-pte/cont-pmd 
things.

(as a side note, cont-pte/cont-pmd should primarily be a hint from arch 
code on how many entries we can batch, like we do in folio_pte_batch(); 
point is that we want to batch also on architectures where we don't have 
such bits, and prepare for architectures that implement various sizes of 
batching; IMHO, having cont-pte/cont-pmd checks in common code is likely 
the wrong approach. Again, folio_pte_batch() is where we tackled the 
problem differently from the THP perspective)

One of the things, among other things, that helped in creating this
mess/duplication we have wrt. hugetlb code vs mm core is that hugetlb
__always__ operates on ptes, which means that we cannot rely on the mm
core to do the right thing, and we need a bunch of hugetlb-pte functions
that knows about their thing, so we lean on that.

IMHO, that was a mistake to start with, but I was not around when it was
introduced and maybe there were good reasons to deal with that the way
it is done.
But, the thing is that my ultimate goal, is for hugetlb code to be able
to deal with PUD/PMD (pte and cont-pte is already dealt with) just like
mm core does for THP (PUD is not supported by THP, but you get me), and
that is not that difficult to do, as this patchset tries to prove.

Of course, for hugetlb to gain the hability to operate on PUD/PMD, this
means we need to add a fairly amount of code. e.g: for operating
hugepages on PUD level, code for markers on PUD/PMD level for
uffd/poison stuff (only dealt
on pmd/pte atm AFAIK), swap functions for PUD (is_swap_pud for PUD), etc.
Basically, almost all we did for PMD-* stuff we need it for PUD as well,
and that will be around when THP gains support for PUD if it ever does
(I guess that in a few years if memory capacity keeps increasing).

E.g: pud_to_swp_entry to detect that a swp entry is hwpoison with
      is_hwpoison_entry

Yes, it is a hassle to have more code around, but IMO, this new code
will help us in 1) move away from __always__ operate on ptes 2) ease
integrate hugetlb code into mm core.

I will keep working on this patchset not because of pagewalk savings,
but because I think it will help us in have hugetlb more mm-core ready,
since the current pagewalk has to test that a hugetlb page can be
properly read on PUD/PMD/PTE level no matter what: uffd for hugetlb on PUD/PMD,
hwpoison entries for swp on PUD/PMD, pud invalidating, etc.

If that gets accomplished, I think that a fair amount of code that lives
in hugetlb.c can be deleted/converted as less special casing will be needed.

I might be wrong and maybe I will hit a brick wall, but hopefully not.

I have an idea for a better page table walker API that would try 
batching most entries (under one PTL), and walkers can just register for 
the types they want. Hoping I will find some time to at least scetch the 
user interface soon.

That doesn't mean that this should block your work, but the 
cont-pte/cont/pmd hugetlb stuff is really nasty to handle here, and I 
don't particularly like where this is going.

--
Cheers,

David / dhildenb