Re: [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk

Christophe Leroy <christophe.leroy@xxxxxxxxxx> · Fri, 31 Jan 2025 16:42:57 +0100

On 31/01/2025 at 00:19, David Hildenbrand wrote:
On 30.01.25 22:36, Oscar Salvador wrote:
Hi,

last year Peter Xu did a presentation at LSF/MM on how to better
integrate hugetlb in the mm core.
There are several reasons we want to do that, but one could say that
the two that matter the most are 1) code duplication and 2) making
hugetlb less special.

During the last year several patches that went in that direction were
merged, e.g. gup hugetlb unify [1], mprotect for dax PUDs [2], hugetlb
into generic unmapping path [3], to name some.

There was also a concern on how to integrate hugetlb into the generic
pagewalk, which would get rid of a lot of code and give us a generic
path that could handle everything.
This was first worked on in [4] (very basic draft).

Although a second version is in the works, I would like to present
some concerns I have wrt. that work.

Hi Oscar,

HugeTLB has its own way of dealing with things.
E.g.: HugeTLB interprets everything as a pte: huge_pte_uffd_wp,
huge_pte_clear_uffd_wp, huge_pte_dirty, huge_pte_modify,
huge_pte_wrprotect etc.

One of the challenges that this raises is that if we want pmd/pud
walkers to be able to make sense of hugetlb stuff, we need to implement
pud/pmd (maybe some pmd we already have because of THP) variants of
those.

That's the easy case, I'm afraid. The real problem is cont-pte
constructs (or worse) abstracted by hugetlb to be a single unit
("hugetlb pte").

For "ordinary" pages, the cont-pte bit (as on arm64) is nowadays
transparently managed: you can modify any PTE part of the cont-gang and
it will just work as expected, transparently.

Not so with hugetlb, where you have to modify (or even query) the whole
thing.
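
To make the "modify (or even query) the whole thing" point concrete,
here is a minimal user-space sketch, not kernel code. It assumes a
hypothetical group of 16 contiguous entries and made-up flag bits; the
names model_pte_t, model_huge_ptep_get() and
model_huge_ptep_clear_flags() are invented for this example. It is
loosely modeled on how arm64 folds the dirty/young state of every
sub-entry when reading a contiguous hugetlb entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
#define CONT_PTES  16u                    /* assumed entries per contiguous group */
#define PTE_DIRTY  (UINT64_C(1) << 0)     /* made-up bit positions */
#define PTE_YOUNG  (UINT64_C(1) << 1)

typedef uint64_t model_pte_t;

/*
 * Query: the logical "hugetlb pte" is dirty/young if ANY physical
 * sub-entry is, so reading only the first entry is not enough.
 */
static model_pte_t model_huge_ptep_get(const model_pte_t *first)
{
    assert(first != NULL);

    model_pte_t pte = first[0];
    for (size_t i = 1; i < CONT_PTES; i++)
        pte |= first[i] & (PTE_DIRTY | PTE_YOUNG);
    return pte;
}

/*
 * Modify: a flag change has to be applied to every physical sub-entry
 * of the logical entry in the same operation.
 */
static void model_huge_ptep_clear_flags(model_pte_t *first, model_pte_t flags)
{
    assert(first != NULL);

    for (size_t i = 0; i < CONT_PTES; i++)
        first[i] &= ~flags;
}
```

A generic walker that only looks at, or only updates, the one entry it
happens to be standing on would miss state held in the other 15.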

For GUP it was easier, because it was able to grab all information it
needed from the sub-ptes fairly easily, and it doesn't modify any page
tables.

I ran into this problem with folio_walk, and had to document it rather
nastily:

  * WARNING: Modifying page table entries in hugetlb VMAs requires a lot of care.
  * For example, PMD page table sharing might require prior unsharing. Also,
  * logical hugetlb entries might span multiple physical page table entries,
  * which *must* be modified in a single operation (set_huge_pte_at(),
  * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
  * not correspond to the first physical entry of a logical hugetlb entry.

I wanted to use it to rewrite the uprobe code to also handle hugetlb
with less special casing, but that work has stalled so far. I think my
next attempt would rule out any non-pmd / non-pud hugetlb pages to make
it somewhat simpler.
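
The last sentence of the warning quoted above (the entry a walker holds
may not be the first physical entry of the logical hugetlb entry) can be
sketched in user space as follows. This is only an illustration under
stated assumptions: 4K base pages, a power-of-two huge page size, and a
logical entry that does not cross a page table boundary. struct
model_span and model_logical_span() are hypothetical names, not kernel
API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; assumes 4K base pages. */
#define MODEL_PAGE_SHIFT 12u
#define MODEL_PAGE_SIZE  (UINT64_C(1) << MODEL_PAGE_SHIFT)

struct model_span {
    uint64_t start_addr;  /* address covered by the first physical entry */
    size_t   first_idx;   /* table index of the first physical entry */
    size_t   nr_entries;  /* physical entries making up the logical entry */
};

/*
 * Given the address and table index of the entry the walker stopped on,
 * find the full run of physical entries that must be handled together.
 */
static struct model_span model_logical_span(uint64_t addr, size_t idx,
                                            uint64_t huge_size)
{
    struct model_span s;
    size_t offset;

    /* huge_size must be a power of two and at least one base page */
    assert(huge_size >= MODEL_PAGE_SIZE);
    assert((huge_size & (huge_size - 1)) == 0);

    s.nr_entries = (size_t)(huge_size >> MODEL_PAGE_SHIFT);
    s.start_addr = addr & ~(huge_size - 1);

    /* how many entries past the first one the walker is standing */
    offset = (size_t)((addr - s.start_addr) >> MODEL_PAGE_SHIFT);
    assert(idx >= offset);
    s.first_idx = idx - offset;
    return s;
}
```

For a 64K logical entry and a walker standing at address 0x12345000
(index 325 in its table), the run starts at index 320 and spans 16
entries.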

It all gets weird with things like:

commit 0549e76663730235a10395a7af7ad3d3ce6e2402
Author: Christophe Leroy <christophe.leroy@xxxxxxxxxx>
Date:   Tue Jul 2 15:51:25 2024 +0200

     powerpc/8xx: rework support for 8M pages using contiguous PTE entries

     In order to fit better with standard Linux page tables layout, add support
     for 8M pages using contiguous PTE entries in a standard page table.  Page
     tables will then be populated with 1024 similar entries and two PMD
     entries will point to that page table.

     The PMD entries also get a flag to tell it is addressing an 8M page, this
     is required for the HW tablewalk assistance.

Where we are walking a PTE table, but actually there is another PTE
table we have to modify in the same go.

Very hard to make that non-hugetlb aware, as it's simply completely
different compared to ordinary page table walking/modifications today.

Maybe there are ideas to tackle that, and I'd be very interested in them.

But at least that 8xx change allowed us to get rid of huge page
directories (hugepd), which were even more painful IIUC.

Nevertheless, can't we turn that into a standard walk one way or another?

While we walk, we reach a PMD entry which is marked as CONT-PMD but is
not tagged as a leaf entry, so there is a page table below. PMD_SIZE is
4M but the page size is 8M, so once you've walked the page table
entirely you know you still have 4M to go, and you have to walk the
second PMD and the page table it points to.
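
The walk described above could look roughly like the following
user-space sketch. It is not kernel code and makes several assumptions:
4K base pages (hence 1024 PTEs per 4M PMD), a hypothetical `cont` flag
standing in for the "addresses an 8M page" PMD flag, and invented names
(struct model_pmd, model_walk_8m(), model_or_flag()).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model of the 8xx 8M layout, 4K base pages. */
#define MODEL_PTRS_PER_PTE 1024u
#define MODEL_PAGE_SIZE    (UINT64_C(4) << 10)                     /* 4K */
#define MODEL_PMD_SIZE     (MODEL_PAGE_SIZE * MODEL_PTRS_PER_PTE)  /* 4M */
#define MODEL_HUGE_SIZE    (UINT64_C(8) << 20)                     /* 8M */

struct model_pmd {
    bool      cont;       /* stand-in for the "8M page" PMD flag */
    uint32_t *pte_table;  /* not a leaf: points to a page table */
};

typedef void (*model_pte_fn)(uint32_t *pte, void *arg);

/* Example callback: OR a flag (passed via arg) into each PTE. */
static void model_or_flag(uint32_t *pte, void *arg)
{
    *pte |= *(const uint32_t *)arg;
}

/*
 * Visit every physical PTE backing the 8M page whose first PMD entry is
 * pmd[0]. After one page table (4M) is walked, 4M remain, so move on to
 * the next PMD entry and the page table it points to.
 * Returns the number of PTEs visited.
 */
static size_t model_walk_8m(struct model_pmd *pmd, model_pte_fn fn, void *arg)
{
    uint64_t remaining = MODEL_HUGE_SIZE;
    size_t visited = 0;

    assert(pmd != NULL && fn != NULL);
    while (remaining) {
        assert(pmd->cont && pmd->pte_table != NULL);
        for (size_t i = 0; i < MODEL_PTRS_PER_PTE; i++) {
            fn(&pmd->pte_table[i], arg);
            visited++;
        }
        remaining -= MODEL_PMD_SIZE;
        pmd++;
    }
    return visited;
}
```

With two PMD entries set up this way, the walk visits 2 x 1024 = 2048
PTEs, which is the "another PTE table we have to modify in the same go"
case.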

By the way, I don't know if it helps or makes things worse, but from a
HW point of view there is no need to replicate the PTE entry 1024 times.
Here we used a standard page table because it looked more generic from
the kernel's point of view, but all the HW needs is a single PTE located
at a page-aligned address. That's what we had when we used huge page
directories (hugepd). It was even easier because both PMD entries were
pointing to the same hugepd entry, hence no need for CONT-PTE-like
management at PTE level.

Christophe