Re: [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk

On 31/01/2025 at 00:19, David Hildenbrand wrote:
On 30.01.25 22:36, Oscar Salvador wrote:
Hi,

last year Peter Xu gave a presentation at LSF/MM on how to better integrate hugetlb
in the mm core.
There are several reasons we want to do that, but the two that matter the most are arguably 1) reducing code duplication and 2) making hugetlb less special.

During the last year several patches that went in that direction were merged, e.g. gup hugetlb unify [1], mprotect for dax PUDs [2], and hugetlb into the generic unmapping
path [3], to name a few.

There was also the question of how to integrate hugetlb into the generic pagewalk, which would get rid of a lot of code and give us a single generic path that could handle
everything.
This was first worked on in [4] (a very basic draft).

Although a second version is in the works, I would like to present some concerns
I have wrt. that work.

Hi Oscar,


HugeTLB has its own way of dealing with things.
E.g: HugeTLB interprets everything as a pte: huge_pte_uffd_wp, huge_pte_clear_uffd_wp,
huge_pte_dirty, huge_pte_modify, huge_pte_wrprotect etc.

One of the challenges that this raises is that if we want pmd/pud walkers to
be able to make sense of hugetlb stuff, we need to implement pud/pmd
(maybe some pmd we already have because of THP) variants of those.

That's the easy case, I'm afraid. The real problem is cont-pte constructs (or worse)
abstracted by hugetlb to be a single unit ("hugetlb pte").

For "ordinary" pages, the cont-pte bit (as on arm64) is nowadays transparently
managed: you can modify any PTE part of the cont-gang and it will just
work as expected, transparently.

Not so with hugetlb, where you have to modify (or even query) the whole thing.

For GUP it was easier, because it was able to grab all the information it needed
from the sub-ptes fairly easily, and it doesn't modify any page tables.


I ran into this problem with folio_walk, and had to document it rather nastily:

 * WARNING: Modifying page table entries in hugetlb VMAs requires a lot of care.
 * For example, PMD page table sharing might require prior unsharing. Also,
 * logical hugetlb entries might span multiple physical page table entries,
 * which *must* be modified in a single operation (set_huge_pte_at(),
 * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
 * not correspond to the first physical entry of a logical hugetlb entry.
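To make the last two sentences of that warning concrete, here is a minimal userspace sketch, not kernel code. Everything prefixed with toy_/TOY_ is hypothetical, including the group size of 16 and the bit layout. It only illustrates that a caller holding an arbitrary sub-entry has to round down to the first physical entry and then update the whole logical entry in one go:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed toy parameters, not taken from any real architecture header. */
#define TOY_CONT_PTES 16u   /* physical entries per logical hugetlb entry */
#define TOY_PTE_WRITE 0x2u  /* "writable" bit in this toy encoding */

typedef uint64_t toy_pte_t;

/* Index of the first physical entry of the group that contains idx. */
static size_t toy_cont_first(size_t idx)
{
    /* The mask below only works for power-of-two group sizes. */
    assert((TOY_CONT_PTES & (TOY_CONT_PTES - 1)) == 0);
    return idx & ~((size_t)TOY_CONT_PTES - 1);
}

/*
 * Write-protect one logical entry. idx may name any sub-entry of the
 * group; every physical entry of the group is updated together.
 */
static void toy_huge_wrprotect(toy_pte_t *table, size_t idx)
{
    size_t first = toy_cont_first(idx);

    for (size_t i = 0; i < TOY_CONT_PTES; i++)
        table[first + i] &= ~(toy_pte_t)TOY_PTE_WRITE;
}
```

Calling toy_huge_wrprotect(table, 21) on a 32-entry table clears the write bit in entries 16 to 31 and leaves entries 0 to 15 alone. The real helpers (set_huge_pte_at() and friends) additionally deal with locking, TLB maintenance and architecture-specific encodings, none of which is modelled here.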

I wanted to use it to rewrite the uprobe code to also handle hugetlb with
less special casing, but that work stalled so far. I think my next attempt would rule
out any non-pmd / non-pud hugetlb pages to make it somewhat simpler.

It all gets weird with things like:

commit 0549e76663730235a10395a7af7ad3d3ce6e2402
Author: Christophe Leroy <christophe.leroy@xxxxxxxxxx>
Date:   Tue Jul 2 15:51:25 2024 +0200

    powerpc/8xx: rework support for 8M pages using contiguous PTE entries

    In order to fit better with standard Linux page tables layout, add support
    for 8M pages using contiguous PTE entries in a standard page table.  Page
    tables will then be populated with 1024 similar entries and two PMD
    entries will point to that page table.

    The PMD entries also get a flag to tell it is addressing an 8M page, this
    is required for the HW tablewalk assistance.

Where we are walking a PTE table, but actually there is another PTE table we
have to modify in the same go.


Very hard to make that non-hugetlb aware, as it's simply completely different compared
to ordinary page table walking/modifications today.

Maybe there are ideas to tackle that, and I'd be very interested in them.



But at least that 8xx change allowed us to get rid of huge page directories (hugepd), which was even more painful IIUC.

Nevertheless, can't we turn that into a standard walk one way or another?

While we walk, we reach a PMD entry which is marked as CONT-PMD but is not tagged as a leaf entry, so there is a page table below it. PMD_SIZE is 4M but the page size is 8M, so once you have walked that page table entirely you know you still have 4M to go, and you have to walk the second PMD and the page table it points to.
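As a rough illustration of that walk, here is a userspace toy model, not kernel code. All toy_/TOY_ names are hypothetical, and the 4K base page size is an assumption chosen so that one page table holds the 1024 entries mentioned in the commit message. The loop keeps consuming PMD entries until the full 8M has been covered:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE    (4UL << 10)  /* assumed 4K base pages */
#define TOY_PMD_SIZE     (4UL << 20)  /* 4M covered by one PMD entry */
#define TOY_HPAGE_SIZE   (8UL << 20)  /* 8M huge page */
#define TOY_PTRS_PER_PTE (TOY_PMD_SIZE / TOY_PAGE_SIZE)  /* 1024 */

struct toy_pmd {
    int cont;                 /* hypothetical "addresses an 8M page" flag */
    unsigned long *pte_table; /* page table below: the PMD is not a leaf */
};

typedef void (*toy_pte_fn)(unsigned long *pte, void *arg);

/* Example callback: clear the entry. */
static void toy_clear_pte(unsigned long *pte, void *arg)
{
    (void)arg;
    *pte = 0;
}

/*
 * Walk one 8M page whose first PMD entry is pmd[0].
 * Returns the number of PTEs visited.
 */
static size_t toy_walk_8m(struct toy_pmd *pmd, toy_pte_fn fn, void *arg)
{
    unsigned long remaining = TOY_HPAGE_SIZE;
    size_t visited = 0;

    while (remaining) {
        /* Each PMD of the pair must be flagged and have a table below. */
        assert(pmd->cont && pmd->pte_table);

        for (size_t i = 0; i < TOY_PTRS_PER_PTE; i++) {
            fn(&pmd->pte_table[i], arg);
            visited++;
        }

        /* One PMD covers 4M; with 8M in total, 4M is still left. */
        remaining -= TOY_PMD_SIZE;
        pmd++;
    }
    return visited;
}
```

With two flagged PMD entries pointing at two tables, toy_walk_8m(pmd, toy_clear_pte, NULL) visits 2 * 1024 = 2048 entries. The sketch leaves out exactly the hard parts discussed above: locking both tables, and making the update of both halves look like a single operation.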

By the way, I don't know whether it helps or makes things worse, but from a HW point of view there is no need to replicate the PTE entry 1024 times. We used a standard page table here because it looked more generic from the kernel's point of view, but all the HW needs is a single PTE located at a page-aligned address. That's what we had when we used huge page directories (hugepd). It was even easier because both PMD entries pointed to the same hugepd entry, hence no need for CONT-PTE-like management at PTE level.

Christophe



