On 31/01/2025 at 00:19, David Hildenbrand wrote:
On 30.01.25 22:36, Oscar Salvador wrote:
Hi,
last year Peter Xu gave a presentation at LSF/MM on how to better
integrate hugetlb in the mm core.
There are several reasons we want to do that, but one could say that
the two that matter the most are 1) code duplication and 2) making
hugetlb less special.
During the last year several patches that went in that direction were
merged, e.g.: gup hugetlb unify [1], mprotect for dax PUDs [2], hugetlb
into generic unmapping path [3], to name some.
There was also a concern on how to integrate hugetlb into the generic
pagewalk, getting rid, by doing so, of a lot of code and having a
generic path that could handle everything.
This was first worked on in [4] (very basic draft).
Although a second version is in the works, I would like to present
some concerns I have wrt. that work.
Hi Oscar,
HugeTLB has its own way of dealing with things.
E.g.: HugeTLB interprets everything as a pte: huge_pte_uffd_wp,
huge_pte_clear_uffd_wp, huge_pte_dirty, huge_pte_modify,
huge_pte_wrprotect etc.
One of the challenges that this raises is that if we want pmd/pud
walkers to be able to make sense of hugetlb stuff, we need to implement
pud/pmd variants of those (maybe some pmd ones we already have because
of THP).
That's the easy case, I'm afraid. The real problem is cont-pte
constructs (or worse) abstracted by hugetlb to be a single unit
("hugetlb pte").
For "ordinary" pages, the cont-pte bit (as on arm64) is nowadays
transparently managed: you can modify any PTE that is part of the
cont-gang and it will just work as expected.
Not so with hugetlb, where you have to modify (or even query) the whole
thing.
For GUP it was easier, because it was able to grab all the information
it needed from the sub-ptes fairly easily, and it doesn't modify any
page tables.
I ran into this problem with folio_walk, and had to document it rather
nastily:
 * WARNING: Modifying page table entries in hugetlb VMAs requires a lot of care.
 * For example, PMD page table sharing might require prior unsharing. Also,
 * logical hugetlb entries might span multiple physical page table entries,
 * which *must* be modified in a single operation (set_huge_pte_at(),
 * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
 * not correspond to the first physical entry of a logical hugetlb entry.
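To make the last two points of that warning concrete, here is a toy
userspace sketch, not kernel code. It assumes a logical entry made of 16
contiguous physical entries (the arm64 cont-pte case with 4K pages) and
stores only flag bits, ignoring physical addresses. TOY_CONT_PTES,
toy_cont_first() and toy_set_huge_flags() are hypothetical names made up
for the illustration; they only mimic what a set_huge_pte_at()-style
helper has to guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 16 physical entries form one logical hugetlb entry. */
#define TOY_CONT_PTES 16

/*
 * Round a pointer into a PTE table down to the first physical entry of
 * the contiguous group it belongs to. 'table' is the start of the table.
 */
static unsigned long *toy_cont_first(unsigned long *table, unsigned long *ptep)
{
	size_t idx = (size_t)(ptep - table);

	return table + (idx & ~(size_t)(TOY_CONT_PTES - 1));
}

/*
 * Update the whole logical entry in one operation, whichever physical
 * entry the caller happened to be looking at.
 */
static void toy_set_huge_flags(unsigned long *table, unsigned long *ptep,
			       unsigned long flags)
{
	unsigned long *first = toy_cont_first(table, ptep);
	size_t i;

	for (i = 0; i < TOY_CONT_PTES; i++)
		first[i] = flags;
}
```

A walker that landed on entry 21 of a table would end up rewriting
entries 16..31 and nothing else, which is exactly the part a generic
walker does not know how to do today.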
I wanted to use it to rewrite the uprobe code to also handle hugetlb
with less special casing, but that work has stalled so far. I think my
next attempt would rule out any non-pmd / non-pud hugetlb pages to make
it somewhat simpler.
It all gets weird with things like:
commit 0549e76663730235a10395a7af7ad3d3ce6e2402
Author: Christophe Leroy <christophe.leroy@xxxxxxxxxx>
Date: Tue Jul 2 15:51:25 2024 +0200

    powerpc/8xx: rework support for 8M pages using contiguous PTE entries

    In order to fit better with standard Linux page tables layout, add
    support for 8M pages using contiguous PTE entries in a standard page
    table. Page tables will then be populated with 1024 similar entries
    and two PMD entries will point to that page table.

    The PMD entries also get a flag to tell it is addressing an 8M page,
    this is required for the HW tablewalk assistance.

Where we are walking a PTE table, but actually there is another PTE
table we have to modify in the same go.
Very hard to make that non-hugetlb aware, as it's simply completely
different from ordinary page table walking/modifications today.
Maybe there are ideas to tackle that, and I'd be very interested in them.
But at least that 8xx change allowed us to get rid of huge page
directories (hugepd), which were even more painful IIUC.
Nevertheless, can't we turn that into a standard walk one way or
another?
While we walk we reach a PMD entry which is marked as a CONT-PMD, but it
is not tagged as a leaf entry, so there is a page table below. PMD_SIZE
is 4M but the page size is 8M so once you've walked the page table
entirely you know you still have 4M to go so you have to walk the second
PMD and the page table it points to.
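As a rough illustration of that walk, here is a toy userspace model, not
the 8xx implementation. It assumes 4K base pages, 4M covered per PMD
entry (so 1024 PTEs per page table) and an 8M huge page. struct toy_pmd,
its 'cont' flag, toy_set_pte() and toy_walk_8m() are hypothetical names
invented for the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT   12			/* 4K base pages */
#define TOY_PMD_SHIFT    22			/* 4M per PMD entry */
#define TOY_PMD_SIZE     (1UL << TOY_PMD_SHIFT)
#define TOY_PTRS_PER_PTE (1UL << (TOY_PMD_SHIFT - TOY_PAGE_SHIFT)) /* 1024 */
#define TOY_HPAGE_SIZE   (8UL << 20)		/* 8M huge page */

struct toy_pmd {
	int cont;		/* hypothetical "addresses an 8M page" flag */
	unsigned long *ptes;	/* page table with TOY_PTRS_PER_PTE entries */
};

/* Example callback: store the value pointed to by 'arg' in the PTE slot. */
static void toy_set_pte(unsigned long *pte, void *arg)
{
	*pte = *(unsigned long *)arg;
}

/*
 * Visit every PTE slot backing one 8M page. 'addr' must be 8M aligned.
 * After the first page table has been walked entirely there are still
 * 4M to go, so the loop moves on to the second PMD and its page table.
 * Returns the number of slots visited.
 */
static size_t toy_walk_8m(struct toy_pmd *pmd_table, unsigned long addr,
			  void (*fn)(unsigned long *pte, void *arg), void *arg)
{
	unsigned long end = addr + TOY_HPAGE_SIZE;
	size_t visited = 0;
	size_t i;

	assert((addr & (TOY_HPAGE_SIZE - 1)) == 0);

	while (addr < end) {
		struct toy_pmd *pmd = &pmd_table[addr >> TOY_PMD_SHIFT];

		/* Not a leaf: a CONT-PMD with a page table below it. */
		assert(pmd->cont && pmd->ptes);

		for (i = 0; i < TOY_PTRS_PER_PTE; i++) {
			fn(&pmd->ptes[i], arg);
			visited++;
		}
		addr += TOY_PMD_SIZE;
	}
	return visited;
}
```

With these numbers a single 8M page costs 2 * 1024 = 2048 slot visits
spread over two page tables, and the only hugetlb-specific knowledge is
that the range to cover is 8M rather than PMD_SIZE.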
By the way, I don't know whether it helps or makes things worse, but
from a HW point of view there is indeed no need to replicate the PTE
entry 1024 times. Here we used a standard page table because it looked
more generic from the kernel's point of view, but all the HW needs is a
single PTE located at a page-aligned address. That's what we had when we
used huge page directories (hugepd). It was even easier because both PMD
entries were pointing to the same hugepd entry, hence no need for
CONT-PTE-like management at PTE level.
Christophe