Re: [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk


 



On 03.02.25 11:10, Oscar Salvador wrote:
On Fri, Jan 31, 2025 at 12:19:24AM +0100, David Hildenbrand wrote:
On 30.01.25 22:36, Oscar Salvador wrote:
Hi Oscar,

Hi David

Hi,



HugeTLB has its own way of dealing with things.
E.g: HugeTLB interprets everything as a pte: huge_pte_uffd_wp, huge_pte_clear_uffd_wp,
huge_pte_dirty, huge_pte_modify, huge_pte_wrprotect etc.

One of the challenges that this raises is that if we want pmd/pud walkers to
be able to make sense of hugetlb stuff, we need to implement pud/pmd
variants of those (some pmd variants we may already have because of THP).

That's the easy case, I'm afraid. The real problem is cont-pte constructs (or worse)
abstracted by hugetlb into a single unit ("hugetlb pte").

Yes, that is a PITA to be honest.
E.g: Most of the code that lives in fs/proc/task_mmu.c works under that
assumption.
I was checking how we could make cont-{pmd,pte} work for hugetlb in that
area (just because it happens to be the first area I am trying to convert).
E.g: Let us have a look at smaps_pte_range/smaps_pte_entry.

I came up with something like the following (completely untested, just a
PoC) to make smaps_pte_range work for cont-ptes.
We would also need to change all the references of PAGE_SIZE to the
actual size of the folio (if there are any).

Unfortunately, it's not that easy: only parts of a large folio might be mapped. It would be PAGE_SIZE * nr_ptes when batching.


But looking closer, folio_pte_batch is only declared in mm/internal.h,
so I am not sure I can make use of that outside the mm/ realm?

Sure, we could export that. We'd need something similar for PMDs, and to batch all other non-present stuff (e.g., migration entries).

If not, how could we work that around?

  diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
  index 193dd5f91fbf..21fb122ebcac 100644
  --- a/fs/proc/task_mmu.c
  +++ b/fs/proc/task_mmu.c
  @@ -802,7 +802,7 @@ static void mss_hugetlb_update(struct mem_size_stats *mss, struct folio *folio,
   #endif
static void smaps_pte_entry(pte_t *pte, unsigned long addr,
  -		struct mm_walk *walk)
  +		struct mm_walk *walk, int nr_batch_entries)
   {
   	struct mem_size_stats *mss = walk->private;
   	struct vm_area_struct *vma = walk->vma;
  @@ -813,8 +813,19 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
if (pte_present(ptent)) {
   		page = vm_normal_page(vma, addr, ptent);
  -		young = pte_young(ptent);
  -		dirty = pte_dirty(ptent);
  +		if (nr_batch_entries) {
  +			int nr;
  +			nr = folio_pte_batch(page_folio(page), addr, pte,
  +					     ptent, nr_batch_entries, 0, NULL,
  +					     &young, &dirty);
  +			if (nr != nr_batch_entries) {
  +				walk->action = ACTION_AGAIN;
  +				return;
  +			}
  +		} else {
  +			young = pte_young(ptent);
  +			dirty = pte_dirty(ptent);
  +		}
   		present = true;
   	} else if (is_swap_pte(ptent)) {
   		swp_entry_t swpent = pte_to_swp_entry(ptent);
  @@ -935,7 +946,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
   			   struct mm_walk *walk)
   {
   	struct vm_area_struct *vma = walk->vma;
  -	pte_t *pte;
  +	pte_t *pte, ptent;
   	spinlock_t *ptl;
ptl = pmd_huge_lock(pmd, vma);
  @@ -950,8 +961,21 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
   		walk->action = ACTION_AGAIN;
   		return 0;
   	}
  +
  +	ptent = ptep_get(pte);
  +	if (pte_present(ptent)) {
  +		struct folio *folio = vm_normal_folio(vma, addr, ptent);
  +		if (folio_test_large(folio)) {
  +			/* Let us use pte batching */
  +			smaps_pte_entry(pte, addr, walk, (end - addr) / PAGE_SIZE);
  +			pte_unmap_unlock(pte, ptl);
  +
  +			return 0;
  +		}
  +	}
  +
   	for (; addr != end; pte++, addr += PAGE_SIZE)
  -		smaps_pte_entry(pte, addr, walk);
  +		smaps_pte_entry(pte, addr, walk, 0);
   	pte_unmap_unlock(pte - 1, ptl);
   out:
   	cond_resched();


I think this is the wrong approach. We should replace that pagewalk API usage by something better that hides all the batching.

The interface would look similar to folio_walk (no callbacks, handling of locking), but (a) work on ranges; (b) work also on non-folio entries; and (c) batch all suitable entries in the range.

Something like a pt_range_walk_start() that returns a "type" (folio range, migration entries, swap range, ...) + stores other details (range, level, ptep, ...) in a structure like folio_walk, to then provide mechanisms to continue (pt_walk_continue()) to walk or abort (pt_walk_done()) it. Similar to page_vma_mapped_walk(), but not specific to a given page/folio.

Then, we would simply process the output of that, with the hope that, for hugetlb, it will just batch all cont-pte / cont-pmd entries into a single return value.

That will make the R/O walking as in task_mmu.c easier, hopefully.

Not so much with PTE/PMD modifications, like damon_mkold_ops ... :( But maybe that just has to be special-cased for hugetlb, somehow ...



For "ordinary" pages, the cont-pte bit (as on arm64) is nowadays transparently
managed: you can modify any PTE part of the cont-gang and it will just
work as expected.

But that is because, AFAIU, on arm64 it knows if it's dealing with a
cont-pte and it queries the status of the whole "gang", right?

Yes, that's how I remember it works on arm64.

Note, though, that non-present PTEs don't use the cont bit, so one must always "batch" these manually.


I see that huge_ptep_get/set_huge_pte_at checks whether it's dealing
with cont-ptes and modifies/queries all sub-ptes.
How is that done for non-hugetlb pages?

Similarly, at least on arm64: when setting individual PTEs, it automatically sets the cont-pte bit once it detects that the applicable PTEs are all "cont-compatible" (e.g., belong to the same folio, have the same protection bits). Other architectures might require more work (powerpc).


--
Cheers,

David / dhildenb




