[RFC PATCH 0/3] Reduce dependence on vmas deep in hugetlb allocation code

Ackerley Tng <ackerleytng@xxxxxxxxxx> · Fri, 11 Oct 2024 23:22:35 +0000

I hope these 3 patches can start a discussion on eventually removing
the need to pass a struct vm_area_struct (vma) pointer when taking a
folio from the global pool (i.e. in dequeue_hugetlb_folio_vma()).

Why eliminate passing the vma pointer?

VMAs are more related to mapping into userspace, and it would be cleaner if the
HugeTLB folio allocation process could just focus on returning a folio.

Currently, the vma is a convenient struct that holds pieces of information
required in the allocation process, but dequeuing should not depend on the VMA
concept.

If the vma is needed deep in the allocation process, then allocation could
become awkward, such as in HugeTLBfs's fallocate, where there is no vma (yet)
and a pseudo-vma has to be created.

Separation will help with HugeTLB unification. The buddy allocator is a useful
reference here: __alloc_pages_noprof() is conceptually separate from VMAs.
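
To make the direction concrete, here is a minimal userspace sketch of a
dequeue that takes explicit inputs instead of a vma. It is a toy model, not
kernel code and not what these patches implement: every toy_* name, the
bitmask standing in for a nodemask, and the context struct are hypothetical
and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; none of these are kernel types. */
#define TOY_MAX_NODES 4

struct toy_folio {
	int nid;			/* NUMA node the folio lives on */
	struct toy_folio *next;
};

struct toy_hstate {
	struct toy_folio *free_list[TOY_MAX_NODES];
	long free_huge_pages;
	long resv_huge_pages;
};

/*
 * Hypothetical replacement for the vma argument: the caller resolves
 * the placement policy and the reservation decision up front and
 * passes only the results down.
 */
struct toy_alloc_ctx {
	int preferred_nid;
	unsigned int allowed_nodes;	/* bit n set => node n allowed */
	bool use_hstate_resv;		/* caller holds a reservation */
};

static void toy_enqueue(struct toy_hstate *h, struct toy_folio *f)
{
	assert(f->nid >= 0 && f->nid < TOY_MAX_NODES);
	f->next = h->free_list[f->nid];
	h->free_list[f->nid] = f;
	h->free_huge_pages++;
}

/* Dequeue a folio without consulting any vma. */
static struct toy_folio *toy_dequeue(struct toy_hstate *h,
				     const struct toy_alloc_ctx *ctx)
{
	int i;

	assert(ctx->preferred_nid >= 0 && ctx->preferred_nid < TOY_MAX_NODES);

	/* Without a reservation, only unreserved free pages may be taken. */
	if (!ctx->use_hstate_resv &&
	    h->free_huge_pages - h->resv_huge_pages <= 0)
		return NULL;

	/* Try the preferred node first, then the other allowed nodes. */
	for (i = 0; i < TOY_MAX_NODES; i++) {
		int nid = (ctx->preferred_nid + i) % TOY_MAX_NODES;
		struct toy_folio *f = h->free_list[nid];

		if (!(ctx->allowed_nodes & (1u << nid)) || !f)
			continue;

		h->free_list[nid] = f->next;
		f->next = NULL;
		h->free_huge_pages--;
		if (ctx->use_hstate_resv)
			h->resv_huge_pages--;	/* consume the reservation */
		return f;
	}
	return NULL;
}
```

In this model, a caller without a vma (the fallocate or guest_memfd cases
above) would fill in a toy_alloc_ctx directly instead of constructing a
pseudo-vma.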

I started looking into this because we want to use HugeTLB folios in guest_memfd
[1], and then I found that the HugeTLB folio allocation process is tightly
coupled with VMAs. This makes it hard to use HugeTLB folios in guest_memfd,
which does not have VMAs for private pages.

Then, I watched Peter Xu's talk at LSFMM [2] about HugeTLB unification and
thought that these patches could also contribute to the unification effort.

As discussed at LPC 2024 [3], the general preference is for guest_memfd to use
HugeTLB folios. While that is being worked out, I hope these patches can be
considered and merged separately. I believe they are still useful in making the
resv_map/subpool/hstate reservation system in HugeTLB easier to understand, and
no functional changes are intended.

---

Why use HugeTLB folios in guest_memfd?

HugeTLB is *the* source of 1G pages in the kernel today and it would be best for
all 1G page users (HugeTLB, HugeTLBfs, or guest_memfd) on a host to draw from
the same pool of 1G pages.

This allows central tracking of all 1G pages, a precious resource on a machine.

Having a separate 1G page allocator would not only require rebuilding
features that HugeTLB already has, but also split the 1G pool. If both
allocators are used on a machine, it would be complicated to

(a) predetermine how many pages to put in each allocator's pool or
(b) transfer pages between the pools at runtime.

---

[1] https://lore.kernel.org/all/cover.1726009989.git.ackerleytng@xxxxxxxxxx/T/
[2] https://youtu.be/7k-m2gTDu2k?si=ghWZ6qa1GAdaHOFP
[3] https://youtu.be/PVTjLLEpozE?si=HvdDlUc_4ElVXu5R

Ackerley Tng (3):
  mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma()
  mm: hugetlb: Refactor vma_has_reserves() to should_use_hstate_resv()
  mm: hugetlb: Remove unnecessary check for avoid_reserve

 mm/hugetlb.c | 57 +++++++++++++++++++++-------------------------------
 1 file changed, 23 insertions(+), 34 deletions(-)

--
2.47.0.rc1.288.g06298d1525-goog