Re: [LSF/MM/BPF TOPIC] HugeTLB generic pagewalk

On Thu, Jan 30, 2025 at 10:36:51PM +0100, Oscar Salvador wrote:
> Hi,

Hello, Oscar,

> 
> last year Peter Xu gave a presentation at LSF/MM/BPF on how to better integrate
> hugetlb into the mm core.
> There are several reasons we want to do that, but arguably the two that matter
> the most are 1) code duplication and 2) making hugetlb less special.
> 
> During the last year several patches moving in that direction were merged, e.g.
> the gup hugetlb unification [1], mprotect for dax PUDs [2], and hugetlb in the
> generic unmapping path [3], to name a few.
> 
> There was also the question of how to integrate hugetlb into the generic
> pagewalk, getting rid of a lot of code in the process and ending up with one
> generic path that can handle everything.
> A first (very basic) draft of this was posted in [4].
> 
> Although a second version is in the works, I would like to raise some concerns
> I have with that work.
> 
> HugeTLB has its own way of dealing with things.
> E.g.: HugeTLB interprets everything as a pte: huge_pte_uffd_wp, huge_pte_clear_uffd_wp,
> huge_pte_dirty, huge_pte_modify, huge_pte_wrprotect, etc.
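
For reference, those huge_pte_* helpers are mostly thin wrappers that just
forward to the corresponding pte_* helper, whatever the pgtable level. Below is
a simplified sketch of that pattern; pte_t and the bit layout are stand-ins for
illustration, not the real kernel definitions:

```c
#include <stdbool.h>
#include <assert.h>

/* Stand-in pte type and bits, modeled loosely on x86; not real kernel code. */
typedef struct { unsigned long val; } pte_t;

#define _PAGE_RW    (1UL << 1)
#define _PAGE_DIRTY (1UL << 6)

static inline pte_t pte_wrprotect(pte_t pte)
{
	pte.val &= ~_PAGE_RW;
	return pte;
}

static inline bool pte_dirty(pte_t pte)
{
	return pte.val & _PAGE_DIRTY;
}

/*
 * The hugetlb variants reuse the pte helpers verbatim, because hugetlb
 * interprets every entry as a pte regardless of its pgtable level.
 */
static inline pte_t huge_pte_wrprotect(pte_t pte)
{
	return pte_wrprotect(pte);
}

static inline bool huge_pte_dirty(pte_t pte)
{
	return pte_dirty(pte);
}
```

That works precisely because everything goes through the pte interpretation;
the trouble starts when a walker wants to reason at the pud/pmd level instead.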
> 
> One of the challenges this raises is that if we want pmd/pud walkers to be
> able to make sense of hugetlb entries, we need to implement pud/pmd variants
> of those helpers (some pmd ones already exist because of THP).
> 
> E.g.: HugeTLB code uses is_swap_pte and pte_to_swp_entry.
> If we want PUD walkers to be able to handle hugetlb, we would need some sort
> of is_swap_pud and pud_to_swp_entry implementations.
> The same goes for a handful of other functions (e.g. huge_pte_*_uffd_wp,
> hugetlb pte markers, etc.).
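
To make the gap concrete: here is a simplified model of the existing pte-level
check next to the pud-level mirror a generic walker would need. The types and
bit layout are stand-ins, and is_swap_pud()/pud_to_swp_entry() are hypothetical
names for the missing helpers mentioned above:

```c
#include <stdbool.h>
#include <assert.h>

/* Stand-in pgtable entry types; not the real kernel definitions. */
typedef struct { unsigned long val; } pte_t;
typedef struct { unsigned long val; } pud_t;
typedef struct { unsigned long val; } swp_entry_t;

#define _PAGE_PRESENT (1UL << 0)

static inline bool pte_none(pte_t pte)    { return pte.val == 0; }
static inline bool pte_present(pte_t pte) { return pte.val & _PAGE_PRESENT; }

/* Existing pte-level check: a non-empty, non-present entry is a swap entry. */
static inline bool is_swap_pte(pte_t pte)
{
	return !pte_none(pte) && !pte_present(pte);
}

static inline bool pud_none(pud_t pud)    { return pud.val == 0; }
static inline bool pud_present(pud_t pud) { return pud.val & _PAGE_PRESENT; }

/* Hypothetical pud-level mirror that a generic PUD walker would need: */
static inline bool is_swap_pud(pud_t pud)
{
	return !pud_none(pud) && !pud_present(pud);
}

static inline swp_entry_t pud_to_swp_entry(pud_t pud)
{
	/* Real code would also strip arch softbits; keep the sketch minimal. */
	return (swp_entry_t){ .val = pud.val >> 1 };
}
```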
> 
> This has never been a problem because hugetlb has its own way of doing things
> and we implemented code around that logic, but it falls off a cliff as soon
> as we try to make hugetlb less special and more generic, because we need to
> start implementing pud/pmd variants of all those pte_* helpers.
> 
> I would like to know how people feel about this: whether it is something worth
> pursuing, or whether we live with the fact that HugeTLB is special and leave
> it that way.
> 
> [1]
> https://patchwork.kernel.org/project/linux-mm/cover/20240327152332.950956-1-peterx@xxxxxxxxxx/
> [2]
> https://patchwork.kernel.org/project/linux-mm/cover/20240812181225.1360970-1-peterx@xxxxxxxxxx/
> [3]
> https://patchwork.kernel.org/project/linux-mm/cover/20241007075037.267650-1-osalvador@xxxxxxx/
> [4]
> https://patchwork.kernel.org/project/linux-mm/cover/20240704043132.28501-1-osalvador@xxxxxxx/

Thanks for bringing up this topic.

I won't be able to apply for LSF/MM this year due to family plans, but I can
share some quick thoughts in case they are helpful.  We have also had some
relevant discussions on this, but I guess most of them did not happen on the
list.

I definitely agree with you that such a cleanup of hugetlb would always be
nice, especially on the pgtable side.  Fundamentally, that is because huge
mappings in the pgtables are no different for hugetlbfs than for any other
form of huge mapping, at least on the archs I'm aware of.  In general, the
pgtable part only defines the size of a mapping, not its attributes.

I have also shared with you before the concern I have with this work: not only
do we have limited developer resources for doing the cleanup, but also for
reviewing it properly.  In general, knowing that hugetlbfs may stay
feature-frozen after the HGM attempt, I have started to evaluate the pros and
cons of such a global cleanup, and whether it will pay off for everyone, given
the risk of easily breaking existing hugetlbfs users.

I wouldn't be surprised if the whole effort took at least a five-digit LOC
change to complete, even assuming the idea is 100% workable and the code 100%
perfect, which is still quite some effort.

The gain from the whole effort would be a clean pgtable code base, with no
functional change (hopefully!  unless we regress some perf here and there..
the hugetlb API is normally _slightly_ faster, even if uglier, per my
"can-be-outdated" impression..).

So that's my major concern: whether we should stick with cleaning everything
up, or think about other approaches, e.g.:

  - We could still pick the low-hanging fruit where we see fit: work that is
    self-contained and has a direct benefit.  E.g. I think it may still make
    sense to finish your page walk API rewrite, at least if it's already
    halfway through (which is my gut feeling, but you know best..).

  - We could think about refactoring hugetlb in a way that makes it more
    usable and provides new features, rather than reworking on top of a
    feature-frozen base where we can't get more than "cleanups" only.

The latter is also why I started looking at integrating HugeTLB pages / folios
without hugetlbfs being present.  So far gmem looks like a good container for
them, as confidential computing will have a similar demand to allocate 1G pages
from somewhere, and that "somewhere" shares a lot of the issues hugetlbfs also
has to resolve.  That means it makes sense to me to rework that part of
hugetlbfs to suit more consumers (which I have started calling "hugetlb pages /
folios" vs. "hugetlbfs", just to differentiate them from the file system).

And if gmem 1G can work with CoCo, it should be pretty simple to extend it to
!CoCo (which fundamentally means consuming gmem folios in place, with no need
for private<->shared conversions), which means there's a chance for a VM cloud
provider (private or public) to move over to gmem and completely replace
hugetlbfs 1G, for either confidential or normal VMs.

Then there would be a hugetlb-based (not hugetlbfs-based) solution that is not
feature-frozen, and meanwhile whatever we rework in hugetlb from this
perspective will not only be cleanups, but will pave the way for anything built
on top of hugetlb folios to work properly.

That doesn't justify stopping the cleanup of hugetlbfs code, though.  So there
is still the 3rd option: we could choose to try to finish this work.  It will
just be challenging in different ways.

Sorry for all these pretty random thoughts.  Please ignore them if some (if
not all..) don't apply to the topic you plan to discuss!

Thanks,

-- 
Peter Xu
