On Wed, 7 Jun 2023, Mike Kravetz wrote:

> > > > > Are there strong objections to extending hugetlb for this support?
> > > >
> > > > I don't want to get too involved in this discussion (busy), but I
> > > > absolutely agree on the points that were raised at LSF/MM that
> > > >
> > > > (A) hugetlb is complicated and very special (many things not integrated
> > > > with core-mm, so we need special-casing all over the place). [example:
> > > > what is a pte?]
> > > >
> > > > (B) We added a bunch of complexity in the past that some people
> > > > considered very important (and it was not feature frozen, right? ;) ).
> > > > Looking back, we might just not have done some of that, or done it
> > > > differently/cleaner -- better integrated in the core. (PMD sharing,
> > > > MAP_PRIVATE, a reservation mechanism that still requires preallocation
> > > > because it fails with NUMA/fork, ...)
> > > >
> > > > (C) Unifying hugetlb and the core looks like it's getting more and more
> > > > out of reach, maybe even impossible with all the complexity we added
> > > > over the years (well, and keep adding).
> > > >
> > > > Sure, HGM for the purpose of better hwpoison handling makes sense. But
> > > > hugetlb is probably 20 years old and hwpoison handling probably 13 years
> > > > old. So we managed to get quite far without that optimization.

Sane handling for memory poisoning and optimizations for live migration are
both much more important for the real-world 1GB hugetlb user, so it doesn't
quite have that lengthy of a history. Unfortunately, cloud providers receive
complaints about both of these from customers. They are one of the most
significant causes of poor customer experience.

While people have proposed 1GB THP support in the past, it was nacked, in
part, because of the suggestion to just use the existing 1GB support in
hugetlb instead :)

> > > > Absolutely, HGM for better postcopy live migration also makes sense, I
> > > > guess nobody disagrees on that.
> > > >
> > > > But as discussed in that session, maybe we should just start anew and
> > > > implement something that integrates nicely with the core, instead of
> > > > making hugetlb more complicated and even more special.

Certainly an ideal would be where we could support everybody's use cases in a
much more cohesive way with the rest of the core MM. I'm particularly
concerned about how long it will take to get to that state even if we had
kernel developers committed to doing the work.

Even if we had a design for this new subsystem that was more tightly coupled
with the core MM, it would take O(years) to implement, test, and extend to
other architectures, and that's before any existing users of hugetlb could
make the changes in the rest of their software stack to support it. We have
no other solution today for 1GB support in Linux, so waiting O(years) for
this yet-to-be-designed future *is* going to cause compounding customer pain
in the real world.

> > > > Now, we all know, nobody wants to do the heavy lifting for that, that's
> > > > why we're discussing how to get in yet another complicated feature.
> > >
> > > If nobody wants to do the heavy lifting and unifying hugetlb with core
> > > MM is becoming impossible as you state, then does adding another
> > > feature to hugetlb (that we all agree is useful for multiple
> > > use cases) really make things worse? In other words, if someone
> >
> > Well, if we (as a community) reject more complexity and outline an
> > alternative of what would be acceptable (rewrite), people that really want
> > these new features will *have to* do the heavy lifting.
> >
> > [and I see many people from employers that might have the capacity to do the
> > heavy lifting if really required being involved in the discussion around HGM
> > :P ]
> >
> > > decides tomorrow to do the heavy lifting, how much harder does this
> > > become because of HGM, if any?
> > >
> > > I am the farthest away from being an expert here, I am just an
> > > observer, but if the answer to the above question is "HGM doesn't
> > > actually make it worse" or "HGM only slightly makes things harder",
> > > then I naively think that it's something that we should do, from a
> > > pure cost-benefit analysis.
> >
> > Well, there is always the "maintainability" aspect, because upstream has to
> > maintain whatever complexity gets merged. No matter what, we'll have to keep
> > maintaining the current set of hugetlb features until we can eventually
> > deprecate it/some in the far, far future.
> >
> > I, for my part, am happy as long as I can stay away as far as possible from
> > hugetlb code. Again, Mike is the maintainer.
>
> Thanks for the reminder :)
>
> Maintainability is my primary concern with HGM. That is one of the reasons
> I proposed that James pitch the topic at LSF/MM. Even though I am the
> 'maintainer', changes introduced by HGM will impact others working in mm.
>
> > What I saw so far regarding HGM does not count as "slightly makes things
> > harder".
> >
> > > Again, I don't have a lot of context here, and I understand everyone's
> > > frustration with the current state of hugetlb. Just my 2 cents.
> >
> > The thing is, we all agree that something that hugetlb provides is valuable
> > (i.e., a pool of huge/large pages that we can map large), just that after 20
> > years there might be better ways of doing it and integrating it better with
> > core-mm.
>
> I am struggling with how to support existing hugetlb users that are running
> into issues like memory errors on hugetlb pages today. And, yes, that is a
> source of real customer issues. They are not really happy with the current
> design where a single error will take out a 1G page, and their VM or
> application. Moving to THP is not likely as they really want a pre-allocated
> pool of 1G pages. I just don't have a good answer for them.
Fully agreed, these customer complaints are a very real and significant
problem that is actively causing pain today for 1GB users. That can't be
overstated. Same for the user who is live migrated because of a disruptive
software update on the host.

We would very much like a future where the hugetlb subsystem is more closely
integrated with the core mm, if only because of the subtle bugs that have
popped up over time in hugetlb, including in its very complex reservation
code. We've funded an initiative around hugetlb reliability because of a
critical dependency on the subsystem as the *only* way to support 1GB
mappings.

Don't get me wrong: integration with core mm is very beneficial from a
reliability and maintenance perspective. I just don't think the right
solution is to mandate O(years) of work *before* we can possibly stop the
very real customer pain.