On 06/07/23 10:13, David Hildenbrand wrote:
> On 07.06.23 09:51, Yosry Ahmed wrote:
> > On Wed, Jun 7, 2023 at 12:38 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> > >
> > > On 07.06.23 00:40, David Rientjes wrote:
> > > > On Fri, 2 Jun 2023, Mike Kravetz wrote:
> > > >
> > > > > The benefit of HGM in the case of memory errors is fairly obvious. As
> > > > > mentioned above, when a memory error is encountered on a hugetlb page,
> > > > > that entire hugetlb page becomes inaccessible to the application.
> > > > > Losing 1G or even 2M of data is often catastrophic for an application.
> > > > > There is often no way to recover. It just makes sense that recovering
> > > > > from the loss of 4K of data would generally be easier and more likely
> > > > > to be possible. Today, when Oracle DB encounters a hard memory error
> > > > > on a hugetlb page it will shut down. Plans are currently in place to
> > > > > repair and recover from such errors if possible. Isolating the area
> > > > > of data loss to a single 4K page significantly increases the
> > > > > likelihood of repair and recovery.
> > > > >
> > > > > Today, when a memory error is encountered on a hugetlb page, an
> > > > > application is 'notified' of the error by a SIGBUS, as well as the
> > > > > virtual address of the hugetlb page and its size. This makes sense as
> > > > > hugetlb pages are accessed by a single page table entry, so you get
> > > > > all or nothing. As mentioned by James above, this is catastrophic for
> > > > > VMs, as the hypervisor has just been told that 2M or 1G is now
> > > > > inaccessible. With HGM, we can isolate such errors to 4K.
> > > > >
> > > > > Backing VMs with hugetlb pages is a real use case today. We are
> > > > > seeing memory errors on such hugetlb pages with the result being VM
> > > > > failures. One of the advantages of backing VMs with THPs is that they
> > > > > are split in the case of memory errors. HGM would allow similar
> > > > > functionality.
> > > >
> > > > Thanks for this context, Mike, it's very useful.
> > > >
> > > > I think everybody is aligned on the desire to map memory at smaller
> > > > granularities for multiple use cases, and it's fairly clear that these
> > > > use cases are critically important to multiple stakeholders.
> > > >
> > > > I think the open question is whether this functionality is supported
> > > > in hugetlbfs (like with HGM) or whether there is a hard requirement
> > > > that we must use THP for this support.
> > > >
> > > > I don't think that hugetlbfs is feature frozen, but if there's a
> > > > strong bias toward not merging additional complexity into the
> > > > subsystem, that would be useful to know. I personally think the
> > > > critical use cases described
> > >
> > > At least I, attending that session, thought that it was clear that the
> > > majority of the people speaking up clearly expressed "no more added
> > > complexity". So I think there is a clear strong bias, at least from the
> > > people attending that session.
> > >
> > > > above justify the added complexity of HGM to hugetlb, and we wouldn't
> > > > be blocked by the long-standing (15+ years) desire to mesh hugetlb
> > > > into the core MM subsystem before we can stop the pain associated
> > > > with memory poisoning and live migration.
> > > >
> > > > Are there strong objections to extending hugetlb for this support?
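
To make the notification mechanism described above concrete, here is a
minimal sketch of the receiving side. Only the SIGBUS/BUS_MCEERR_*
reporting and the si_addr/si_addr_lsb fields are the actual documented
interface; the handler itself and its recovery policy are purely
illustrative.

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * The kernel sends SIGBUS with si_code BUS_MCEERR_AR (access to poisoned
 * memory) or BUS_MCEERR_AO (asynchronous notification).  si_addr points
 * into the mapping and si_addr_lsb encodes the size of the inaccessible
 * region: today log2 of the hugetlb page size (30 for 1G, 21 for 2M);
 * with HGM it could be 12, i.e. a single 4K page.
 */
static void mce_handler(int sig, siginfo_t *si, void *ucontext)
{
	if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO) {
		size_t lost = (size_t)1 << si->si_addr_lsb;

		/* fprintf is not async-signal-safe; illustration only */
		fprintf(stderr, "memory error at %p, %zu bytes lost\n",
			si->si_addr, lost);
		/* a real app would attempt repair/refetch of that range */
	}
	_exit(1);
}

int main(void)
{
	struct sigaction sa = { 0 };

	sa.sa_flags = SA_SIGINFO;
	sa.sa_sigaction = mce_handler;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, NULL);
	/* ... map hugetlb memory and run the workload ... */
	return 0;
}
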
> > >
> > > I don't want to get too involved in this discussion (busy), but I
> > > absolutely agree on the points that were raised at LSF/MM that
> > >
> > > (A) hugetlb is complicated and very special (many things not integrated
> > > with core-mm, so we need special-casing all over the place). [example:
> > > what is a pte?]
> > >
> > > (B) We added a bunch of complexity in the past that some people
> > > considered very important (and it was not feature frozen, right? ;) ).
> > > Looking back, we might just not have done some of that, or done it
> > > differently/cleaner -- better integrated in the core. (PMD sharing,
> > > MAP_PRIVATE, a reservation mechanism that still requires preallocation
> > > because it fails with NUMA/fork, ...)
> > >
> > > (C) Unifying hugetlb and the core looks like it's getting more and more
> > > out of reach, maybe even impossible with all the complexity we added
> > > over the years (well, and keep adding).
> > >
> > > Sure, HGM for the purpose of better hwpoison handling makes sense. But
> > > hugetlb is probably 20 years old and hwpoison handling probably 13
> > > years old, so we managed to get quite far without that optimization.
> > >
> > > Absolutely, HGM for better postcopy live migration also makes sense; I
> > > guess nobody disagrees on that.
> > >
> > > But as discussed in that session, maybe we should just start anew and
> > > implement something that integrates nicely with the core, instead of
> > > making hugetlb more complicated and even more special.
> > >
> > > Now, we all know, nobody wants to do the heavy lifting for that; that's
> > > why we're discussing how to get in yet another complicated feature.
> >
> > If nobody wants to do the heavy lifting, and unifying hugetlb with core
> > MM is becoming impossible as you state, then does adding another
> > feature to hugetlb (one that we all agree is useful for multiple use
> > cases) really make things worse? In other words, if someone
>
> Well, if we (as a community) reject more complexity and outline an
> alternative of what would be acceptable (rewrite), people that really
> want these new features will *have to* do the heavy lifting.
>
> [and I see many people from employers that might have the capacity to do
> the heavy lifting, if really required, being involved in the discussion
> around HGM :P ]
>
> > decides tomorrow to do the heavy lifting, how much harder does this
> > become because of HGM, if any?
> >
> > I am the farthest away from being an expert here, I am just an
> > observer, but if the answer to the above question is "HGM doesn't
> > actually make it worse" or "HGM only slightly makes things harder",
> > then I naively think that it's something we should do, from a pure
> > cost-benefit analysis.
>
> Well, there is always the "maintainability" aspect, because upstream has
> to maintain whatever complexity gets merged. No matter what, we'll have
> to keep maintaining the current set of hugetlb features until we can
> eventually deprecate it (or some of it) in the far, far future.
>
> I, for my part, am happy as long as I can stay as far away as possible
> from hugetlb code. Again, Mike is the maintainer.

Thanks for the reminder :)

Maintainability is my primary concern with HGM. That is one of the
reasons I proposed that James pitch the topic at LSF/MM. Even though I am
the 'maintainer', changes introduced by HGM will impact others working
in mm.

> What I saw so far regarding HGM does not count as "slightly makes things
> harder".
>
> > Again, I don't have a lot of context here, and I understand everyone's
> > frustration with the current state of hugetlb. Just my 2 cents.
>
> The thing is, we all agree that something that hugetlb provides is
> valuable (i.e., a pool of huge/large pages that we can map large), just
> that after 20 years there might be better ways of doing it and
> integrating it better with core-mm.

I am struggling with how to support existing hugetlb users that are
running into issues like memory errors on hugetlb pages today. And, yes,
that is a source of real customer issues. They are not really happy with
the current design, where a single error will take out a 1G page, and
with it their VM or application. Moving to THP is not likely, as they
really want a pre-allocated pool of 1G pages.

I just don't have a good answer for them.
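
For anyone who has not used it, the pool model these users depend on
looks roughly like the sketch below. The boot-time pool setup and the
mmap flags are the existing interface; the sizes and the MAP_HUGE_1GB
fallback define are just for illustration.

/*
 * The pool is preallocated at boot, e.g. with:
 *     hugepagesz=1G hugepages=16
 * on the kernel command line, then mapped like this:
 */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB	(30 << 26)	/* log2(1G) << MAP_HUGE_SHIFT */
#endif

#define GB	(1024UL * 1024 * 1024)

int main(void)
{
	/* back VM/application memory with 1G pages from the boot-time pool */
	void *mem = mmap(NULL, 4 * GB, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
			 MAP_HUGE_1GB, -1, 0);
	if (mem == MAP_FAILED) {
		perror("mmap");		/* e.g. the 1G pool is exhausted */
		return 1;
	}

	/*
	 * Today, one uncorrectable error anywhere in 'mem' poisons a full
	 * 1G page; further access to any of it raises the SIGBUS shown
	 * earlier.  HGM would keep all but the bad 4K mapped.
	 */
	munmap(mem, 4 * GB);
	return 0;
}

--
Mike Kravetz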