Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday

On Thu, Jun 15, 2023 at 1:04 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Wed 14-06-23 16:04:58, Mike Kravetz wrote:
> > On 06/12/23 18:59, David Rientjes wrote:
> > > This week's topic will be a technical brainstorming session on HugeTLB
> > > convergence with the core MM.  This has been discussed most recently in
> > > this thread:
> > > https://lore.kernel.org/linux-mm/ZIOEDTUBrBg6tepk@xxxxxxxxxxxxxxxxxxxx/T/
> >
> > Thank you David for putting this session together!  And, thanks to everyone
> > who participated.
> >
> > Following up on linux-mm with the most active participants on Cc
> > (sorry if I missed someone).  If it makes more sense to continue the
> > above thread, please move there.
> >
> > Even though everyone knows that hugetlb is special cased throughout the
> > core mm, it came to a head with the proposed introduction of HGM.  TBH,
> > few people in the core mm community paid much attention to HGM when it
> > was first introduced.  An LSF/MM session was then dedicated to the
> > discussion of HGM, with the outcome being the suggestion to create a
> > new filesystem/driver (hugetlb2, if you will) that would satisfy the
> > use cases requiring HGM.
> > One thing that was not emphasized at LSF/MM is that there are existing
> > hugetlb users experiencing major issues that could be addressed with HGM:
> > specifically the issues of memory errors and live migration.  That was
> > the starting point for recent discussion in the above thread.
> >
> > I may be wrong, but it appeared the direction of that thread was to
> > first try to unify some of the hugetlb and core mm code and eliminate
> > some of the special casing.  If hugetlb were less of a special case, then
> > perhaps HGM would be more acceptable.  That is the impression I (perhaps
> > incorrectly) had going into today's session.
>
> My impression from the discussion yesterday was that the level of
> unification would need to be really large and time consuming in order
> to get the HGM patchset into a more maintainable form. The final
> outcome is quite hard to predict at this stage.

I also had this impression, but some of the unification efforts are
pretty independent of HGM (like the PTE/PMD/PUD typing idea) and don't
really change HGM all that much. My understanding is: do some general
unification first, and then we could take HGM. HGM itself is still
going to look mostly the same.
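
To make the typing idea concrete, here is a minimal sketch (purely
illustrative; the names hgm_level and hgm_pte are hypothetical, not
what any posted patch uses) of carrying the level alongside the entry
pointer instead of treating everything as a bare pte_t:

/*
 * Illustrative sketch only, assuming kernel context (pte_t etc. from
 * <linux/pgtable.h>). The point is that hugetlb code would pass
 * around an entry plus its real level instead of a bare pte_t * that
 * is silently assumed to map a huge region.
 */
enum hgm_level {
        HGM_LEVEL_PTE,
        HGM_LEVEL_PMD,
        HGM_LEVEL_PUD,
};

struct hgm_pte {
        pte_t *ptep;            /* pointer into the page tables */
        unsigned int shift;     /* PAGE_SHIFT, PMD_SHIFT or PUD_SHIFT */
        enum hgm_level level;
};

/* How much memory this entry maps, derived from its actual level. */
static inline unsigned long hgm_pte_size(const struct hgm_pte *hpte)
{
        return 1UL << hpte->shift;
}

The cost Mike mentions below is visible even in this toy: every
hugetlb path that today takes a pte_t * would have to be converted to
take the wrapper and honor the level.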

>
> > During today's session, we often discussed what would/could be introduced
> > in a hugetlb v2.  The idea is that this would be the ideal place for HGM.
> > However, people also drew comparisons to the cgroup v1 to v2
> > transition.  Such a redesign provides the needed 'clean slate' to do
> > things right, but it does little for existing users who would be
> > unwilling to quickly move off existing hugetlb.
> >
> > We did spend a good chunk of time on hugetlb/core mm unification and
> > removing special casing.  In some (most) of these cases, removing
> > special cases from the core mm would result in adding more code to
> > hugetlb.  For example: proper typing so that hugetlb does not treat
> > all page table entries as PTEs.  Again, I may be wrong, but I think
> > people were OK with adding more code (and even complexity) to hugetlb
> > if it eliminated special casing in the core mm.  But there did not
> > seem to be a clear consensus, especially with the thought that we may
> > need to double the hugetlb code to get the types right.
>
> This is primarily your call as a maintainer. If you ask me, hugetlb is
> already over complicated in its current form. Regressions are not
> exactly rare when code is added, which is a signal that we are hitting
> maintenance cost walls. This doesn't mean further development is
> impossible, of course, but it is increasingly costly AFAICS.
>
> > Unless I missed something, there was no clear direction at the end of this
> > session.  I was hoping that we could come up with a plan to address the
> > issues facing today's hugetlb users.  IMO, there seems to be two options:
> > 1) Start work on hugetlb v2 with the intention that customers will need
> >    to move to this to address their issues.
> > 2) Incorporate functionality like HGM into existing hugetlb.
>
> From the memcg experience I can tell that cleaning up interfaces and
> supported scenarios helped a lot. We still have to maintain v1, and
> will have to for the foreseeable future and beyond, but it is much
> easier to build new functionality on top of v2 without struggling to
> hammer it into v1, where it is much easier to generate corner cases.
>
> I fully understand that this is not really a great approach for users
> from the short-term POV because they have to adapt, but I am pretty
> sure they would also appreciate long-term stable and regression-free
> code.
>
> It is my understanding that most HGM users do not really benefit from
> most of hugetlb's features (shared page tables, reservation code,
> etc.), and from that POV it makes some sense to start from a simpler
> code base.
>
> I do recognize there are other users who would like their existing
> hugetlb setups to keep working and to benefit from better support -
> page poisoning comes to mind, but is this really that easy without
> considerable changes on the userspace side?
>
> Memory error recovery problems are really tough to deal with AFAICS.
> Do we have any references to an existing userspace that would be able
> to deal with holes in its hugetlb pages? Or is this a chicken&egg
> problem? Have we exhausted potential (even if coarse) solutions that
> wouldn't require breaking hugetlb page tables?

For VMs, having a 4K hole in the hugetlb page is how Google emulates
memory poison. (We sorta get what we need with a terrible hack to KVM.
HGM is better in every way.)

There aren't a lot of userspace changes needed to make this work. It
comes down to:
1. Enlighten the vCPU threads (the ones that run KVM_RUN) to handle
the MCEERR SIGBUSes, and inject MCEs into the guest when this happens
(a rough sketch follows below).
2. Optionally enlighten any other routines that have a substantial
likelihood of reading poisoned memory (for example, live migration
routines), and skip the poisoned pieces (by adjusting the program
counter in the SIGBUS handler; one way to get that effect is also
sketched below).
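
For #1, a minimal sketch of the vCPU-thread handler. This is not how
Google or QEMU actually structure it; vmm_inject_mce() is a
hypothetical hook standing in for whatever the VMM uses to forward the
error into the guest, and I'm assuming a glibc siginfo_t that exposes
si_addr_lsb:

#include <signal.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical VMM hook: forward the poisoned host virtual address
 * (and the blast radius from si_addr_lsb) to the guest as a virtual
 * MCE.
 */
extern void vmm_inject_mce(void *hva, short addr_lsb);

static void vcpu_sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
        /*
         * BUS_MCEERR_AR: this thread consumed the poison (action
         * required).  BUS_MCEERR_AO: the kernel found the poison
         * asynchronously (action optional).  Anything else is a plain
         * SIGBUS, i.e. a VMM bug.
         */
        if (si->si_code != BUS_MCEERR_AR && si->si_code != BUS_MCEERR_AO)
                abort();
        /*
         * A real VMM must not simply return on BUS_MCEERR_AR (the
         * faulting access would retry and fault again); it has to
         * redirect control back toward its KVM_RUN loop before
         * re-entering the guest.
         */
        vmm_inject_mce(si->si_addr, si->si_addr_lsb);
}

static void install_vcpu_sigbus_handler(void)
{
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = vcpu_sigbus_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGBUS, &sa, NULL);
}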

This isn't theoretical and is implemented today, but I can't easily
show you how Google does it. QEMU does #1 like this[1]. Unmapping in
this case is important; we can't allow the guest to trigger MCEs at
will, as that could easily force the host to crash.
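
For #2, adjusting the program counter directly is architecture-specific;
a portable sketch of the same idea uses sigsetjmp()/siglongjmp() to
bail out of the faulting access. copy_skipping_poison() and the 4K
chunk size are illustrative, and in a real VMM this logic would be
merged into the handler above:

#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <stdlib.h>

static __thread sigjmp_buf skip_env;
static __thread volatile int skip_armed;

static void copy_sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
        /* Jump past the poisoned chunk instead of dying. */
        if (skip_armed && si->si_code == BUS_MCEERR_AR)
                siglongjmp(skip_env, 1);
        abort();        /* unexpected SIGBUS */
}

/*
 * Copy len bytes (4K-aligned) from src to dst, zero-filling any 4K
 * chunk whose source is poisoned -- this is the "hole" in the page.
 * sigsetjmp's savemask=1 matters: siglongjmp restores the signal
 * mask, re-unblocking SIGBUS after each recovery.
 */
static void copy_skipping_poison(char *dst, const char *src, size_t len)
{
        for (size_t off = 0; off < len; off += 4096) {
                skip_armed = 1;
                if (sigsetjmp(skip_env, 1) == 0)
                        memcpy(dst + off, src + off, 4096);
                else
                        memset(dst + off, 0, 4096);
                skip_armed = 0;
        }
}

(The handler would be installed with the same SA_SIGINFO sigaction()
call shown earlier.)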

khugepaged was recently updated[2] to better recover from poison, so
VMs that consumed an error because khugepaged (in the guest) was
collapsing memory will be able to legitimately recover.

I can't speak in detail about how databases recover from memory
poison. Mike, maybe you can share more details?

[1]: https://github.com/qemu/qemu/blob/master/target/i386/kvm/kvm.c#L649
[2]: https://lore.kernel.org/linux-mm/20230329151121.949896-1-jiaqiyan@xxxxxxxxxx/




