Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday

David Hildenbrand <david@xxxxxxxxxx> · Thu, 15 Jun 2023 10:29:54 +0200

On 15.06.23 10:04, Michal Hocko wrote:
On Wed 14-06-23 16:04:58, Mike Kravetz wrote:
On 06/12/23 18:59, David Rientjes wrote:
This week's topic will be a technical brainstorming session on HugeTLB
convergence with the core MM.  This has been discussed most recently in
this thread:
https://lore.kernel.org/linux-mm/ZIOEDTUBrBg6tepk@xxxxxxxxxxxxxxxxxxxx/T/

Thank you David for putting this session together!  And, thanks to everyone
who participated.

Following up on linux-mm with most active participants on Cc (sorry if I
missed someone).   If it makes more sense to continue the above thread,
please move there.

Even though everyone knows that hugetlb is special cased throughout the
core mm, it came to a head with the proposed introduction of HGM.  TBH,
few people in the core mm community paid much attention to HGM when first
introduced.  An LSF/MM session was then dedicated to the discussion of
HGM with the outcome being the suggestion to create a new filesystem/driver
(hugetlb2 if you will) that would satisfy the use cases requiring HGM.
One thing that was not emphasized at LSF/MM is that there are existing
hugetlb users experiencing major issues that could be addressed with HGM:
specifically the issues of memory errors and live migration.  That was
the starting point for recent discussion in the above thread.

I may be wrong, but it appeared the direction of that thread was to
first try and unify some of the hugetlb and core mm code.  Eliminate
some of the special casing.  If hugetlb was less of a special case, then
perhaps HGM would be more acceptable.  That is the impression I (perhaps
incorrectly) had going into today's session.

My impression from the discussion yesterday was that the level of
unification would need to be really large and time consuming in order to
be useful for the HGM patchset to be in a more maintainable form. The
final outcome is quite hard to predict at this stage.

During today's session, we often discussed what would/could be introduced
in a hugetlb v2.  The idea is that this would be the ideal place for HGM.
However, people also made the comparisons to cgroup v1 - v2.  Such a
redesign provides the needed 'clean slate' to do things right, but it
does little for existing users who would be unwilling to quickly move off
existing hugetlb.

We did spend a good chunk of time on hugetlb/core mm unification and
removing special casing.  In some (most) of these cases, the benefit of
removing special cases from core mm would result in adding more code to
hugetlb.  For example: proper type'ing so that hugetlb does not treat
all page table entries as PTEs.  Again, I may be wrong but I think
people were OK with adding more code (and even complexity) to hugetlb
if it eliminated special casing in the core mm.  But, there did not
seem to be a clear consensus especially with the thought that we may
need to double hugetlb code to get types right.

This is primarily your call as a maintainer. If you ask me, hugetlb is
overcomplicated in its current form already. Regressions are not really
seldom when code is added which is a signal we are hitting maintenance
cost walls. This doesn't mean further development is impossible of
course but it is increasingly more costly AFAICS.

Unless I missed something, there was no clear direction at the end of this
session.  I was hoping that we could come up with a plan to address the
issues facing today's hugetlb users.  IMO, there seem to be two options:
1) Start work on hugetlb v2 with the intention that customers will need
    to move to this to address their issues.
2) Incorporate functionality like HGM into existing hugetlb.

I fully agree with all that Michal said.

I'm just going to add that I don't see why anyone would look into a 
hugetlbv2 if we're going to use the motivation of "help existing users" 
to make hugetlb ever-more complicated and special. "existing users" here 
even meaning "people use hugetlb for backing VMs. Now they want to get 
postcopy working with less latency." -- which I consider partially a new 
use case.

So working on adding HGM and concurrently starting a hugetlbv2? I don't 
think that will happen if we decide on adding HGM and proceeding with 
that reasoning about existing users.

As expressed yesterday, I don't see a fast and clean way to make hugetlb 
significantly less special (thanks Willy for the list of odd cases).

Sure, we can talk about adding pte_t safety, but I don't really see a 
way forward to unify page table walking code that way -- there are still 
the (PT) locking, PMD sharing, PTE-cont special cases ... but sure, if 
anybody wants to work on that, why not.

Having said that, like Michal, I acknowledge that it is Mike's call 
regarding the hugetlb code. I, for my part, will push back on any added 
core-mm complexity that adds more special casing for hugetlb. Maybe 
there are easy ways to integrate it nicely and that is not really a concern.

Note that while we've been discussing how HGM would already interfere 
with core-mm, we've not even started discussing how actual 
MADV_SPLIT/MADV_COLLAPSE/page poisoning ... would affect core-mm and 
require special-casing for hugetlb.

I, for my part, will explore a bit the mapcount topic (as time permits) 
and see if we can come up at least with a unified mapcount approach 
(e.g., sub-page mapcount?). But I suspect even figuring that out will 
take quite a while already ...

--
Cheers,

David / dhildenb