Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs

On Thu, Jun 08, 2023 at 09:57:34PM -0400, Zi Yan wrote:
> On the hugetlbfs backend, PMD sharing, MAP_PRIVATE, and reducing struct page
> storage all look like features core mm might want. Merging these features back
> to core mm might be a good first step.
> 
> I thought about replacing the hugetlbfs backend with THP (with my 1GB THP support),
> but found that not all THP features are necessary for hugetlbfs users or
> compatible with existing hugetlbfs. For example, hugetlbfs does not need
> transparent page split, since the user just wants that big page size. And page
> split might not get along with the struct page storage reduction feature.

But with HGM, we actually do want to split the page because part of it
has hit a hwpoison event.  What these customers don't need is support
for misaligned mappings or partial mappings.  If they map a 1GB page,
they do it 1GB aligned and in multiples of 1GB.  And they tell us in
advance that's what they're doing.
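
For concreteness, the usage model is roughly the following userspace
sketch (it assumes a 1GB hugetlb pool has already been populated, e.g.
via hugepagesz=1G hugepages=N on the command line or by writing to
/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_1GB			/* 30 == log2(1GB), MAP_HUGE_SHIFT == 26 */
#define MAP_HUGE_1GB	(30U << 26)
#endif

#define SZ_1G	(1UL << 30)

/*
 * Map 'nr' gigantic pages from the hugetlb pool.  The mapping is both
 * 1GB-aligned and a multiple of 1GB, so there is no misaligned or
 * partial case for the kernel to worry about.
 */
static void *map_gigantic(size_t nr)
{
	return mmap(NULL, nr * SZ_1G, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
		    -1, 0);
}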

> In sum, I think we might not need all THP features (page table entry split
> and huge page split) to replace hugetlbfs; we might just need to enable
> core mm to handle folios of any size, so hugetlb pages are just folios that
> can go as large as 1GB. As a result, hugetlb pages can take advantage of
> all core mm features, like hwpoison.

Yes, this is more or less in line with my work.  And yet there are still
problems to solve:

 - mapcount (discussed elsewhere in the thread)
 - page cache index scaling (Sid is working on this)
 - page table sharing (mshare)
 - reserved memory

> > I seem to remember Zi trying to use CMA for 1G THP allocations.  However, I
> > am not sure if using CMA would be sufficient.  IIUC, allocating from CMA could
> > still require page migrations to put together a 1G contiguous area.  In a pool
> > as used by hugetlb, 1G pages are pre-allocated and sitting in the pool.  The
> > downside of such a pool is that the memory can not be used for other purposes
> > and sits 'idle' if not allocated.
> 
> Yes, I tried that. One big issue is that at free time a 1GB THP needs to be freed
> back to a CMA pool instead of the buddy allocator, but a THP can be split, and
> after the split it is really hard to tell whether a page came from a CMA pool or not.
> 
> hugetlb pages do not support page split yet, so the issue might not be
> relevant. But if a THP cannot be split freely, is it still a THP? So it comes
> back to my question: do we really want 1GB THP, or just a core mm that can
> handle folios of any size?

We definitely want the core MM to be able to handle folios of arbitrary
size.  There is a pile of places still to fix; e.g., if you map a
misaligned 1GB page, you can see N PTEs followed by 511 PMDs followed by
512-N PTEs.  There are a lot of places that assume pmd_page() returns
both the head page and the precise page, and those will need to be fixed.
There's a reason I limit page cache to PMD_ORDER today.
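
To put numbers on that claim, here is a quick sketch of the split for a
1GB range whose start is page-aligned but not PMD-aligned (x86-style
4K/2M/1G sizes assumed):

#include <stdio.h>

#define PTE_SZ	(1UL << 12)	/* 4KB */
#define PMD_SZ	(1UL << 21)	/* 2MB */
#define PUD_SZ	(1UL << 30)	/* 1GB */

/* Page table entries needed to map a 1GB folio starting at 'addr'. */
static void layout(unsigned long addr)
{
	/* bytes from addr up to the next 2MB boundary */
	unsigned long head = (PMD_SZ - (addr & (PMD_SZ - 1))) & (PMD_SZ - 1);
	unsigned long ptes = head / PTE_SZ;			/* leading PTEs, N */
	unsigned long pmds = (PUD_SZ - head) / PMD_SZ;		/* full PMDs */
	unsigned long tail = (PUD_SZ - head) % PMD_SZ / PTE_SZ;	/* trailing PTEs, 512 - N */

	printf("%lu PTEs, %lu PMDs, %lu PTEs\n", ptes, pmds, tail);
}

int main(void)
{
	layout(0x40000000);	/* 1GB-aligned: 0 / 512 / 0 (really a single PUD) */
	layout(0x40001000);	/* off by one page: 511 / 511 / 1 */
	layout(0x40100000);	/* off by 1MB: 256 / 511 / 256 */
	return 0;
}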

> > Hate to even bring this up, but there are complaints today about 'allocation
> > time' of 1GB pages from the hugetlb pool.  This 'allocation time' is actually
> > the time it takes to clear/zero 1G of memory.  The only reason I mention it is
> > that using something like CMA to allocate 1G pages (at fault time) may add
> > unacceptable latency.
> 
> One solution I had in mind is that you could zero these 1GB pages at free
> time in a worker thread, so that you do not pay the penalty at page allocation
> time. But it would not work if the allocation comes right after a page is
> freed.

It rather goes against the principle that the user should pay the cost.
If we got the zeroing for free, that'd be one thing, but it feels like
we're robbing Peter (of CPU time) to pay Paul.
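
For what it's worth, the deferred-zeroing version would look roughly
like the untested sketch below; zeroed_pool_add() is a made-up
placeholder for whatever "known to be zero" pool bookkeeping it would
need, not an existing interface.  And it doesn't make the cost go away,
it just moves it onto a kworker:

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct zero_work {
	struct work_struct work;
	struct folio *folio;
};

static void zero_folio_workfn(struct work_struct *work)
{
	struct zero_work *zw = container_of(work, struct zero_work, work);
	unsigned long i, nr = folio_nr_pages(zw->folio);

	/* Clear the gigantic folio one base page at a time. */
	for (i = 0; i < nr; i++) {
		clear_highpage(folio_page(zw->folio, i));
		cond_resched();
	}

	zeroed_pool_add(zw->folio);	/* placeholder, see above */
	kfree(zw);
}

static void queue_zero_on_free(struct folio *folio)
{
	struct zero_work *zw = kmalloc(sizeof(*zw), GFP_KERNEL);

	if (!zw) {
		/* No memory for the work item: just clear it here. */
		unsigned long i;

		for (i = 0; i < folio_nr_pages(folio); i++)
			clear_highpage(folio_page(folio, i));
		zeroed_pool_add(folio);
		return;
	}

	INIT_WORK(&zw->work, zero_folio_workfn);
	zw->folio = folio;
	queue_work(system_unbound_wq, &zw->work);
}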

> At the end, let me ask this again: do we want 1GB THP to replace hugetlb,
> or to enable core mm to handle folios of any size and turn a 1GB hugetlb
> page into a 1GB folio?

I don't see this as an either-or.  The core MM needs to be enhanced to
handle arbitrarily sized folios, but the hugetlbfs interface needs to be
kept around forever.  What we need from a maintainability point of view
is to remove how special hugetlbfs is.





