Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs

On 06/08/23 11:50, Yang Shi wrote:
> On Wed, Jun 7, 2023 at 11:34 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >
> > On 08.06.23 02:02, David Rientjes wrote:
> > > On Wed, 7 Jun 2023, Mike Kravetz wrote:
> > >
> > >>>>>> Are there strong objections to extending hugetlb for this support?
> > >>>>>
> > >>>>> I don't want to get too involved in this discussion (busy), but I
> > >>>>> absolutely agree on the points that were raised at LSF/MM that
> > >>>>>
> > >>>>> (A) hugetlb is complicated and very special (many things not integrated
> > >>>>> with core-mm, so we need special-casing all over the place). [example:
> > >>>>> what is a pte?]
> > >>>>>
> > >>>>> (B) We added a bunch of complexity in the past that some people
> > >>>>> considered very important (and it was not feature frozen, right? ;) ).
> > >>>>> Looking back, we might just not have done some of that, or done it
> > >>>>> differently/cleaner -- better integrated in the core. (PMD sharing,
> > >>>>> MAP_PRIVATE, a reservation mechanism that still requires preallocation
> > >>>>> because it fails with NUMA/fork, ...)
> > >>>>>
> > >>>>> (C) Unifying hugetlb and the core looks like it's getting more and more
> > >>>>> out of reach, maybe even impossible with all the complexity we added
> > >>>>> over the years (well, and keep adding).
> > >>>>>
> > >>>>> Sure, HGM for the purpose of better hwpoison handling makes sense. But
> > >>>>> hugetlb is probably 20 years old and hwpoison handling probably 13 years
> > >>>>> old. So we managed to get quite far without that optimization.
> > >>>>>
> > >
> > > Sane handling for memory poisoning and optimizations for live migration
> > > are both much more important for the real-world 1GB hugetlb user, so it
> > > doesn't quite have that lengthy of a history.
> > >
> > > Unfortunately, cloud providers receive complaints about both of these from
> > > customers.  They are one of the most significant causes for poor customer
> > > experience.
> > >
> > > While people have proposed 1GB THP support in the past, it was nacked, in
> > > part, because of the suggestion to just use existing 1GB support in
> > > hugetlb instead :)
> 
> Yes, but that was before HGM was proposed; we may revisit it.
> 

Adding Zi Yan on CC as the person driving 1G THP.

> >
> > Yes, because I still think that the value of being "transparent" (for the
> > user) is nowadays very limited and not worth the complexity.
> >
> > IMHO, what you really want is a pool of large pages (with guarantees
> > about availability and nodes) and fine control over who gets these
> > pages. That's what hugetlb provides.
> 
> The biggest concern for 1G THP is the allocation time. But I don't think
> it is a no-go if THP can be allocated from a preallocated pool, for
> example, CMA.

I seem to remember Zi trying to use CMA for 1G THP allocations.  However, I
am not sure if using CMA would be sufficient.  IIUC, allocating from CMA could
still require page migrations to put together a 1G contiguous area.  In a pool
as used by hugetlb, 1G pages are pre-allocated and sitting in the pool.  The
downside of such a pool is that the memory cannot be used for other purposes
and sits 'idle' if not allocated.

I hate to even bring this up, but there are complaints today about the
'allocation time' of 1GB pages from the hugetlb pool.  This 'allocation time'
is actually the time it takes to clear/zero 1G of memory.  The only reason I
mention it is that using something like CMA to allocate 1G pages (at fault
time) may add unacceptable latency on top of that.
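
To illustrate, here is a rough userspace sketch (not from any of the threads
above) that times the first touch of a 1G hugetlb page; almost all of that
time is the kernel clearing the page at fault time.  It assumes a 1G page is
already reserved in the pool, and spells out the MAP_HUGE_1GB encoding by
hand in case the libc headers lack it:

/* Rough sketch: time the first-touch fault of a 1G hugetlb page. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB   (30 << MAP_HUGE_SHIFT)
#endif

#define SZ_1G (1024UL * 1024 * 1024)

int main(void)
{
    struct timespec t0, t1;
    char *p;

    /* Needs a 1G page in the hugetlb pool, e.g. via nr_hugepages. */
    p = mmap(NULL, SZ_1G, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
             -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    p[0] = 1;   /* first touch: the fault clears the entire 1G page */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("first-touch fault took %.1f ms\n",
           (t1.tv_sec - t0.tv_sec) * 1e3 +
           (t1.tv_nsec - t0.tv_nsec) / 1e6);
    return 0;
}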

> >
> > In contrast to THP, you don't want to allow for
> > * Partially mmap, mremap, munmap, mprotect them
> > * Partially sharing them / COW'ing them
> > * Partially mixing them with other anon pages (MADV_DONTNEED + refault)
> 
> IIRC, QEMU treats hugetlbfs as having a 2M block size; we should be able
> to teach QEMU to treat tmpfs + THP as having a 2M block size too. I had a
> patch to make stat.st_blksize return the THP size for tmpfs (89fdcd262fd4
> "mm: shmem: make stat.st_blksize return huge page size if THP is on").
> So when applications are aware of the 2M or 1G page/block size, hopefully
> that helps reduce partial mappings. But I'm not a QEMU expert, so I may
> be missing something.
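
From the application side that just looks like a stat() call; a minimal
sketch (the path below is made up, and this relies on the st_blksize
behavior from the commit mentioned above):

/* Sketch: discover the huge page / block size of a tmpfs file via
 * stat(), so mappings and I/O can be sized/aligned accordingly.
 * The path below is only an example. */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;

    if (stat("/mnt/tmpfs/guest-mem", &st)) {
        perror("stat");
        return 1;
    }

    /* With THP enabled on tmpfs (per 89fdcd262fd4), this reports the
     * huge page size (e.g. 2M) instead of PAGE_SIZE. */
    printf("block size: %ld\n", (long)st.st_blksize);
    return 0;
}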
> 
> > * Exclude them from some features like KSM/swap
> > * (swap them out and eventually split them for that)
> 
> We have the "noswap" mount option for tmpfs now, so swap is not a problem.
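
For completeness, a sketch of that mount from C (the mount point and the
huge=always choice are just examples; "noswap" needs a recent kernel):

/* Sketch: mount a THP-backed tmpfs that is excluded from swap.
 * Equivalent to: mount -t tmpfs -o huge=always,noswap tmpfs /mnt/thp */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    if (mount("tmpfs", "/mnt/thp", "tmpfs", 0, "huge=always,noswap")) {
        perror("mount");
        return 1;
    }
    return 0;
}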
> 
> But we may lose some features, for example, PMD sharing, hugetlb
> cgroup, etc. Not sure whether they are a showstopper or not.
> 
> So IMHO it sounds easier to have 1G THP than HGM, unless I'm missing
> something vital.

I have always wanted to experiment with having THP use a pre-allocated
pool for huge page allocations.  Of course, this adds the complication
of what to do when the pool is exhausted.
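
(For comparison, the pre-allocation we rely on today is just a write to the
per-size sysfs file; a trivial sketch, with the count made up:)

/* Sketch: pre-allocate 1G pages into the existing hugetlb pool by
 * writing to sysfs.  A THP-side pool would need an analogous knob,
 * plus a policy for what to do when the pool runs dry. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/kernel/mm/hugepages/hugepages-1048576kB/"
                    "nr_hugepages", "w");

    if (!f) {
        perror("fopen");
        return 1;
    }
    fprintf(f, "8\n");      /* request eight 1G pages (example count) */
    return fclose(f) ? 1 : 0;
}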

Perhaps Zi has performed such experiments?
-- 
Mike Kravetz



