Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs

On 8 Jun 2023, at 17:23, Mike Kravetz wrote:

> On 06/08/23 11:50, Yang Shi wrote:
>> On Wed, Jun 7, 2023 at 11:34 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>>>
>>> On 08.06.23 02:02, David Rientjes wrote:
>>>> On Wed, 7 Jun 2023, Mike Kravetz wrote:
>>>>
>>>>>>>>> Are there strong objections to extending hugetlb for this support?
>>>>>>>>
>>>>>>>> I don't want to get too involved in this discussion (busy), but I
>>>>>>>> absolutely agree on the points that were raised at LSF/MM that
>>>>>>>>
>>>>>>>> (A) hugetlb is complicated and very special (many things not integrated
>>>>>>>> with core-mm, so we need special-casing all over the place). [example:
>>>>>>>> what is a pte?]
>>>>>>>>
>>>>>>>> (B) We added a bunch of complexity in the past that some people
>>>>>>>> considered very important (and it was not feature frozen, right? ;) ).
>>>>>>>> Looking back, we might just not have done some of that, or done it
>>>>>>>> differently/cleaner -- better integrated in the core. (PMD sharing,
>>>>>>>> MAP_PRIVATE, a reservation mechanism that still requires preallocation
>>>>>>>> because it fails with NUMA/fork, ...)
>>>>>>>>
>>>>>>>> (C) Unifying hugetlb and the core looks like it's getting more and more
>>>>>>>> out of reach, maybe even impossible with all the complexity we added
>>>>>>>> over the years (well, and keep adding).
>>>>>>>>
>>>>>>>> Sure, HGM for the purpose of better hwpoison handling makes sense. But
>>>>>>>> hugetlb is probably 20 years old and hwpoison handling probably 13 years
>>>>>>>> old. So we managed to get quite far without that optimization.
>>>>>>>>
>>>>
>>>> Sane handling for memory poisoning and optimizations for live migration
>>>> are both much more important for the real-world 1GB hugetlb user, so it
>>>> doesn't quite have that lengthy of a history.
>>>>
>>>> Unfortunately, cloud providers receive complaints about both of these from
>>>> customers.  They are one of the most significant causes for poor customer
>>>> experience.
>>>>
>>>> While people have proposed 1GB THP support in the past, it was nacked, in
>>>> part, because of the suggestion to just use existing 1GB support in
>>>> hugetlb instead :)
>>
>> Yes, but that was before HGM was proposed; we may revisit it.
>>
>
> Adding Zi Yan on CC as the person driving 1G THP.

Thanks.

I did not attend LSF/MM, but the points above mostly look valid. IMHO, if we
keep adding new features to hugetlbfs, we will end up with two parallel
memory systems that replicate each other a lot. Maybe it is time to think
about how to merge hugetlbfs features back into core mm.

From my understanding, the most desirable user-visible feature of hugetlbfs
is that it provides deterministic huge page allocation, since the huge pages
are reserved in a pool up front. If we can preserve that, replacing the
hugetlbfs backend with THP or even just plain folios should be good enough.
Let me know if I am missing any important user-visible feature.
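
As a concrete (untested) userspace sketch of what "deterministic" means
here, consider the snippet below. The 1GB size is just an example; with the
page reserved in the pool up front, the mmap() either succeeds or fails with
ENOMEM right away instead of hitting SIGBUS at fault time:

/*
 * Untested sketch: map one 1GB page from the pre-reserved hugetlb pool,
 * e.g. after
 *   echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#endif

int main(void)
{
	size_t len = 1UL << 30;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
		       -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");	/* pool empty: fails up front, no SIGBUS later */
		return EXIT_FAILURE;
	}

	((char *)p)[0] = 1;	/* fault in the reserved 1GB page */
	munmap(p, len);
	return EXIT_SUCCESS;
}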

On the hugetlbfs backend side, PMD sharing, MAP_PRIVATE, and reduced struct
page storage all look like features core mm might want. Merging these
features back into core mm might be a good first step.

I thought about replacing the hugetlbfs backend with THP (with my 1GB THP
support), but found that not all THP features are necessary for hugetlbfs
users or compatible with existing hugetlbfs. For example, hugetlbfs does not
need transparent page splits, since the user just wants the big page size.
And page splits might not get along with the reduced struct page storage
feature.

In sum, I think we might not need all THP features (page table entry splits
and huge page splits) to replace hugetlbfs; we might just need to enable
core mm to handle folios of any size, so that hugetlb pages become plain
folios that can go as large as 1GB. As a result, hugetlb pages could take
advantage of all core mm features, like hwpoison handling.

>>>
>>> Yes, because I still think that the use for "transparent" (for the user)
>>> nowadays is very limited and not worth the complexity.
>>>
>>> IMHO, what you really want is a pool of large pages (with guarantees
>>> about availability and nodes) and fine control over who gets these
>>> pages. That's what hugetlb provides.
>>
>> The biggest concern for 1G THP is the allocation time. But I don't think
>> it is a no-go to allocate THP from a preallocated pool, for example,
>> CMA.
>
> I seem to remember Zi trying to use CMA for 1G THP allocations.  However, I
> am not sure if using CMA would be sufficient.  IIUC, allocating from CMA could
> still require page migrations to put together a 1G contiguous area.  In a pool
> as used by hugetlb, 1G pages are pre-allocated and sitting in the pool.  The
> downside of such a pool is that the memory can not be used for other purposes
> and sits 'idle' if not allocated.

Yes, I tried that. One big issue is that at free time a 1GB THP needs to be
freed back to a CMA pool instead of the buddy allocator, but a THP can be
split, and after a split it is really hard to tell whether a page came from
a CMA pool or not.

hugetlb pages do not support page splits yet, so the issue might not be
relevant there. But if a THP cannot be split freely, is it still a THP? So
it comes back to my question: do we really want 1GB THP, or do we just want
core mm to handle folios of any size?
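
Just to make the ugliness concrete: the free path would have to probe the
CMA area for every (possibly split) chunk before falling back to the buddy
allocator, roughly like the sketch below. This is illustrative only, not
from any posted series; thp_cma[] is a made-up per-node pool handle:

/*
 * Illustrative only: freeing pages that may or may not have come from a
 * per-node CMA pool used for 1GB THPs.  thp_cma[] is hypothetical.
 */
#include <linux/cma.h>
#include <linux/gfp.h>
#include <linux/mm.h>

static struct cma *thp_cma[MAX_NUMNODES];

static void free_maybe_cma_pages(struct page *page, unsigned int order)
{
	struct cma *pool = thp_cma[page_to_nid(page)];

	/* cma_release() returns false if the pages are outside this area. */
	if (pool && cma_release(pool, page, 1UL << order))
		return;

	/* Not from the CMA pool: ordinary buddy pages. */
	__free_pages(page, order);
}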

>
> Hate to even bring this up, but there are complaints today about 'allocation
> time' of 1GB pages from the hugetlb pool.  This 'allocation time' is actually
> the time it takes to clear/zero 1G of memory.  The only reason I mention it
> is that using something like CMA to allocate 1G pages (at fault time) may add
> unacceptable latency.

One solution I had in mind is that you could zero these 1GB pages at free
time in a worker thread, so that you do not pay the penalty at allocation
time. But it would not help if an allocation comes in right after a page is
freed.
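
Roughly something like the sketch below (hand-wavy and untested, names made
up, not from any series): freed 1GB folios are parked on a list, a worker
zeroes them in the background, and only then would they be returned to the
pool's free list.

/*
 * Untested sketch of deferred zeroing.  free_huge_folio_deferred() and the
 * list names are made up; returning the folio to the pool is omitted.
 */
#include <linux/highmem.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

static LIST_HEAD(to_zero_list);			/* freed, not yet zeroed */
static DEFINE_SPINLOCK(to_zero_lock);

static void zero_freed_folios(struct work_struct *work)
{
	struct folio *folio, *next;
	LIST_HEAD(local);

	spin_lock(&to_zero_lock);
	list_splice_init(&to_zero_list, &local);
	spin_unlock(&to_zero_lock);

	list_for_each_entry_safe(folio, next, &local, lru) {
		list_del_init(&folio->lru);
		/* Zero the whole folio outside the allocation path. */
		folio_zero_range(folio, 0, folio_size(folio));
		/* ... return the folio to the pool's free list here ... */
	}
}
static DECLARE_WORK(zero_work, zero_freed_folios);

/* Called when a 1GB page is released back to the pool. */
static void free_huge_folio_deferred(struct folio *folio)
{
	spin_lock(&to_zero_lock);
	list_add_tail(&folio->lru, &to_zero_list);
	spin_unlock(&to_zero_lock);
	schedule_work(&zero_work);
}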

>
>>>
>>> In contrast to THP, you don't want to allow for
>>> * Partially mmap, mremap, munmap, mprotect them
>>> * Partially sharing them / COW'ing them
>>> * Partially mixing them with other anon pages (MADV_DONTNEED + refault)
>>
>> IIRC, QEMU treats hugetlbfs as having a 2M block size; we should be able
>> to teach QEMU to treat tmpfs + THP as having a 2M block size too. I used
>> to have a patch to make stat.st_blksize return the THP size for tmpfs
>> (89fdcd262fd4 "mm: shmem: make stat.st_blksize return huge page size if
>> THP is on"). So when applications are aware of the 2M or 1G page/block
>> size, hopefully that helps reduce the partial mapping issues. But I'm not
>> an expert on QEMU, so I may be missing something.
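
(As an aside, the application side of that is small. Something like the
untested sketch below, where the path and sizes are only examples: stat the
tmpfs file, take st_blksize as the block/huge-page size, and round the
region up to a multiple of it.)

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	/* Path is just an example of a tmpfs file backing guest memory. */
	int fd = open("/dev/shm/guest-mem", O_RDWR | O_CREAT, 0600);
	struct stat st;

	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;

	size_t blk = st.st_blksize;		/* huge page size if THP is on */
	size_t want = 3UL << 30;		/* e.g. 3GB of guest memory */
	size_t len = (want + blk - 1) & ~(blk - 1);	/* round up to blk */

	if (ftruncate(fd, len) < 0)
		return 1;

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	printf("block size %zu, mapped %zu bytes\n", blk, len);
	munmap(p, len);
	return 0;
}
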
>>
>>> * Exclude them from some features (KSM/swap)
>>> * (swap them out and eventually split them for that)
>>
>> We have "noswap" mount option for tmpfs now, so swap is not a problem.
>>
>> But we may lose some features, for example, PMD sharing, hugetlb
>> cgroup, etc. Not sure whether they are a showstopper or not.
>>
>> So it sounds easier to have 1G THP than HGM IMHO if I don't miss
>> something vital.
>
> I have always wanted to experiment with having THP use a pre-allocated
> pool for huge page allocations.  Of course, this adds the complication
> of what to do when the pool is exhausted.
>
> Perhaps Zi has performed such experiments?

Using CMA allocation is a similar experiment, but when the CMA pools are
exhausted, 1GB THP allocation will fail. We could try compaction to get more
free 1GB ranges, but that might take a prohibitively long time and could
still fail in the end.

In the end, let me ask this again: do we want 1GB THP to replace hugetlb, or
do we want to enable core mm to handle folios of any size and turn a 1GB
hugetlb page into a 1GB folio?

--
Best Regards,
Yan, Zi
