Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64

Zi Yan <ziy@xxxxxxxxxx> · Mon, 5 Oct 2020 14:05:17 -0400

On 5 Oct 2020, at 13:39, David Hildenbrand wrote:

>>>> considering that 2MB THP have turned out to be quite a pain but
>>>> situation has settled over time. Maybe our current code base is prepared
>>>> for that much better.
>>
>> I am planning to refactor my code further to reduce the amount of
>> the added code, since PUD THP is very similar to PMD THP. One thing
>> I want to achieve is to enable split_huge_page to split a page of any
>> order into a group of pages of any lower order. A lot of code in this
>> patchset is replicating the same behavior of PMD THP at PUD level.
>> It might be possible to deduplicate most of the code.
>>
>>>>
>>>> Exposing that interface to the userspace is a different story of course.
>>>> I do agree that we likely do not want to be very explicit about that.
>>>> E.g. an interface for address space defragmentation without any more
>>>> specifics sounds like a useful feature to me. It will be up to the
>>>> kernel to decide which huge pages to use.
>>>
>>> Yes, I think one important feature would be that we don't end up placing
>>> a gigantic page where only a handful of pages are actually populated
>>> without a green light from the application - because that's what some user
>>> space applications care about (not consuming more memory than intended.
>>> IIUC, this is also what this patch set does). I'm fine with placing
>>> gigantic pages if it really just "defragments" the address space layout,
>>> without filling unpopulated holes.
>>>
>>> Then, this would be mostly invisible to user space, and we really
>>> wouldn't have to care about any configuration.
>>
>>
>> I agree that the interface should require no configuration for most
>> users. But I also wonder why we have hugetlbfs to allow users to
>> specify different page sizes, which seems to go against the discussion
>> above. Are we assuming advanced users should always use hugetlbfs instead
>> of THPs?
>
> Well, with hugetlbfs you get real control over which page sizes to use.
> No mixture, guarantees.
>
> In some environments you might want to control which application gets
> which pagesize. I know of database applications and hypervisors that
> sometimes really want 2MB huge pages instead of 1GB huge pages. And
> sometimes you really want/need 1GB huge pages (e.g., low-latency
> applications, real-time KVM, ...).
>
> Simple example: KVM with postcopy live migration
>
> While 2MB huge pages work reasonably fine, migrating 1GB gigantic pages
> on demand (via userfaultfd) is painfully slow / impractical.

The real control of hugetlbfs comes from the interfaces provided by
the kernel. If the kernel provided similar interfaces to control the page
sizes of THPs, they should work the same as hugetlbfs. Mixing page sizes
usually comes from system memory fragmentation, and hugetlbfs does not have
this mixture because of its special allocation pools, not because of the
code itself. If THPs were allocated from the same pools, they would act
the same as hugetlbfs. What am I missing here?
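
To make the comparison concrete, here is a rough userspace sketch of the
two interfaces as they exist today: an anonymous MAP_HUGETLB mapping, where
the caller names the page size, and a THP mapping, where the caller can
only hint with MADV_HUGEPAGE. The helper names (huge_size_flag,
map_hugetlb, map_thp) are made up for illustration and are not from the
patchset; the hugetlb mapping is expected to fail when no pages of the
requested size are reserved in the pool.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Fallback in case the libc headers do not expose the shift value. */
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif

/*
 * Encode log2(page_size) into the mmap() flag bits, the same way
 * MAP_HUGE_2MB and MAP_HUGE_1GB are defined.
 */
static int huge_size_flag(size_t page_size)
{
	int shift = 0;

	/* Huge page sizes are powers of two. */
	assert(page_size && !(page_size & (page_size - 1)));
	while (((size_t)1 << shift) < page_size)
		shift++;
	return shift << MAP_HUGE_SHIFT;
}

/* hugetlb style: the caller picks the page size; no mixing, or failure. */
static void *map_hugetlb(size_t len, size_t page_size)
{
	return mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
		    huge_size_flag(page_size), -1, 0);
}

/* THP style: the caller only hints; the kernel decides what backs it. */
static void *map_thp(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p != MAP_FAILED)
		madvise(p, len, MADV_HUGEPAGE); /* best effort, may be ignored */
	return p;
}
```

A caller would use map_hugetlb(len, 1UL << 30) to insist on 1GB pages and
check for MAP_FAILED, while map_thp(len) always returns a usable mapping
but gives no say over the page size that ends up backing it.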

I just do not get why hugetlbfs is so special that it can have fine-grained
page size control when normal pages cannot. The “it should be invisible
to userspace” argument suddenly does not hold for hugetlbfs.

--
Best Regards,
Yan Zi