Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64

On 2 Mar 2021, at 3:55, David Hildenbrand wrote:

>>>
>>> However, I don't follow how this is actually really feasible in big scale. You could only ever collapse into a 1GB THP if you happen to have 1GB consecutive 2MB THP / 4k already. Sounds to me like this happens when the stars align.
>>
>> Both the process_madvise() approach and my proposal require page migration to bring back THPs, since, like you said, having consecutive pages ready is extremely rare. IIUC, the process_madvise() approach reuses khugepaged code to collapse huge pages,
>> namely first allocating a 2MB THP, then copying data over, and finally freeing the old base pages. My proposal would migrate pages within
>> a virtual address range (>1GB and 1GB-aligned) to get all physical pages contiguous, then promote the resulting 1GB of consecutive
>> pages to a 1GB THP. No new page allocation is needed.
>
> I am missing how we can ever reliably form 1GB pages (esp. after the system ran for a while) without any kind of fragmentation avoidance / defragmentation mechanism that is aware of gigantic pages. For THP, pageblocks+compaction serve that purpose.

We may not have anything as reliable as pageblocks+compaction for THP, but we can improve on the existing code after 1GB THP
is supported and used. Otherwise, why bother adding a new mechanism when there is no user?

I did an experiment on my 32GB desktop, like Roman suggested in another email: I used as much memory as possible while running
“git gc” on a Linux repo at the same time to fragment memory, and repeated the process three times with three different Linux repos.
I then checked all pageblock types with my custom kernel module (https://github.com/x-y-z/kernel-modules) and discovered that
the system still had 11 1GB Movable pageblocks (consecutive pageblocks with the same migratetype are grouped as large as
possible). This means that even after heavy memory fragmentation the system is still able to form 11 1GB THPs, which is >30% of the
total possible 1GB THPs. I think that is a reasonably good number, since we are not going to form 1GB THPs for everything running
in the system.
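
For anyone who wants a quick, much coarser cross-check without the module, /proc/pagetypeinfo already exposes per-migratetype
counts. A minimal reader is below, with the caveat that it only shows free-page counts per order and pageblock counts per
migratetype; it cannot see whether pageblocks are physically consecutive, which is why the module is needed for the 1GB grouping.

/* Coarse user-space cross-check: dump the Movable rows of
 * /proc/pagetypeinfo.  Unlike the custom module above, this cannot
 * tell whether pageblocks are physically adjacent.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[512];
	FILE *f = fopen("/proc/pagetypeinfo", "r");	/* root-only on recent kernels */

	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (strstr(line, "Movable") || strstr(line, "Free pages count"))
			fputs(line, stdout);
	fclose(f);
	return 0;
}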

>>
>> Both approaches would need user-space invocation, assuming either the application itself wants to get THPs for a specific region or a user-space daemon would do this for a group of applications, instead of waiting for khugepaged to slowly (4096 pages every 10s) scan and do huge page collapse. Users will pay the cost of getting THPs. This also means THPs are not completely transparent to users, but I think that should be fine when users explicitly invoke these two methods to get THPs for better performance.
>
> Here is the problem: these *advises* are not persistent. Assume your system has to swap and has to split the THP + write it to the swap backend. The gigantic page is lost for that part of the application. When loading the individual 4k pages out of swap there is no guarantee that we can form a 1 GB page again - and how should we know that the application wanted a 1 GB page at that position?

VM_HUGEPAGE will be set for that VMA, and I am planning to add a new field to the VMA to indicate what huge page size we want in
that VMA. As for splitting a 1GB THP due to swapping, that happens to 2MB THP too. Either khugepaged or a user daemon calling
process_madvise() could recover the 1GB THP.
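
For illustration, the user-facing side is just an aligned anonymous mapping plus madvise(MADV_HUGEPAGE); the per-VMA huge page
size field would extend that hint. A minimal sketch follows (the flow is mine for illustration, not code from the patch set;
the in-place promotion itself happens in the kernel, nothing here allocates a gigantic page directly):

/* Sketch: carve out a 1GB-aligned anonymous region and set VM_HUGEPAGE
 * on it via madvise(MADV_HUGEPAGE).
 */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define GB (1UL << 30)

int main(void)
{
	size_t len = 2 * GB;	/* over-allocate so a 1GB-aligned window exists */
	char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	char *aligned = (char *)(((uintptr_t)raw + GB - 1) & ~(GB - 1));
	if (aligned > raw)			/* trim the unaligned head */
		munmap(raw, aligned - raw);
	if (aligned + GB < raw + len)		/* trim the tail */
		munmap(aligned + GB, raw + len - (aligned + GB));

	if (madvise(aligned, GB, MADV_HUGEPAGE))	/* sets VM_HUGEPAGE */
		perror("madvise");

	memset(aligned, 1, GB);		/* populate the range with base/2MB pages */
	return 0;
}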

>
> How would the application know that the advise was now dropped and that
> a) There is no 1GB page anymore
> b) It would have to re-issue the advise

I expect a daemon, either khugepaged or a user one calling process_madvise(), would rescan the application and re-form 1GB pages.
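
The daemon side would essentially be a pidfd_open() plus process_madvise() loop over the regions of interest. A sketch is below;
note that the collapse advice value (MADV_COLLAPSE and its number) is a placeholder, since no such advice exists in mainline yet,
and the syscall numbers are the x86_64 ones in case the libc headers predate them:

/* Sketch of the daemon side: ask the kernel to (re)collapse huge pages
 * in another process's address range via process_madvise().
 */
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434		/* x86_64 */
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440		/* x86_64 */
#endif
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25		/* placeholder advice value */
#endif

int collapse_range(pid_t pid, void *addr, size_t len)
{
	int pidfd = syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -1;

	struct iovec iov = { .iov_base = addr, .iov_len = len };
	long ret = syscall(SYS_process_madvise, pidfd, &iov, 1,
			   MADV_COLLAPSE, 0);
	close(pidfd);
	return ret < 0 ? -1 : 0;
}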

>
> Similarly, I am not convinced that the future of khugepaged is in user space.

The issue with khugepaged is that it runs at a very slow rate, 4096 pages every 10s, because the kernel does not want to consume
too much CPU time without knowing the benefit of forming THPs. A user daemon can run at a faster pace to form 2MB THPs or
1GB THPs from the application memory regions where users really want huge pages.
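
Those defaults are visible in the khugepaged sysfs knobs; a trivial way to confirm the “4096 pages every 10s” rate:

/* Print the two khugepaged knobs behind the default scan rate. */
#include <stdio.h>

static void show(const char *path)
{
	char buf[64];
	FILE *f = fopen(path, "r");

	if (f && fgets(buf, sizeof(buf), f))
		printf("%s: %s", path, buf);
	if (f)
		fclose(f);
}

int main(void)
{
	show("/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan");
	show("/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs");
	return 0;
}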

>
>>
>> The difference in my proposal is that it does not need a 1GB THP allocation, so there are no special requirements like using CMA
>> or increasing MAX_ORDER in the buddy allocator to allow 1GB page allocation. It makes creating THPs with orders > MAX_ORDER possible
>> without other intrusive changes.
>
> Anything that relies on large allocations succeeding purely because "ZONE_NORMAL memory is usually not fragmented after boot" is broken by design. That's why we have CMA, it can give guarantees (well, once we fix all remaining issues :) ).

It seems that you are suggesting I should use CMA for 1GB THP allocation, since CMA can give guarantees for large allocations.
Using CMA for 1GB THP would be a great first step to get 1GB THP working; then we can replace it with other large allocation
mechanisms later.
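
For completeness, a reservation made that way (for example via the existing hugetlb_cma= boot option, which already backs
gigantic hugetlb pages with CMA) shows up in /proc/meminfo; a quick check:

/* Print the CmaTotal/CmaFree counters so a CMA reservation is easy to
 * verify after boot.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "CmaTotal", 8) || !strncmp(line, "CmaFree", 7))
			fputs(line, stdout);
	fclose(f);
	return 0;
}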


—
Best Regards,
Yan Zi
