Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64

On 04.03.21 00:42, Zi Yan wrote:
On 2 Mar 2021, at 3:55, David Hildenbrand wrote:


However, I don't follow how this is actually really feasible in big scale. You could only ever collapse into a 1GB THP if you happen to have 1GB consecutive 2MB THP / 4k already. Sounds to me like this happens when the stars align.

Both the process_madvise() approach and my proposal require page migration to bring back THPs, since, as you said, having consecutive pages ready is extremely rare. IIUC, the process_madvise() approach reuses khugepaged code to collapse huge pages,
namely first allocating a 2MB THP, then copying data over, and finally freeing the old base pages. My proposal would migrate pages within
a virtual address range (>1GB and 1GB-aligned) to make all physical pages contiguous, then promote the resulting 1GB of consecutive
pages to a 1GB THP. No new page allocation is needed.
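
To make the ">1GB and 1GB-aligned" condition concrete, here is a minimal sketch of how the candidate 1GB chunks inside a virtual address range could be enumerated. The function name and the plain-integer address representation are assumptions for illustration only; this is not code from the patch series.

```python
GIB = 1 << 30  # 1GB, the PUD-level huge page size on x86_64


def aligned_1gb_ranges(start, end):
    """Return (start, end) for every 1GB-aligned, 1GB-sized chunk inside [start, end).

    Hypothetical helper: addresses are plain non-negative integers.
    """
    first = (start + GIB - 1) & ~(GIB - 1)  # round the start up to a 1GB boundary
    last = end & ~(GIB - 1)                 # round the end down to a 1GB boundary
    return [(addr, addr + GIB) for addr in range(first, last, GIB)]
```

A range that is smaller than 1GB, or that does not span a full aligned 1GB window, yields an empty list, so there is nothing to promote.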

I am missing how we can ever reliably form 1GB pages (esp. after the system ran for a while) without any kind of fragmentation avoidance / defragmentation mechanism that is aware of gigantic pages. For THP, pageblocks+compaction serve that purpose.

We may not have that as reliable as pageblocks+compaction for THP, but we are able to improve over existing code after 1GB THP
is supported and used. Otherwise, why bother adding a new mechanism when there is no user?

I did an experiment on my 32GB desktop like Roman suggested in another email, using as much memory as possible and running
“git gc” on Linux repo at the same time to fragment memory. I repeated the process three times with three different Linux repos.
I checked all pageblock types with my custom kernel module (https://github.com/x-y-z/kernel-modules) and discovered that
the system still has 11 1GB Movable pageblocks (consecutive pageblocks with the same migratetype are grouped as large as
possible). This means after heavy memory fragmentation the system is still able to form 11 1GB THPs, which is >30% of total
possible 1GB THPs. I think it is a reasonably good number since we are not going to form 1GB THPs for everything running
in the system.
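
The counting described above can be sketched roughly as follows. This is not the kernel module linked in the mail; it assumes 2MB pageblocks (so 512 per 1GB), assumes the per-pageblock migratetypes are already available as a list of strings, and only counts runs that are 1GB-aligned. All names are hypothetical.

```python
PAGEBLOCKS_PER_GIB = (1 << 30) // (2 << 20)  # 512, assuming 2MB pageblocks


def count_movable_1gb_regions(migratetypes):
    """Count 1GB-aligned windows in which every pageblock is "Movable"."""
    count = 0
    last_start = len(migratetypes) - PAGEBLOCKS_PER_GIB
    for i in range(0, last_start + 1, PAGEBLOCKS_PER_GIB):
        window = migratetypes[i:i + PAGEBLOCKS_PER_GIB]
        if all(t == "Movable" for t in window):
            count += 1
    return count


def fraction_of_possible(regions, total_gib):
    """Share of all possible 1GB regions that are still fully movable."""
    return regions / total_gib
```

With the numbers from the experiment, `fraction_of_possible(11, 32)` is 0.34375, which matches the ">30%" figure quoted above.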


I'm sorry, but I don't think this is a relevant reproducer for fragmentation with unmovable allocations.

I feel like repeating myself: Anything that relies on large allocations succeeding purely because "ZONE_NORMAL memory is usually not fragmented after boot" is broken by design.

If your approach does not have any such mechanism, it's broken by design and only works in some very limited setups / under very limited conditions. We don't want anything like that when it severely affects the code ("49 patches").


Both approaches would need user-space invocation, assuming either the application itself wants to get THPs for a specific region or a user-space daemon would do this for a group of applications, instead of waiting for khugepaged to slowly (4096 pages every 10s) scan and collapse huge pages. The user will pay the cost of getting THPs. This also means THPs are not completely transparent to the user, but I think that should be fine when users explicitly invoke these two methods to get THPs for better performance.

Here is the problem: these *advises* are not persistent. Assume your system has to swap and has to split the THP + write it to the swap backend. The gigantic page is lost for that part of the application. When loading the individual 4k pages out of swap there is no guarantee that we can form a 1 GB page again - and how should we know that the application wanted a 1 GB page at that position?

VM_HUGEPAGE will be set for that VMA and I am planning to add a new field to the VMA to indicate what huge page size we want in
that VMA. As for a 1GB THP being split due to swapping, that happens to THP too. Either khugepaged or a user daemon calling
process_madvise() could recover the 1GB THP.


Sorry, but for any kind of advise like "please collapse this into a 1GB page", splitting VMAs does not make any sense. Then, you can just let the application mmap(MAP_HUGE ...) that part instead - you also get a separate VMA and need the mmap lock in write.

Ordinary THP can be recovered quite well because *we have actual mechanisms in place that try to form contiguous 2MB (->pageblock) chunks*.


How would the application know that the advise was now dropped and that
a) There is no 1GB page anymore
b) It would have to re-issue the advise

I expected a daemon, either khugepaged or a user one calling process_madvise(), would rescan the application and re-form 1GB pages.


From user space? How should it know about whether that application has hugepages enabled/disabled for some regions? How should it know if we have to special case uffd?

I repeat: I am not convinced that the future of khugepaged is in user space. It might be valuable for some minor hints from the application itself - "please collapse this into a THP if possible", but not more - IMHO, but not across applications.


Similarly, I am not convinced that the future of khugepaged is in user space.

The issue with khugepaged is that it runs at a very slow rate, 4096 pages every 10s, because the kernel does not want to consume
too much CPU without knowing the benefit of forming THPs. A user daemon can run at a faster pace to form THPs or
1GB THPs from application memory regions for which users really want huge pages.
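
For a sense of scale, the quoted rate of 4096 pages every 10s can be turned into a lower bound on how long one full scan takes. The sketch below assumes 4KB base pages and counts only the sleep between batches, ignoring the scan and collapse work itself; the function name is made up for illustration.

```python
def khugepaged_scan_seconds(mem_bytes, pages_to_scan=4096, sleep_seconds=10,
                            page_size=4096):
    """Lower bound on the time to scan mem_bytes once at the quoted rate."""
    pages = mem_bytes // page_size
    batches = -(-pages // pages_to_scan)  # ceiling division
    return batches * sleep_seconds


GIB = 1 << 30
print(khugepaged_scan_seconds(1 * GIB))   # 640 seconds for a single 1GB region
print(khugepaged_scan_seconds(32 * GIB))  # 20480 seconds, roughly 5.7 hours
```

Under these assumptions, covering a single 1GB region takes more than ten minutes, which is the slowness being referred to here.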


Not sure we really want a daemon. You could just kick khugepaged instead - for example to run on a specific process. I think if - at all - it should be the application that gives additional advises. But that is a different discussion than 1 GB THP.



The difference with my proposal is that it does not need a 1GB THP allocation, so there are no special requirements like using CMA
or increasing MAX_ORDER in the buddy allocator to allow 1GB page allocation. It makes creating THPs with orders > MAX_ORDER possible
without other intrusive changes.

Anything that relies on large allocations succeeding purely because "ZONE_NORMAL memory is usually not fragmented after boot" is broken by design. That's why we have CMA, it can give guarantees (well, once we fix all remaining issues :) ).

It seems that you are suggesting I should use CMA for 1GB THP allocation, since CMA can give guarantees for large allocations.
Using CMA for 1GB THP would be a great first step to get 1GB THP working, then we can replace it with other large allocation
mechanisms later.

No, as already expressed multiple times, I don't think this is the right thing to do.

--
Thanks,

David / dhildenb
