On 24.02.21 23:35, Zi Yan wrote:
> From: Zi Yan <ziy@xxxxxxxxxx>
>
> Hi all,
>
> I have rebased my 1GB PUD THP support patches on v5.11-mmotm-2021-02-18-18-29
> and the code is available at
> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.11-mmotm-2021-02-18-18-29
> if you want to give it a try. The actual 49 patches are not sent out with this
> cover letter. :)
>
> Instead of asking for code review, I would like to discuss the concerns I got
> from previous RFCs. I think there are two major ones:
>
> 1. 1GB page allocation. The current implementation allocates 1GB pages from
>    CMA regions that are reserved at boot time, like hugetlbfs. The concern
>    with using CMA is that an educated guess is needed to avoid depleting
>    kernel memory in case the CMA regions are set too large. Recently, David
>    Rientjes proposed using process_madvise() for hugepage collapse, which is
>    an alternative [1] but might not work for 1GB pages, since there is no
>    way of

I see two core ideas of THP:
1) Transparent to the user: you get a speedup without really having to care,
*except* that you sometimes have to enable/disable the optimization manually
(i.e., MADV_HUGEPAGE; see the sketch below this list), because in corner
cases (e.g., userfaultfd) it's not completely transparent and might have
performance impacts. mprotect(), mmap(MAP_FIXED), and mremap() work as
expected.

2) Transparent to other subsystems of the kernel: the page size of the
mapping is in base pages - we can split anytime on demand in case we
cannot handle THP. In addition, no special requirements: no CMA, no
movability restrictions, no swappability restrictions, ... most stuff
works transparently by splitting.
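
To illustrate 1) with a minimal userspace sketch (my own example, not from
the patch set): the only explicit step is the MADV_HUGEPAGE opt-in;
everything else is an ordinary anonymous mapping that the kernel may or may
not back with THPs.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SZ_2M   (2UL << 20)

int main(void)
{
        size_t len = 64 * SZ_2M;

        /* Ordinary anonymous mapping; nothing huge-page specific yet. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* The one manual step: ask the kernel to prefer THPs for this range. */
        if (madvise(p, len, MADV_HUGEPAGE))
                perror("madvise(MADV_HUGEPAGE)");

        /* Touch the memory; faults may now be served with 2MB THPs. */
        memset(p, 0, len);

        munmap(p, len);
        return 0;
}

Whether the range actually ends up backed by huge pages is still entirely up
to the kernel - which is exactly the "transparent" part.
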
Your current approach messes with 2). Your proposal here messes with 1).

Any kind of explicit placement by the user can silently get reverted any
time. So process_madvise() would really only be useful in cases where a
temporary split might get reverted later on by the OS automatically -
like we have for 2MB THP right now.

So process_madvise() is less likely to help if the system won't try
collapsing automatically (more below).
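
For reference, a rough sketch of what a collapse request via
process_madvise() might look like from userspace - purely illustrative,
since no collapse advice value exists today; MADV_COLLAPSE_HINT below is a
made-up placeholder for whatever the proposal in [1] would add:

#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define MADV_COLLAPSE_HINT      99      /* hypothetical advice value, not in any kernel */

static int request_collapse(pid_t pid, void *addr, size_t len)
{
        struct iovec iov = { .iov_base = addr, .iov_len = len };
        long ret;

        /* pidfd_open() is available since Linux 5.3. */
        int pidfd = syscall(SYS_pidfd_open, pid, 0);
        if (pidfd < 0)
                return -1;

        /* process_madvise() is available since Linux 5.10 (no glibc wrapper yet). */
        ret = syscall(SYS_process_madvise, pidfd, &iov, 1, MADV_COLLAPSE_HINT, 0);
        close(pidfd);
        return ret < 0 ? -1 : 0;
}

Even then, as said above, whatever placement such a call achieves is only a
hint and can be undone again by a later split.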

> _allocating_ a 1GB page into which pages can be collapsed. I proposed a
> similar approach at LSF/MM 2019, generating physically contiguous memory
> after pages are allocated [2], which is usable for 1GB THPs. This approach
> does in-place huge page promotion and thus does not require page allocation.

I like the idea of forming a 1GB THP at a location where already-consecutive
pages allow for it. It can be applied generically - and both 1) and 2) keep
working as expected. Any time there was a split, we can retry forming a THP
later.

However, I don't follow how this is actually feasible at a big scale. You
could only ever collapse into a 1GB THP if you happen to have 1GB of
consecutive 2MB THPs / 4k pages already. Sounds to me like that only happens
when the stars align.
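
To make the "stars align" point concrete: in-place promotion can only kick
in if the 1GB virtual range already happens to be backed by one physically
contiguous (and suitably aligned) chunk. A rough way to check that
precondition from userspace - my illustration, not part of any patch; it
needs CAP_SYS_ADMIN, otherwise the PFNs in /proc/self/pagemap read back
as 0:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define PAGE_SIZE       4096UL
#define PM_PFN_MASK     ((1ULL << 55) - 1)      /* bits 0-54: page frame number */
#define PM_PRESENT      (1ULL << 63)            /* bit 63: page present in RAM */

/* Returns 1 if [addr, addr + len) is backed by one physically contiguous run. */
static int range_is_contiguous(void *addr, size_t len)
{
        uint64_t prev_pfn = 0;
        int contiguous = 1;
        size_t off;

        int fd = open("/proc/self/pagemap", O_RDONLY);
        if (fd < 0)
                return -1;

        for (off = 0; off < len; off += PAGE_SIZE) {
                uintptr_t vaddr = (uintptr_t)addr + off;
                uint64_t entry;

                /* One 64-bit pagemap entry per 4k virtual page. */
                if (pread(fd, &entry, sizeof(entry),
                          (vaddr / PAGE_SIZE) * sizeof(entry)) != sizeof(entry) ||
                    !(entry & PM_PRESENT)) {
                        contiguous = 0;
                        break;
                }

                if (off && (entry & PM_PFN_MASK) != prev_pfn + 1) {
                        contiguous = 0;
                        break;
                }
                prev_pfn = entry & PM_PFN_MASK;
        }

        close(fd);
        return contiguous;
}

For an actual 1GB THP the first PFN would additionally have to be
1GB-aligned; finding ranges where all of that is already true, without any
migration, is what makes this look like a rare event.
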
--
Thanks,
David / dhildenb