Re: [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap for mTHP support

On 09/01/25 5:01 am, Nico Pache wrote:
khugepaged scans PMD ranges for potential collapse to a hugepage. To add
mTHP support, we use this scan instead to record chunks of fully utilized
sections of the PMD.

Create a bitmap to represent a PMD in order-MTHP_MIN_ORDER chunks.
By default we will set this to order 3. The reasoning is that for a 4K
page size with a 512-entry PMD, this results in a 64-bit bitmap, which
has some optimizations. For other arches, like ARM64 with 64K pages, we
can set a larger order if needed.

khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
that represents chunks of fully utilized regions. We can then determine
what mTHP size fits best, and in the following patch we set this bitmap
while scanning the PMD.

max_ptes_none is used as a scale to determine how "full" an order must
be before being considered for collapse.

Signed-off-by: Nico Pache <npache@xxxxxxxxxx>

Here is my objective and subjective analysis.

--------------------- Mathematical Analysis ----------------------------

First off, in my series, I am missing one thing: when I fail to collapse a range after exhausting all orders, I should jump to the next range, starting at the alignment of the order at which I just failed (i.e., the minimum allowable order). Currently I am exiting, which is wasteful. This should be easy to extend.
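
For reference, the fix amounts to roughly the following (a sketch with made-up names, not code from either series):

/*
 * Sketch: after all orders down to the minimum allowable order have
 * failed for the range at addr, skip ahead to the start of the next
 * block aligned to that minimum order, instead of giving up on the
 * rest of the PMD.
 */
static unsigned long next_scan_addr(unsigned long addr,
				    unsigned int min_order,
				    unsigned int page_shift)
{
	unsigned long size = 1UL << (min_order + page_shift);

	/* round up to the next size-aligned boundary */
	return (addr + size) & ~(size - 1);
}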

Let's call Nico Pache's method NP, and Dev Jain's method DJ.

The only difference between NP and DJ is the remembrance of the state of the PTEs (I have already reverted to using the write lock for my v2; see this reply: https://lore.kernel.org/all/71a2f471-3082-4ca2-ac48-2f664977282f@xxxxxxx/). NP saves empty and filled PTEs in a bitmap, and then uses the optimized (let us assume them to be constant-time operations, hopefully?) bitmap APIs, like bitmap_shift_right() and bitmap_weight(). The latter is what determines whether, for a particular order, the range has enough filled PTEs to justify calling collapse_huge_page(). DJ does this naively with a brute-force iteration. Now, the edge NP has over DJ is just before calling collapse_huge_page(). After that call, everything remains the same; assuming that both DJ and NP derive the same collapse ranges, collapse_huge_page() succeeds in NP if and only if it succeeds in DJ. NP simply knows more quickly when and when not to call collapse_huge_page().
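
To make this concrete, the per-order check in NP presumably boils down to something like the following (a userspace sketch with my own names, not the patch's; __builtin_popcountll() stands in for the kernel's bitmap_weight()):

#include <stdbool.h>
#include <stdint.h>

#define PTES_PER_PMD	512
#define MTHP_MIN_ORDER	3	/* order-3 chunks, as in the patch */
#define BITMAP_BITS	(PTES_PER_PMD >> MTHP_MIN_ORDER)	/* 64 */

/*
 * One bit per order-3 chunk of the PMD range; a set bit means the
 * chunk is fully utilized. For a candidate order, extract the bits
 * covering the naturally aligned sub-range and compare the weight
 * against a threshold derived from max_ptes_none.
 */
static bool range_worth_collapsing(uint64_t bitmap, unsigned int first_bit,
				   unsigned int nr_bits, unsigned int threshold)
{
	uint64_t mask = (nr_bits >= 64) ? ~0ULL : ((1ULL << nr_bits) - 1);
	uint64_t sub = (bitmap >> first_bit) & mask;

	/* popcount plays the role of bitmap_weight() */
	return (unsigned int)__builtin_popcountll(sub) >= threshold;
}

Checking an order-6 candidate (64 PTEs = 8 chunks) at the second such block is then range_worth_collapsing(bm, 8, 8, t): constant-time, versus DJ rescanning all 64 PTEs.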

So the question is, how many iterations of PTE scans does NP save over DJ? We prove a stronger result:

Let the PTE table consist of 2^x PTE entries, where x >= 2 is a natural number (x >= 2 because anon collapse is not supported for x < 2). Let f(x) = the number of iterations performed by DJ in the worst case. The worst case is: all orders are enabled, and we take the worst possible distribution of the PTEs.

Lemma: f(x) <= 2^x * (x-1).

Proof: We perform weak mathematical induction on x. Assume zero-indexing, and assume the worst case that all orders are enabled.

Base case: Let x = 4. We have 16 entries. NP does 16 iterations. In the worst case, this is what DJ may do: it will iterate over all 16 and not collapse. Then it will iterate over PTE entries 0-7, and not collapse. Then it will iterate over 0-3, and may or may not collapse. Here is the worst case (when I write l-r below, I mean the range l-r, both inclusive):

0-15 fail -> 0-7 fail -> 0-3 fail/success -> 4-7 fail/success -> 8-15 fail -> 8-11 fail/success -> 12-15 fail/success

#iterations = 16+8+4+4+8+4+4 = 48 = 2^4 * (4-1).
Convince yourself that f(2) == 4 and f(3) <= 16.

Inductive hypothesis: Assume the lemma is true for some x >= 4.

We need to prove it for x+1. Let X = 2^(x+1) - 1 and Y = 2^x - 1.
Let DJ start scanning from 0. If 0-X is a success, we are done. So assume 0-X fails. Now DJ looks at 0-Y. Note that, for any starting position p such that 0 <= p <= Y, once the full-range scan has failed, there is no way a scan starting from p will cross into the next half, i.e. (Y+1)-X: the scan length from p is at most the size of the largest power-of-two block to which p is naturally aligned, and the only such block crossing the halves is 0-X itself, which has already failed. Given this, we scan 0-Y completely, and then start from Y+1. Having established the above argument, we can apply the inductive hypothesis to 0-Y and (Y+1)-X to derive f(x+1) <= 2^(x+1) + 2f(x) <= 2^(x+1) + 2(2^x * (x-1)) = 2^(x+1) + 2^(x+1) * (x-1) = 2^(x+1) * x. Q.E.D. (You can simulate the proof for x = 9; what I mean to say is, we can divide 0-511 into 0-255 and 256-511.)

So, for our case, NP performs 512 iterations, and DJ performs, in the worst case, 512 * 8 = 4096 iterations. Hmm...
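
The recurrence is also trivial to sanity-check mechanically (standalone C; note that with f(2) = 4, equality holds at every step, so the bound is tight):

#include <stdio.h>

/*
 * Worst case from the proof: the full scan of 2^x entries fails,
 * then both halves are handled recursively:
 *	f(x+1) = 2^(x+1) + 2*f(x),  f(2) = 4
 */
static unsigned long f(unsigned int x)
{
	return (x == 2) ? 4 : (1UL << x) + 2 * f(x - 1);
}

int main(void)
{
	for (unsigned int x = 2; x <= 9; x++)
		printf("x=%u: f=%lu, bound=%lu\n",
		       x, f(x), (1UL << x) * (x - 1));
	return 0;
}

For x = 9 this prints f = 4096 = 2^9 * 8, matching the lemma.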

----------------------- Subjective Analysis --------------------------

[1] The worst case is, well, the worst case; the practical case on arm64 machines is that only 2048K and 64K are enabled. So NP performs 512 iterations, and DJ performs 512 + 16 * (number of 64K chunks) = 512 + 16 * 32 = 1024 iterations. That is not much of a difference.

[2] Both implementations have the creep problem described here:
https://lore.kernel.org/all/20241216165105.56185-13-dev.jain@xxxxxxx/

[3] The bitmaps are being created only for the pte_none case, whereas we also have the shared and the swap cases. In fact, for the none case, if we have the PMD order enabled, we will almost surely collapse to PMD size, given that the common case is khugepaged_max_ptes_none = 511: if we have even one PTE filled, we will call collapse_huge_page(), and both DJ and NP will perform 512 iterations. Therefore, the bitmaps also need to be extended to the shared and the swap cases so as to get any benefit from the idea in a practical scenario.
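
For what it's worth, the scaling mentioned in the changelog presumably amounts to something like this (my own sketch, not code from either series; the shared and swap cutoffs would need the same treatment if the bitmaps are extended):

#define HPAGE_PMD_ORDER	9	/* 4K pages, 512-entry PMD */

/*
 * Sketch: scale the PMD-level tolerance for empty PTEs down to a
 * candidate order. With max_ptes_none = 511, order 9 tolerates 511
 * empty PTEs, so a single filled PTE triggers a PMD collapse, as
 * described above.
 */
static unsigned int scaled_max_ptes_none(unsigned int max_ptes_none,
					 unsigned int order)
{
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}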

[4] NP does not handle scanning VMAs of size less than a PMD. Since NP introduces a single entry point, khugepaged_collapse_single_pmd(), I am not sure how difficult it will be to extend the implementation, given that MTHP_BITMAP_SIZE is a compile-time constant. I have extended this in my v2 and it works.

[5] In NP, for a bit to be set, the chunk needs to be completely filled/shared/swapped out. This completely changes the meaning of the sysfs parameters max_ptes_*. It also makes things very hard to debug, since it may happen that distribution D1 has more PTEs filled but fewer bits set in the bitmap than distribution D2. DJ also changes the meaning of the parameters due to scaling errors, but that is only an off-by-one error; therefore, the behaviour is easier to predict.

[6] In NP, we have: remember the state of the PTEs -> alloc_charge_folio() -> read_lock(), unlock() -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush for the PMD. There is a significant time window here, and the remembered PTE state may be vastly different from what we have by collapse time. Obviously I cannot pinpoint an exact number as to how bad this is for the accuracy of khugepaged. For DJ, since a particular PTE may come into the scan range multiple times, DJ gives the range a chance if the distribution changed recently.

[7] The last time I tried to save on the number of iterations over PTE entries, this happened:

https://lore.kernel.org/all/ZugxqJ-CjEi5lEW_@xxxxxxxxxxxxxxxxxxxx/

Matthew Wilcox pointed out a potential regression in a patch which, on paper, looked like an "obvious optimization" to me; I tested, and it turned out he was correct:

https://lore.kernel.org/all/8700274f-b521-444e-8d17-c06039a1376c@xxxxxxx/

We could argue whether it is worth having the bitmap memory initialization, copying, weight checking, and recursion overhead.

This is the most I can come up with by analyzing from a third-person perspective :)




