Re: [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap for mTHP support

On 09/01/25 5:01 am, Nico Pache wrote:
khugepaged scans PMD ranges for potential collapse to a hugepage. To add
mTHP support, we use this scan instead to record chunks of fully utilized
sections of the PMD.

Create a bitmap to represent a PMD in order-MTHP_MIN_ORDER chunks.
By default we will set this to order 3. The reasoning is that for a 4K
page size with a 512-entry PMD, this results in a 64-bit bitmap, which
has some optimizations. For other arches, like ARM64 with 64K pages, we
can set a larger order if needed.

khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
that represents chunks of fully utilized regions. We can then determine
what mTHP size fits best, and in the following patch we set this bitmap
while scanning the PMD.

max_ptes_none is used as a scale to determine how "full" an order must
be before being considered for collapse.

Signed-off-by: Nico Pache <npache@xxxxxxxxxx>

Here is my objective and subjective analysis.

--------------------- Mathematical Analysis ----------------------------

First off, in my series, I am missing one thing: when I fail to collapse a range after exhausting all orders, I should jump to the next range, starting at the alignment of the order at which I just failed (i.e., the minimum allowable order). Currently I am exiting, which is wasteful. This should be easy to extend.
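
For reference, the fix amounts to roughly the following (a sketch with made-up names, not code from either series):

/*
 * Sketch: after all orders down to the minimum allowable order have
 * failed for the range at addr, skip ahead to the start of the next
 * block aligned to that minimum order, instead of giving up on the
 * rest of the PMD.
 */
static unsigned long next_scan_addr(unsigned long addr,
				    unsigned int min_order,
				    unsigned int page_shift)
{
	unsigned long size = 1UL << (min_order + page_shift);

	/* round up to the next size-aligned boundary */
	return (addr + size) & ~(size - 1);
}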

Let's call Nico Pache's method NP, and Dev Jain's method DJ.

The only difference between NP and DJ is the remembrance of the state of the PTEs (I have already reverted to using the write lock for my v2; see this reply: https://lore.kernel.org/all/71a2f471-3082-4ca2-ac48-2f664977282f@xxxxxxx/). NP saves empty and filled PTEs in a bitmap, and then uses the optimized (let us assume them to be constant-time operations, hopefully?) bitmap APIs, like bitmap_shift_right() and bitmap_weight(). The latter is what determines whether, for a particular order, the range has enough filled PTEs to justify calling collapse_huge_page(). DJ does this naively with a brute-force iteration. Now, the edge NP has over DJ is just before calling collapse_huge_page(). After that call, everything remains the same; assuming that both DJ and NP derive the same collapse ranges, collapse_huge_page() succeeds in NP if and only if it succeeds in DJ. NP simply knows more quickly when and when not to call collapse_huge_page().
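
To make this concrete, the per-order check in NP presumably boils down to something like the following (a userspace sketch with my own names, not the patch's; __builtin_popcountll() stands in for the kernel's bitmap_weight()):

#include <stdbool.h>
#include <stdint.h>

#define PTES_PER_PMD	512
#define MTHP_MIN_ORDER	3	/* order-3 chunks, as in the patch */
#define BITMAP_BITS	(PTES_PER_PMD >> MTHP_MIN_ORDER)	/* 64 */

/*
 * One bit per order-3 chunk of the PMD range; a set bit means the
 * chunk is fully utilized. For a candidate order, extract the bits
 * covering the naturally aligned sub-range and compare the weight
 * against a threshold derived from max_ptes_none.
 */
static bool range_worth_collapsing(uint64_t bitmap, unsigned int first_bit,
				   unsigned int nr_bits, unsigned int threshold)
{
	uint64_t mask = (nr_bits >= 64) ? ~0ULL : ((1ULL << nr_bits) - 1);
	uint64_t sub = (bitmap >> first_bit) & mask;

	/* popcount plays the role of bitmap_weight() */
	return (unsigned int)__builtin_popcountll(sub) >= threshold;
}

Checking an order-6 candidate (64 PTEs = 8 chunks) at the second such block is then range_worth_collapsing(bm, 8, 8, t): constant-time, versus DJ rescanning all 64 PTEs.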

So the question is, how many iterations of PTE scans does NP save over DJ? We prove a stronger result:

Let the PTE table consist of 2^x PTE entries, where x >= 2 is a natural number (x >= 2 because anon collapse is not supported for x < 2). Let f(x) = the number of iterations performed by DJ in the worst case. The worst case is: all orders are enabled, and we take the worst possible distribution of the PTEs.

Lemma: f(x) <= 2^x * (x-1).

Proof: We perform weak mathematical induction on x. Assume zero-indexing, and assume the worst case that all orders are enabled.

Base case: Let x = 4. We have 16 entries. NP does 16 iterations. In the worst case, this is what DJ may do: it will iterate over all 16 and not collapse. Then it will iterate over PTE entries 0-7, and not collapse. Then it will iterate over 0-3, and may or may not collapse. Here is the worst case (when I write l-r below, I mean the range l-r, both inclusive):

0-15 fail -> 0-7 fail -> 0-3 fail/success -> 4-7 fail/success -> 8-15 fail -> 8-11 fail/success -> 12-15 fail/success

#iterations = 16+8+4+4+8+4+4 = 48 = 2^4 * (4-1).
Convince yourself that f(2) == 4 and f(3) <= 16.

Inductive hypothesis: Assume the lemma is true for some x >= 4.

We need to prove it for x+1. Let X = 2^(x+1) - 1 and Y = 2^x - 1.
Let DJ start scanning from 0. If 0-X is a success, we are done. So assume 0-X fails. Now DJ looks at 0-Y. Note that, for any starting position p such that 0 <= p <= Y, once the full-range scan has failed, there is no way a scan starting from p will cross into the next half, i.e. (Y+1)-X: the scan length from p is at most the size of the largest power-of-two block to which p is naturally aligned, and the only such block crossing the halves is 0-X itself, which has already failed. Given this, we scan 0-Y completely, and then start from Y+1. Having established the above argument, we can apply the inductive hypothesis to 0-Y and (Y+1)-X to derive f(x+1) <= 2^(x+1) + 2f(x) <= 2^(x+1) + 2(2^x * (x-1)) = 2^(x+1) + 2^(x+1) * (x-1) = 2^(x+1) * x. Q.E.D. (You can simulate the proof for x = 9; what I mean to say is, we can divide 0-511 into 0-255 and 256-511.)

So, for our case, NP performs 512 iterations, and DJ performs, in the worst case, 512 * 8 = 4096 iterations. Hmm...
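
The recurrence is also trivial to sanity-check mechanically (standalone C; note that with f(2) = 4, equality holds at every step, so the bound is tight):

#include <stdio.h>

/*
 * Worst case from the proof: the full scan of 2^x entries fails,
 * then both halves are handled recursively:
 *	f(x+1) = 2^(x+1) + 2*f(x),  f(2) = 4
 */
static unsigned long f(unsigned int x)
{
	return (x == 2) ? 4 : (1UL << x) + 2 * f(x - 1);
}

int main(void)
{
	for (unsigned int x = 2; x <= 9; x++)
		printf("x=%u: f=%lu, bound=%lu\n",
		       x, f(x), (1UL << x) * (x - 1));
	return 0;
}

For x = 9 this prints f = 4096 = 2^9 * 8, matching the lemma.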

----------------------- Subjective Analysis --------------------------

[1] The worst case is, well, the worst case; the practical case on arm64 machines is that only 2048K and 64K are enabled. So NP performs 512 iterations, and DJ performs 512 + 16 * (number of 64K chunks) = 512 + 16 * 32 = 1024 iterations. That is not much of a difference.

[2] Both implementations have the creep problem described here:
https://lore.kernel.org/all/20241216165105.56185-13-dev.jain@xxxxxxx/

[3] The bitmaps are being created only for the pte_none case, whereas we also have the shared and the swap cases. In fact, for the none case, if we have the PMD order enabled, we will almost surely collapse to PMD size, given that the common case is khugepaged_max_ptes_none = 511: if we have even one PTE filled, we will call collapse_huge_page(), and both DJ and NP will perform 512 iterations. Therefore, the bitmaps also need to be extended to the shared and the swap cases so as to get any benefit from the idea in a practical scenario.
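
For what it's worth, the scaling mentioned in the changelog presumably amounts to something like this (my own sketch, not code from either series; the shared and swap cutoffs would need the same treatment if the bitmaps are extended):

#define HPAGE_PMD_ORDER	9	/* 4K pages, 512-entry PMD */

/*
 * Sketch: scale the PMD-level tolerance for empty PTEs down to a
 * candidate order. With max_ptes_none = 511, order 9 tolerates 511
 * empty PTEs, so a single filled PTE triggers a PMD collapse, as
 * described above.
 */
static unsigned int scaled_max_ptes_none(unsigned int max_ptes_none,
					 unsigned int order)
{
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}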

[4] NP does not handle scanning VMAs of size less than a PMD. Since NP introduces a single entry point, khugepaged_collapse_single_pmd(), I am not sure how difficult it will be to extend the implementation, given that MTHP_BITMAP_SIZE is a compile-time constant. I have extended this in my v2 and it works.

[5] In NP, for a bit to be set, the chunk needs to be completely filled/shared/swapped out. This completely changes the meaning of the sysfs parameters max_ptes_*. It also makes things very hard to debug, since it may happen that distribution D1 has more PTEs filled but fewer bits set in the bitmap than distribution D2. DJ also changes the meaning of the parameters due to scaling errors, but that is only an off-by-one error; therefore, the behaviour is easier to predict.

[6] In NP, we have: remember the state of the PTEs -> alloc_charge_folio() -> read_lock(), unlock() -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush for the PMD. There is a significant time window here, and the remembered PTE state may be vastly different from what we have by collapse time. Obviously I cannot pinpoint an exact number as to how bad this is for the accuracy of khugepaged. For DJ, since a particular PTE may come into the scan range multiple times, DJ gives the range a chance if the distribution changed recently.

[7] The last time I tried to save on the number of iterations over PTE entries, this happened:

https://lore.kernel.org/all/ZugxqJ-CjEi5lEW_@xxxxxxxxxxxxxxxxxxxx/

Matthew Wilcox pointed out a potential regression in a patch which, on paper, looked like an "obvious optimization" to me; I tested, and it turned out he was correct:

https://lore.kernel.org/all/8700274f-b521-444e-8d17-c06039a1376c@xxxxxxx/

We could argue whether it is worth having the bitmap memory initialization, copying, weight checking, and recursion overhead.

This is the most I can come up with by analyzing from a third-person perspective :)




