Re: [RFC v2 0/9] khugepaged: mTHP support

On Mon, Feb 17, 2025 at 1:06 AM Dev Jain <dev.jain@xxxxxxx> wrote:
>
>
>
> On 15/02/25 12:08 pm, Dev Jain wrote:
> >
> >
> > On 15/02/25 6:22 am, Nico Pache wrote:
> >> On Thu, Feb 13, 2025 at 7:02 PM Dev Jain <dev.jain@xxxxxxx> wrote:
> >>>
> >>>
> >>>
> >>> On 14/02/25 1:09 am, Nico Pache wrote:
> >>>> On Thu, Feb 13, 2025 at 1:26 AM Dev Jain <dev.jain@xxxxxxx> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 12/02/25 10:19 pm, Nico Pache wrote:
> >>>>>> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@xxxxxxx> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 11/02/25 6:00 am, Nico Pache wrote:
> >>>>>>>> The following series provides khugepaged and madvise collapse
> >>>>>>>> with the
> >>>>>>>> capability to collapse regions to mTHPs.
> >>>>>>>>
> >>>>>>>> To achieve this we generalize the khugepaged functions to no
> >>>>>>>> longer depend
> >>>>>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks
> >>>>>>>> of pages
> >>>>>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> >>>>>>>> using a bitmap. After the PMD scan is done, we do binary
> >>>>>>>> recursion on the
> >>>>>>>> bitmap to find the optimal mTHP sizes for the PMD range. The
> >>>>>>>> restriction
> >>>>>>>> on max_ptes_none is removed during the scan, to make sure we
> >>>>>>>> account for
> >>>>>>>> the whole PMD range. max_ptes_none will be scaled by the
> >>>>>>>> attempted collapse
> >>>>>>>> order to determine how full a THP must be to be eligible. If a
> >>>>>>>> mTHP collapse
> >>>>>>>> is attempted but contains swapped-out or shared pages, we don't
> >>>>>>>> perform the collapse.
> >>>>>>>>
> >>>>>>>> With the default max_ptes_none=511, the code should keep most of its
> >>>>>>>> original behavior. To exercise mTHP collapse we need to set
> >>>>>>>> max_ptes_none<=255.
> >>>>>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse
> >>>>>>>> "creep" and
> >>>>>>>> constantly promote mTHPs to the next available size.
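> >>>>>>>>
> >>>>>>>> A rough sketch of the bitmap recursion and the scaling described
> >>>>>>>> above (illustrative only: the helper names are hypothetical, the
> >>>>>>>> shift-based scaling is an assumption, and enabled-order checks
> >>>>>>>> are omitted):
> >>>>>>>>
> >>>>>>>> /* allowed empty PTEs shrink proportionally with the order */
> >>>>>>>> static unsigned int scaled_max_ptes_none(unsigned int order)
> >>>>>>>> {
> >>>>>>>>         /* e.g. 511 >> (9 - 4) = 15 for an order-4 collapse */
> >>>>>>>>         return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> /* each bit covers one 2^MTHP_MIN_ORDER chunk of the PMD range */
> >>>>>>>> static void scan_bitmap(unsigned long *bm, unsigned long chunk,
> >>>>>>>>                         unsigned int order)
> >>>>>>>> {
> >>>>>>>>         unsigned long nr_chunks = 1UL << (order - MTHP_MIN_ORDER);
> >>>>>>>>         unsigned long i, used = 0;
> >>>>>>>>
> >>>>>>>>         for (i = 0; i < nr_chunks; i++)
> >>>>>>>>                 used += test_bit(chunk + i, bm) << MTHP_MIN_ORDER;
> >>>>>>>>
> >>>>>>>>         /* full enough for this order? attempt the collapse */
> >>>>>>>>         if (used >= (1UL << order) - scaled_max_ptes_none(order)) {
> >>>>>>>>                 attempt_collapse(chunk << MTHP_MIN_ORDER, order);
> >>>>>>>>                 return;
> >>>>>>>>         }
> >>>>>>>>
> >>>>>>>>         /* otherwise recurse into both halves at the next lower order */
> >>>>>>>>         if (order > MTHP_MIN_ORDER) {
> >>>>>>>>                 scan_bitmap(bm, chunk, order - 1);
> >>>>>>>>                 scan_bitmap(bm, chunk + nr_chunks / 2, order - 1);
> >>>>>>>>         }
> >>>>>>>> }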
> >>>>>>>>
> >>>>>>>> Patch 1:     Some refactoring to combine madvise_collapse and
> >>>>>>>> khugepaged
> >>>>>>>> Patch 2:     Refactor/rename hpage_collapse
> >>>>>>>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> >>>>>>>> Patch 6-9:   The mTHP patches
> >>>>>>>>
> >>>>>>>> ---------
> >>>>>>>>      Testing
> >>>>>>>> ---------
> >>>>>>>> - Built for x86_64, aarch64, ppc64le, and s390x
> >>>>>>>> - selftests mm
> >>>>>>>> - I created a test script that I used to push khugepaged to its
> >>>>>>>> limits while
> >>>>>>>>        monitoring a number of stats and tracepoints. The code is
> >>>>>>>> available
> >>>>>>>>        here[1] (Run in legacy mode for these changes and set
> >>>>>>>> mthp sizes to inherit)
> >>>>>>>>        The summary from my testing was that there was no
> >>>>>>>>        significant regression noticed through this test. In some
> >>>>>>>>        cases my changes had better collapse latencies, and were
> >>>>>>>>        able to scan more pages in the same amount of time/work,
> >>>>>>>>        but for the most part the results were consistent.
> >>>>>>>> - redis testing. I tested these changes along with my defer changes
> >>>>>>>>       (see follow-up post for more details).
> >>>>>>>> - some basic testing on 64k page size.
> >>>>>>>> - lots of general use. These changes have been running in my VM
> >>>>>>>> for some time.
> >>>>>>>>
> >>>>>>>> Changes since V1 [2]:
> >>>>>>>> - Minor bug fixes discovered during review and testing
> >>>>>>>> - removed dynamic allocations for bitmaps, and made them stack-based
> >>>>>>>> - Adjusted bitmap offset from u8 to u16 to support 64k page size.
> >>>>>>>> - Updated trace events to include collapsing order info.
> >>>>>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100
> >>>>>>>> scale.
> >>>>>>>> - No longer require a chunk to be fully utilized before setting
> >>>>>>>> the bit. Use
> >>>>>>>>        the same max_ptes_none scaling principle to achieve this.
> >>>>>>>> - Skip mTHP collapse that requires swapin or shared handling.
> >>>>>>>> This helps prevent
> >>>>>>>>        some of the "creep" that was discovered in v1.
> >>>>>>>>
> >>>>>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> >>>>>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@xxxxxxxxxx/
> >>>>>>>>
> >>>>>>>> Nico Pache (9):
> >>>>>>>>       introduce khugepaged_collapse_single_pmd to unify
> >>>>>>>> khugepaged and
> >>>>>>>>         madvise_collapse
> >>>>>>>>       khugepaged: rename hpage_collapse_* to khugepaged_*
> >>>>>>>>       khugepaged: generalize hugepage_vma_revalidate for mTHP
> >>>>>>>> support
> >>>>>>>>       khugepaged: generalize alloc_charge_folio for mTHP support
> >>>>>>>>       khugepaged: generalize __collapse_huge_page_* for mTHP
> >>>>>>>> support
> >>>>>>>>       khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >>>>>>>>       khugepaged: add mTHP support
> >>>>>>>>       khugepaged: improve tracepoints for mTHP orders
> >>>>>>>>       khugepaged: skip collapsing mTHP to smaller orders
> >>>>>>>>
> >>>>>>>>      include/linux/khugepaged.h         |   4 +
> >>>>>>>>      include/trace/events/huge_memory.h |  34 ++-
> >>>>>>>>      mm/khugepaged.c                    | 422 +++++++++++++++++++----------
> >>>>>>>>      3 files changed, 306 insertions(+), 154 deletions(-)
> >>>>>>>>
> >>>>>>>
> >>>>>>> Does this patchset suffer from the problem described here:
> >>>>>>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@xxxxxxx/
> >>>>>> Hi Dev,
> >>>>>>
> >>>>>> Sorry I meant to get back to you about that.
> >>>>>>
> >>>>>> I understand your concern, but like I've mentioned before, the scan
> >>>>>> with the read lock was done so we don't have to do the more expensive
> >>>>>> locking, and could still gain insight into the state. You are right
> >>>>>> that this info could become stale if the state changes dramatically,
> >>>>>> but the collapse_isolate function will verify it and not collapse.
> >>>>>
> >>>>> If the state changes dramatically, the _isolate function will
> >>>>> verify it,
> >>>>> and fallback. And this fallback happens after following this costly
> >>>>> path: retrieve a large folio from the buddy allocator -> swapin pages
> >>>>> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB
> >>>>> flush
> >>>>> on all CPUs -> fallback in _isolate().
> >>>>> If you do fail in _isolate(), doesn't it make sense to get the updated
> >>>>> state for the next fallback order immediately, because we have prior
> >>>>> information that we failed because of PTE state? What your algorithm
> >>>>> will do is *still* follow the costly path described above, and again
> >>>>> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd()
> >>>>> like
> >>>>> mine would.
> >>>>
> >>>> You do raise a valid point here, I can optimize my solution by
> >>>> detecting certain collapse failure types and jump to the next scan.
> >>>> I'll add that to my solution, thanks!
> >>>>
> >>>> As for the disagreement around the bitmap, we'll leave that up to the
> >>>> community to decide since we have differing opinions/solutions.
> >>>>
> >>>>>
> >>>>> The verification of the PTE state by the _isolate() function is the
> >>>>> "no
> >>>>> turning back" point of the algorithm. The verification by
> >>>>> hpage_collapse_scan_pmd() is the "let us see if proceeding is even
> >>>>> worth
> >>>>> it, before we do costly operations" point of the algorithm.
> >>>>>
> >>>>>>    From my testing I found this to rarely happen.
> >>>>>
> >>>>> Unfortunately, I am not very familiar with performance testing/load
> >>>>> testing; I am fairly new to kernel programming, so I am getting there.
> >>>>> But it really depends on the type of test you are running, what
> >>>>> actually
> >>>>> runs on memory-intensive systems, etc etc. In fact, on loaded
> >>>>> systems I
> >>>>> would expect the PTE state to dramatically change. But still, no
> >>>>> opinion
> >>>>> here.
> >>>>
> >>>> Yeah there are probably some cases where it happens more often.
> >>>> Probably in cases of short lived allocations, but khugepaged doesn't
> >>>> run that frequently so those won't be that big of an issue.
> >>>>
> >>>> Our performance team is currently testing my implementation so I
> >>>> should have more real workload test results soon. The redis testing
> >>>> had some gains and didn't show any signs of obvious regressions.
> >>>>
> >>>> As for the testing, check out
> >>>> https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record-khuge-performance.sh?ref_type=heads
> >>>> this does the tracing for my testing script. It can help you get
> >>>> started. There are 3 different traces being applied there: the
> >>>> bpftrace for collapse latencies, the perf record for the flamegraph
> >>>> (not actually that useful, but may be useful to visualize any
> >>>> weird/long paths that you may not have noticed), and the trace-cmd
> >>>> which records the tracepoint of the scan and the collapse functions
> >>>> then processes the data using the awk script-- the output being the
> >>>> scan rate, the pages collapsed, and their result status (grouped by
> >>>> order).
> >>>>
> >>>> You can also look into https://github.com/gormanm/mmtests for
> >>>> testing/comparing kernels. I was running the
> >>>> config-memdb-redis-benchmark-medium workload.
> >>>
> >>> Thanks. I'll take a look.
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>> Also, khugepaged, my changes, and your changes are all victims of
> >>>>>> this. Once we drop the read lock (to either allocate the folio, or
> >>>>>> right before acquiring the write_lock), the state can change. In your
> >>>>>> case, yes, you are gathering more up to date information, but is it
> >>>>>> really that important/worth it to retake locks and rescan for each
> >>>>>> instance if we are about to reverify with the write lock taken?
> >>>>>
> >>>>> You said "reverify": You are removing the verification, so this step
> >>>>> won't be reverification, it will be verification. We do not want to
> >>>>> verify *after* we have already done 95% of latency-heavy stuff,
> >>>>> only to
> >>>>> know that we are going to fail.
> >>>>>
> >>>>> Algorithms in the kernel, in general, are of the following form: 1)
> >>>>> Verify if a condition is true, resulting in taking a control path -> 2)
> >>>>> do a lot of stuff -> "no turning back" step, wherein before committing
> >>>>> (by taking locks, say), reverify if this is the control path we should
> >>>>> be in. You are eliminating step 1).
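> >>>>>
> >>>>> In runnable pseudocode, the shape being described (the function
> >>>>> names are stand-ins, not kernel APIs):
> >>>>>
> >>>>> #include <stdbool.h>
> >>>>> #include <stdio.h>
> >>>>>
> >>>>> /* Stubs standing in for real work; only the control flow matters. */
> >>>>> static bool cheap_scan_says_eligible(void) { return true; }
> >>>>> static void do_expensive_work(void) { }
> >>>>> static bool reverify_before_commit(void) { return false; }
> >>>>>
> >>>>> int main(void)
> >>>>> {
> >>>>>         /* step 1: cheap check before taking the costly path */
> >>>>>         if (!cheap_scan_says_eligible())
> >>>>>                 return 0;
> >>>>>
> >>>>>         do_expensive_work(); /* allocation, swapin, heavy locking... */
> >>>>>
> >>>>>         /* "no turning back" point: reverify, then commit or fall back */
> >>>>>         if (!reverify_before_commit()) {
> >>>>>                 puts("fallback");
> >>>>>                 return 0;
> >>>>>         }
> >>>>>         puts("commit");
> >>>>>         return 0;
> >>>>> }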
> >>>>>
> >>>>> Therefore, I will have to say that I disagree with your approach.
> >>>>>
> >>>>> On top of this, in the subjective analysis in [1], point number 7
> >>>>> (along
> >>>>> with point number 1) remains. And, point number 4 remains.
> >>>>
> >>>> for 1) your worst case of 1024 is not the worst case. There are 8
> >>>> possible orders in your implementation; if all are enabled, that is
> >>>> 4096 iterations in the worst case.
> >>>
> >>> Yes, that is exactly what I wrote in 1). I am still not convinced that
> >>> the overhead you produce + 512 iterations is going to beat 4096
> >>> iterations. Anyway, that is hand-waving and we should test this.
> >>>
> >>>> This becomes WAY worse on 64k page size, ~45,000 iterations vs 4096
> >>>> in my case.
> >>>
> >>> Sorry, I am missing something here; how does the number of iterations
> >>> change with page size? Am I not scanning the PTE table, which is
> >>> invariant to the page size?
> >>
> >> I got the calculation wrong the first time and it's actually worse.
> >> Let's hope I got it right this time.
> >> On an ARM64 64k kernel:
> >> PMD size = 512M
> >> PTE = 64k
> >> PTEs per PMD = 8192
> >
> > *facepalm* my bad, thanks. I got thrown off thinking HPAGE_PMD_NR won't
> > depend on page size, but #pte entries = PAGE_SIZE / sizeof(pte) =
> > PAGE_SIZE / 8. So it does depend. You are correct, the PTEs per PMD is 1
> > << 13.
> >
> >> log2(8192) = 13; 13 - 2 = 11 (m)THP sizes including PMD (the
> >> first and second orders are skipped)
> >>
> >> Assuming I understand your algorithm correctly, in the worst case you
> >> are scanning the whole PMD for each order.
> >>
> >> So you scan 8192 PTEs 11 times. 8192 * 11 = 90112.
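> >>
> >> A worked version of the arithmetic in standalone C (a sketch only;
> >> the per-order rescan model is an assumption based on the discussion
> >> above, not measured kernel behavior):
> >>
> >> #include <stdio.h>
> >>
> >> int main(void)
> >> {
> >>         unsigned long page_size = 64 * 1024;        /* 64k base pages */
> >>         unsigned long ptes_per_pmd = page_size / 8; /* 8-byte PTEs -> 8192 */
> >>         unsigned long nr_orders = 11;               /* count derived above */
> >>
> >>         /* per-order rescans: every enabled order walks the whole PMD */
> >>         printf("rescan worst case:  %lu\n", ptes_per_pmd * nr_orders); /* 90112 */
> >>         /* bitmap approach: one PMD walk plus cheap bitmap bookkeeping */
> >>         printf("bitmap worst case: ~%lu\n", ptes_per_pmd);             /* 8192 */
> >>         return 0;
> >> }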
> >
> > Yup. Now it seems that the bitmap overhead may just be worth it; for the
> > worst case the bitmap will give us an 11x saving...for the average case,
> > it will give us 2x, but still, 8192 is a large number. I'll think of
>
> To clarify: the saving is w.r.t. the initial scan. That is, if time
> taken by NP is x + y + collapse_huge_page(), where x is the PMD scan and
> y is the bitmap overhead, then time taken by DJ is 2x +
> collapse_huge_page(). In collapse_huge_page(), both perform PTE scans in
> _isolate(). Anyhow, we differ in opinion as to where the max_ptes_*
> check should be placed; I recalled the following:
>
> https://lore.kernel.org/all/20240809103129.365029-2-dev.jain@xxxxxxx/
> https://lore.kernel.org/all/761ba58e-9d6f-4a14-a513-dcc098c2aa94@xxxxxxxxxx/
>
> One thing you can do to relieve one of my criticisms (not completely) is
> apply the following patch (this can be done in both methods):
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b589f889bb5a..dc5cb602eaad 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1080,8 +1080,14 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>                 }
>
>                 vmf.orig_pte = ptep_get_lockless(pte);
> -               if (!is_swap_pte(vmf.orig_pte))
> +               if (!is_swap_pte(vmf.orig_pte)) {
>                         continue;
> +               } else {
> +                       if (order != HPAGE_PMD_ORDER) {
> +                               result = SCAN_EXCEED_SWAP_PTE;
> +                               goto out;
> +                       }
> +               }
>
>                 vmf.pte = pte;
>                 vmf.ptl = ptl;
> --
>
> But this really is the same thing being done in the links above :)

Yes, my code already does this:
@@ -1035,6 +1039,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
                if (!is_swap_pte(vmf.orig_pte))
                        continue;

+               if (order != HPAGE_PMD_ORDER) {
+                       result = SCAN_EXCEED_SWAP_PTE;
+                       goto out;
+               }
+
No need for the if/else nesting.
>
>
> > ways to test this out.
> >
> > Btw, I was made aware that an LWN article just got posted on our work!
> > https://lwn.net/Articles/1009039/
> >
> >>
> >> Please let me know if I'm missing something here.
> >>>
> >>>>>
> >>>>> [1]
> >>>>> https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b-aba4b1a441b4@xxxxxxx/
> >>>>>
> >>>>>>
> >>>>>> So in my eyes, this is not a "problem"
> >>>>>
> >>>>> Looks like the kernel scheduled us for a high-priority debate, I hope
> >>>>> there's no deadlock :)
> >>>>>
> >>>>>>
> >>>>>> Cheers,
> >>>>>> -- Nico
