On 16/01/2025 20:53, Nico Pache wrote: > On Thu, Jan 16, 2025 at 2:47 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote: >> >> Hi Nico, > Hi Ryan! >> >> On 08/01/2025 23:31, Nico Pache wrote: >>> The following series provides khugepaged and madvise collapse with the >>> capability to collapse regions to mTHPs. >> >> It's great to see multiple solutions for this feature being posted; I guess that >> leaves us with the luxurious problem of figuring out an uber-patchset that >> incorporates the best of both? :) > I guess so! My motivation for developing this was inspired by my > 'defer' RFC. Which can't really live without khugepaged having mTHP > support (ie having 32k mTHP= always and global=defer doesnt make > sense). I'm not sure why that wouldn't make sense? setting global=defer would only be picked up for a given size that sets "inherit". So "32k=always, 2m=inherit, global=defer" is the same as "32k=always, 2m=defer"; which means you would try to allocate 32K directly in the fault handler and defer collapse to 2m to khugepaged. I guess where it would get difficult is if you set a size less than PMD-size to defer; at the moment khugepaged can't actually do that; it would just end up collapsing to 2M? Anyway, I'm rambling... I get your point. >> >> I haven't had a chance to review your series in detail yet, but have a few >> questions below that will help me understand the key differences between your >> series and Dev's. >> >>> >>> To achieve this we generalize the khugepaged functions to no longer depend >>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages >>> (defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked >>> using a bitmap. After the PMD scan is done, we do binary recursion on the >>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction >>> on max_ptes_none is removed during the scan, to make sure we account for >>> the whole PMD range. max_ptes_none is mapped to a 0-100 range to >>> determine how full a mTHP order needs to be before collapsing it. >>> >>> Some design choices to note: >>> - bitmap structures are allocated dynamically because on some arch's >>> (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at >>> compile time leading to warnings. >> >> We have MAX_PTRS_PER_PTE and friends though, which are worst case and compile >> time. Could these help avoid the dynamic allocation? >> >> MAX_PMD_ORDER = ilog2(MAX_PTRS_PER_PTE * PAGE_SIZE) > is the MAX_PMD_ORDER = PMD_ORDER? if not this might introduce weird > edge cases where PMD_ORDER < MAX_PMD_ORDER. No, MAX_PMD_ORDER becomes the largest order that could be configured at boot. PMD_ORDER is what is actually configured at boot. My understanding was that you were dynamically allocating your bitmap based on the runtime value of PMD_ORDER? I was just suggesting that you could allocate it statically (on stack or whatever) based on MAX_PMD_ORDER, for the worst-case requirement and only actually use the portion required by the runtime PMD_ORDER value. It avoids the kmalloc call. > >> >> Althogh to be honest, it's not super clear to me what the benefit of the bitmap >> is vs just iterating through the PTEs like Dev does; is there a significant cost >> saving in practice? On the face of it, it seems like it might be uneeded complexity. > The bitmap was to encode the state of PMD without needing rescanning > (or refactor a lot of code). We keep the scan runtime constant at 512 > (for x86). Dev did some good analysis for this here > https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@xxxxxxx/ > This prevents needing to hold the read lock for longer, and prevents > needing to reacquire it too. >>>>> - The recursion is masked through a stack structure. >>> - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was >>> 64bit on x86. This provides some optimization on the bitmap operations. >>> if other arches/configs that have larger than 512 PTEs per PMD want to >>> compress their bitmap further we can change this value per arch. >> >> So 1 bit in the bitmap represents 8 pages? And it will only be set if all 8 >> pages are !pte_none()? I'm wondering what will happen if you have a pattern of 4 >> set PTEs followed by 4 none PTEs, followed by 4 set PTEs... If 16K mTHP is >> enabled, you would want to collapse every other 16K block in this case, but I'm >> guessing with your scheme, all the bits will be clear and no collapse will >> occur? But for arm64 at least, collapsing to order-2 (16K) may be desired for HPA. > > Yeah on my V2 ive incorporated a threshold (like max_ptes_none) for > setting the bit. This will covert this case better (given a better > default max_ptes_none). > The way i see it 511 max_ptes_none is just wrong... You mean it's a bad default? > we should flip it > towards the lower end of the scale (ie 64), and the "always" THP > setting should ignore it (like madvise does). But user space can already get that behaviour by modifying the tunable, right? Isn't that just a user space policy choice? One other thing that occurs to me regarding the bitmap; In the context of Dev's series, we have discussed policy for what to do when the source PTEs are backed by a large folio already. I'm guessing if you are making your smaller-than-PMD-size collapse decisions based solely on the bitmap, you won't be able to see when the PTEs are already collpsed for the target order? i.e. let's say you already have a 64K folio fully mapped in an aligned way. You wouldn't want to "re-collapse" it to 64K. Are you robust to this? > >> >>> >>> Patch 1-2: Some refactoring to combine madvise_collapse and khugepaged >>> Patch 3: A minor "fix"/optimization >>> Patch 4: Refactor/rename hpage_collapse >>> Patch 5-7: Generalize khugepaged functions for arbitrary orders >>> Patch 8-11: The mTHP patches >>> >>> This series acts as an alternative to Dev Jain's approach [1]. The two >>> series differ in a few ways: >>> - My approach uses a bitmap to store the state of the linear scan_pmd to >>> then determine potential mTHP batches. Devs incorporates his directly >>> into the scan, and will try each available order. >> >> So if I'm understanding, the benefit of the bitmap is to remove the need to >> re-scan the "low" PTEs when moving to a lower order, which is what Dev's >> approach does? Are there not some locking/consistency issues to manage if not >> re-scanning? > Correct, so far i haven't found any issues (other than the bugs Dev > reported in his review)-- my fixed version of this RFC has been > running fine with no notable locking issues. >> >>> - Dev is attempting to optimize the locking, while my approach keeps the >>> locking changes to a minimum. I believe his changes are not safe for >>> uffd. >> >> I agree; let's keep the locking simple for the initial effort. >> >>> - Dev's changes only work for khugepaged not madvise_collapse (although >>> i think that was by choice and it could easily support madvise) >> >> I agree supporting MADV_COLLAPSE is good; what exactly are the semantics for it >> though? I think it ignores the sysfs settings (max_ptes_none and friends) so >> presumably it will continue to be much more greedy about collapsing to the >> highest possible order and only fall back to lower orders if the VMA boundaries >> force it to or if the higher order allocation fails? > Kind of, because I removed the max_ptes_none check during the scan, > and reintroduced it in the bitmap scan (without a madvise > restriction), MADV_COLLAPSE and khugepaged will work more similarly. >> >>> - Dev scales all khugepaged sysfs tunables by order, while im removing >>> the restriction of max_ptes_none and converting it to a scale to >>> determine a (m)THP threshold. >> >> I don't really understand this statement. You say you are removing the >> restriction of max_ptes_none. But then you say you scale it to determine a >> threshold. So are you honoring it or not? And if you're honouring it, how is >> your scaling method different to Dev's? What about the other tunables (shared >> and swap)? > I removed the max_ptes_none restriction during the initial scan, so we > can account for the full PMD (which is what happens with > max_ptes_none=511 anyways). Then max_ptes_none can be used with the > bitmap to calculate a threshold (max_ptes_none=64 == ~90% full) for > finding the optimal mTHP size. > > This RFC scales max_ptes_none to 0-100, but it has some really bad > rounding issues, so instead ive incorporated scaling (via bitshifting) > like Dev did in his series. Ive tested this and it's more accurate > now. >> >>> - Dev turns on khugepaged if any order is available while mine still >>> only runs if PMDs are enabled. I like Dev's approach and will most >>> likely do the same in my PATCH posting. >> >> Agreed. Also, we will want khugepaged to be able to scan VMAs (or parts of VMAs) >> that cover only a partial PMD entry. I think neither of your implementations >> currently do that. As I understand it, Dev's v2 will add that support. Is your >> approach ammeanable to this? > > Yes, I believe so. I'm working on adding this too. > >> >>> - mTHPs need their ref count updated to 1<<order, which Dev is missing. >>> >>> Patch 11 was inspired by one of Dev's changes. >> >> I think the 1 problem that emerged during review of Dev's series, which we don't >> have a proper solution to yet, is the issue of "creep", where regions can be >> collapsed to progressively higher orders through iterative scans. At each >> collapse, the required thresholds (e.g. max_ptes_none) are met, and the collapse >> effectively adds more non-none ptes so the next scan will then collapse to even >> higher order. Does your solution suffer from this (theoretical/edge case) issue? >> If not, how did you solve? > > Yes sadly it suffers from the same issue. bringing max_ptes_none much > lower as a default would "help". > I liked Zi Yan's solution of a per-VMA bit that gets set when > khugepaged collapses, and unset when the VMA changes (pf, realloc, > etc). > Then khugepaged can only operate on VMAs that dont have the bit set. > This way we only collapse once, unless the mapping was changed. Dev raised the issue in discussion against his series, that currently khugepaged doesn't scan the entire VMA, it scans to the first PMD that it collapses then moves to another VMA. I guess that's a fairness thing. So a VMA flag won't quite do the trick assuming we want to continue with that behavior. Perhaps we could keep a "cursor" in the VMA though, which describes the starting address of the next scan. We can move it forwards as we scan. And move it backwards when taking a fault. Still not perfect, but perhaps good enough? > > Could we map the new "non-none" pages to the zero page (rather than > actually zeroing the page), so they dont actually act as new "utilized > pages" and are still counted as none pages during the scan (until they > are written to)? I think you are propsing to use the zero page as a PTE marker to say "this region is scheduled for collapse"? In which case, why not just use a PTE marker... But you still have to do the collapse at some point (which I guess you are now deferring to the next page fault that hits one of those markers)? Once you have collapsed, you're still back to the original issue. So I don't think it's bought you anything except complexity and more latency :) Thanks, Ryan > >> >> Thanks, >> Ryan > > Cheers! > -- Nico > >> >> >>> >>> [1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@xxxxxxx/ >>> >>> Nico Pache (11): >>> introduce khugepaged_collapse_single_pmd to collapse a single pmd >>> khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot >>> khugepaged: Don't allocate khugepaged mm_slot early >>> khugepaged: rename hpage_collapse_* to khugepaged_* >>> khugepaged: generalize hugepage_vma_revalidate for mTHP support >>> khugepaged: generalize alloc_charge_folio for mTHP support >>> khugepaged: generalize __collapse_huge_page_* for mTHP support >>> khugepaged: introduce khugepaged_scan_bitmap for mTHP support >>> khugepaged: add mTHP support >>> khugepaged: remove max_ptes_none restriction on the pmd scan >>> khugepaged: skip collapsing mTHP to smaller orders >>> >>> include/linux/khugepaged.h | 4 +- >>> mm/huge_memory.c | 3 +- >>> mm/khugepaged.c | 436 +++++++++++++++++++++++++------------ >>> 3 files changed, 306 insertions(+), 137 deletions(-) >>> >> >