Hey Zach,

Thanks for taking the time to review!

On Thu, Jan 18, 2024 at 01:11 Zach O'Keefe <zokeefe@xxxxxxxxxx> wrote:
>
> [+linux-mm & others]
>
> On Tue, Jan 16, 2024 at 9:02 PM Lance Yang <ioworker0@xxxxxxxxx> wrote:
> >
> > This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
> >
> > Introduce a new madvise mode, MADV_TRY_COLLAPSE, that allows users to
> > make a least-effort attempt at a synchronous collapse of memory at
> > their own expense.
> >
> > The only difference from MADV_COLLAPSE is that the new hugepage allocation
> > avoids direct reclaim and/or compaction, quickly failing on allocation errors.
> >
> > The benefits of this approach are:
> >
> > * CPU is charged to the process that wants to spend the cycles for the THP
> > * Avoid unpredictable timing of khugepaged collapse
> > * Prevent unpredictable stalls caused by direct reclaim and/or compaction
> >
> > Semantics
> >
> > This call is independent of the system-wide THP sysfs settings, but will
> > fail for memory marked VM_NOHUGEPAGE. If the ranges provided span
> > multiple VMAs, the semantics of the collapse over each VMA are independent
> > from the others. This implies a hugepage cannot cross a VMA boundary. If
> > collapse of a given hugepage-aligned/sized region fails, the operation may
> > continue to attempt collapsing the remainder of memory specified.
> >
> > The memory ranges provided must be page-aligned, but are not required to
> > be hugepage-aligned. If the memory ranges are not hugepage-aligned, the
> > start/end of the range will be clamped to the first/last hugepage-aligned
> > address covered by said range. The memory ranges must span at least one
> > hugepage-sized region.
> >
> > All non-resident pages covered by the range will first be
> > swapped/faulted-in, before being internally copied onto a freshly
> > allocated hugepage. Unmapped pages will have their data directly
> > initialized to 0 in the new hugepage. However, for every eligible
> > hugepage-aligned/sized region to be collapsed, at least one page must
> > currently be backed by memory (a PMD covering the address range must
> > already exist).
> >
> > Allocation for the new hugepage will not enter direct reclaim and/or
> > compaction, quickly failing if the allocation cannot be satisfied. When
> > the system has multiple NUMA nodes, the hugepage will be allocated from
> > the node providing the most native pages. This operation acts on the
> > current state of the specified process and makes no persistent changes
> > or guarantees on how pages will be mapped, constructed, or faulted in
> > the future.
> >
> > Return Value
> >
> > If all hugepage-sized/aligned regions covered by the provided range were
> > either successfully collapsed, or were already PMD-mapped THPs, this
> > operation will be deemed successful. On success, madvise(2) returns 0.
> > Else, -1 is returned and errno is set to indicate the error for the
> > most-recently attempted hugepage collapse. Note that many failures might
> > have occurred, since the operation may continue to collapse in the event
> > a single hugepage-sized/aligned region fails.
> >
> > ENOMEM  Memory allocation failed or VMA not found
> > EBUSY   Memcg charging failed
> > EAGAIN  Required resource temporarily unavailable. Try again
> >         might succeed.
> > EINVAL  Other error: No PMD found, subpage doesn't have Present
> >         bit set, "Special" page not backed by struct page, VMA
> >         incorrectly sized, address not page-aligned, ...
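To make the call pattern above concrete, here is a minimal userspace sketch
(not part of the patch): it defines MADV_TRY_COLLAPSE locally with the value
26 proposed in this series in case installed headers lack it, and treats the
documented fail-fast errors as a signal to keep using 4K pages for now.

#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_TRY_COLLAPSE
#define MADV_TRY_COLLAPSE 26    /* proposed value from this series */
#endif

static int try_collapse(void *addr, size_t len)
{
        if (madvise(addr, len, MADV_TRY_COLLAPSE) == 0)
                return 0;       /* covered regions are now PMD-mapped THPs */

        switch (errno) {
        case ENOMEM:            /* allocation failed (or VMA not found) */
        case EBUSY:             /* memcg charge failed */
        case EAGAIN:            /* transient; a later retry might succeed */
                return -1;      /* fall back to 4K pages for now */
        default:
                perror("madvise(MADV_TRY_COLLAPSE)");
                return -1;
        }
}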
> >
> > Use Cases
> >
> > An immediate user of this new functionality is the Go runtime heap allocator
> > that manages memory in hugepage-sized chunks. In the past, whether it was a
> > newly allocated chunk through mmap() or a reused chunk released by
> > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> > respectively. However, both approaches resulted in performance issues; for
> > both scenarios, there could be entries into direct reclaim and/or compaction,
> > leading to unpredictable stalls[4]. Now, the allocator can confidently use
> > madvise(MADV_TRY_COLLAPSE) to attempt the allocation of huge pages.
> >
> > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> > [4] https://github.com/golang/go/issues/63334
>
> Thanks for the patch, Lance, and thanks for providing the links above,
> referring to issues Go has seen.
>
> I've reached out to the Go team to try and understand their use case,
> and how we could help. It's not immediately clear whether a
> lighter-weight MADV_COLLAPSE is the answer, but it could turn out to
> be.
>
> That said, with respect to the implementation, should a need for a
> lighter-weight MADV_COLLAPSE be warranted, I'd personally like to see
> process_madvise(2) be the "v2" of madvise(2), where we can start
> leveraging the forward-facing flags argument for these different
> advice flavors. We'd need to safely revert v5.10 commit a68a0262abdaa
> ("mm/madvise: remove racy mm ownership check") so that
> process_madvise(2) can always operate on self. IIRC, this was ~ the
> plan we landed on during MADV_COLLAPSE dev discussions (i.e. pick a
> sane default, and implement options in flags down the line).

I agree with you; it's a good idea!

> That flag could be a MADV_F_COLLAPSE_LIGHT, where we use a lighter
> allocation context, as well as, for example, only do a local
> lru_add_drain() vs lru_add_drain_all(). But I'll refrain from thinking
> too hard about it just yet.

The name MADV_F_COLLAPSE_LIGHT sounds great for the flag, and its
semantics are very clear.

Thanks again for your review and your suggestion!
Lance
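For concreteness, a rough sketch of what that flags-based flavor could look
like from userspace. This is purely illustrative: MADV_F_COLLAPSE_LIGHT and
its value are hypothetical, current kernels require the process_madvise(2)
flags argument to be zero, and self-operation would need the revert discussed
above. It also assumes the headers define SYS_process_madvise.

#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif
#define MADV_F_COLLAPSE_LIGHT (1U << 0)   /* hypothetical flag, not in any kernel */

/* pidfd would refer to the calling process once self-madvise is allowed again */
static long collapse_light(int pidfd, void *addr, size_t len)
{
        struct iovec iov = { .iov_base = addr, .iov_len = len };

        return syscall(SYS_process_madvise, pidfd, &iov, 1UL,
                       MADV_COLLAPSE, MADV_F_COLLAPSE_LIGHT);
}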
>
> Best,
> Zach
>
> >
> > Signed-off-by: Lance Yang <ioworker0@xxxxxxxxx>
> > ---
> >  arch/alpha/include/uapi/asm/mman.h           |  1 +
> >  arch/mips/include/uapi/asm/mman.h            |  1 +
> >  arch/parisc/include/uapi/asm/mman.h          |  1 +
> >  arch/xtensa/include/uapi/asm/mman.h          |  1 +
> >  include/linux/huge_mm.h                      |  5 +++--
> >  include/uapi/asm-generic/mman-common.h       |  1 +
> >  mm/khugepaged.c                              | 19 ++++++++++++++++---
> >  mm/madvise.c                                 |  8 +++++++-
> >  tools/include/uapi/asm-generic/mman-common.h |  1 +
> >  9 files changed, 32 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > index 763929e814e9..44aa1f57a982 100644
> > --- a/arch/alpha/include/uapi/asm/mman.h
> > +++ b/arch/alpha/include/uapi/asm/mman.h
> > @@ -77,6 +77,7 @@
> >  #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
> >
> >  #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> > +#define MADV_TRY_COLLAPSE 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> >  /* compatibility flags */
> >  #define MAP_FILE 0
> > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> > index c6e1fc77c996..1ae16e5d7dfc 100644
> > --- a/arch/mips/include/uapi/asm/mman.h
> > +++ b/arch/mips/include/uapi/asm/mman.h
> > @@ -104,6 +104,7 @@
> >  #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
> >
> >  #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> > +#define MADV_TRY_COLLAPSE 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> >  /* compatibility flags */
> >  #define MAP_FILE 0
> > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> > index 68c44f99bc93..f8d016ee1f98 100644
> > --- a/arch/parisc/include/uapi/asm/mman.h
> > +++ b/arch/parisc/include/uapi/asm/mman.h
> > @@ -71,6 +71,7 @@
> >  #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
> >
> >  #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> > +#define MADV_TRY_COLLAPSE 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> >  #define MADV_HWPOISON 100 /* poison a page for testing */
> >  #define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */
> > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> > index 1ff0c858544f..c495d1b39c83 100644
> > --- a/arch/xtensa/include/uapi/asm/mman.h
> > +++ b/arch/xtensa/include/uapi/asm/mman.h
> > @@ -112,6 +112,7 @@
> >  #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
> >
> >  #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> > +#define MADV_TRY_COLLAPSE 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> >  /* compatibility flags */
> >  #define MAP_FILE 0
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 5adb86af35fc..e1af75aa18fb 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -303,7 +303,7 @@ int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> >                       int advice);
> >  int madvise_collapse(struct vm_area_struct *vma,
> >                       struct vm_area_struct **prev,
> > -                     unsigned long start, unsigned long end);
> > +                     unsigned long start, unsigned long end, bool is_try);
> >  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
> >                             unsigned long end, long adjust_next);
> >  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> > @@ -450,7 +450,8 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
> >
> >  static inline int madvise_collapse(struct vm_area_struct *vma,
> >                                     struct vm_area_struct **prev,
> > -                                   unsigned long start, unsigned long end)
> > +                                   unsigned long start, unsigned long end,
> > +                                   bool is_try)
> >  {
> >          return -EINVAL;
> >  }
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index 6ce1f1ceb432..a9e5273db5f6 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -78,6 +78,7 @@
> >  #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
> >
> >  #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> > +#define MADV_TRY_COLLAPSE 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> >  /* compatibility flags */
> >  #define MAP_FILE 0
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 2b219acb528e..c22703155b6e 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -96,6 +96,7 @@ static struct kmem_cache *mm_slot_cache __ro_after_init;
> >
> >  struct collapse_control {
> >          bool is_khugepaged;
> > +        bool is_try;
> >
> >          /* Num pages scanned per node */
> >          u32 node_load[MAX_NUMNODES];
> > @@ -1058,10 +1059,14 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> >  static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
> >                                struct collapse_control *cc)
> >  {
> > -        gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> > -                     GFP_TRANSHUGE);
> >          int node = hpage_collapse_find_target_node(cc);
> >          struct folio *folio;
> > +        gfp_t gfp;
> > +
> > +        if (cc->is_khugepaged)
> > +                gfp = alloc_hugepage_khugepaged_gfpmask();
> > +        else
> > +                gfp = cc->is_try ? GFP_TRANSHUGE_LIGHT : GFP_TRANSHUGE;
> >
> >          if (!hpage_collapse_alloc_folio(&folio, gfp, node, &cc->alloc_nmask)) {
> >                  *hpage = NULL;
> > @@ -2697,7 +2702,7 @@ static int madvise_collapse_errno(enum scan_result r)
> >  }
> >
> >  int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > -                     unsigned long start, unsigned long end)
> > +                     unsigned long start, unsigned long end, bool is_try)
> >  {
> >          struct collapse_control *cc;
> >          struct mm_struct *mm = vma->vm_mm;
> > @@ -2718,6 +2723,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >          if (!cc)
> >                  return -ENOMEM;
> >          cc->is_khugepaged = false;
> > +        cc->is_try = is_try;
> >
> >          mmgrab(mm);
> >          lru_add_drain_all();
> > @@ -2773,6 +2779,13 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >                          result = collapse_pte_mapped_thp(mm, addr, true);
> >                          mmap_read_unlock(mm);
> >                          goto handle_result;
> > +                /* MADV_TRY_COLLAPSE: fail quickly */
> > +                case SCAN_ALLOC_HUGE_PAGE_FAIL:
> > +                case SCAN_CGROUP_CHARGE_FAIL:
> > +                        if (cc->is_try) {
> > +                                last_fail = result;
> > +                                goto out_maybelock;
> > +                        }
> >                  /* Whitelisted set of results where continuing OK */
> >                  case SCAN_PMD_NULL:
> >                  case SCAN_PTE_NON_PRESENT:
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 912155a94ed5..5a359bcd286c 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
> >          case MADV_POPULATE_READ:
> >          case MADV_POPULATE_WRITE:
> >          case MADV_COLLAPSE:
> > +        case MADV_TRY_COLLAPSE:
> >                  return 0;
> >          default:
> >                  /* be safe, default to 1. list exceptions explicitly */
> > @@ -1082,8 +1083,10 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> >                  if (error)
> >                          goto out;
> >                  break;
> > +        case MADV_TRY_COLLAPSE:
> > +                return madvise_collapse(vma, prev, start, end, true);
> >          case MADV_COLLAPSE:
> > -                return madvise_collapse(vma, prev, start, end);
> > +                return madvise_collapse(vma, prev, start, end, false);
> >          }
> >
> >          anon_name = anon_vma_name(vma);
> > @@ -1178,6 +1181,7 @@ madvise_behavior_valid(int behavior)
> >          case MADV_HUGEPAGE:
> >          case MADV_NOHUGEPAGE:
> >          case MADV_COLLAPSE:
> > +        case MADV_TRY_COLLAPSE:
> >  #endif
> >          case MADV_DONTDUMP:
> >          case MADV_DODUMP:
> > @@ -1368,6 +1372,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> >   *  transparent huge pages so the existing pages will not be
> >   *  coalesced into THP and new pages will not be allocated as THP.
> >   *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> > + *  MADV_TRY_COLLAPSE - similar to COLLAPSE, but avoids direct reclaim
> > + *  and/or compaction.
> >   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
> >   *  from being included in its core dump.
> >   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
> > index 6ce1f1ceb432..a9e5273db5f6 100644
> > --- a/tools/include/uapi/asm-generic/mman-common.h
> > +++ b/tools/include/uapi/asm-generic/mman-common.h
> > @@ -78,6 +78,7 @@
> >  #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
> >
> >  #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> > +#define MADV_TRY_COLLAPSE 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> >  /* compatibility flags */
> >  #define MAP_FILE 0
> > --
> > 2.33.1
> >
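As background on why the is_try path in alloc_charge_hpage() cannot stall:
the two GFP masks it chooses between differ only in direct reclaim. Roughly,
from include/linux/gfp_types.h in current mainline:

/* Paraphrased; the only delta between the two masks is __GFP_DIRECT_RECLAIM. */
#define GFP_TRANSHUGE_LIGHT     ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
                                  __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
#define GFP_TRANSHUGE           (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)

Without __GFP_DIRECT_RECLAIM, the page allocator returns NULL instead of
entering direct reclaim or compaction, which is what lets MADV_TRY_COLLAPSE
fail quickly with SCAN_ALLOC_HUGE_PAGE_FAIL.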