On 17/04/2023 09:19, Yin, Fengwei wrote:
>
>
> On 4/14/2023 9:02 PM, Ryan Roberts wrote:
>> Hi All,
>>
>> This is a second RFC and my first proper attempt at implementing variable
>> order, large folios for anonymous memory. The first RFC [1] was a partial
>> implementation and a plea for help in debugging an issue I was hitting;
>> thanks to Yin Fengwei and Matthew Wilcox for their advice in solving that!
>>
>> The objective of variable order anonymous folios is to improve performance
>> by allocating larger chunks of memory during anonymous page faults:
>>
>> - Since SW (the kernel) is dealing with larger chunks of memory than base
>>   pages, there are efficiency savings to be had; fewer page faults, batched
>>   PTE and RMAP manipulation, fewer items on lists, etc. In short, we reduce
>>   kernel overhead. This should benefit all architectures.
>> - Since we are now mapping physically contiguous chunks of memory, we can
>>   take advantage of HW TLB compression techniques. A reduction in TLB
>>   pressure speeds up kernel and user space. arm64 systems have 2 mechanisms
>>   to coalesce TLB entries: "the contiguous bit" (architectural) and HPA
>>   (uarch) - see [2].
>>
>> This patch set deals with the SW side of things only, but sets us up nicely
>> for taking advantage of the HW improvements in the near future.
>>
>> I'm not yet benchmarking a wide variety of use cases, but those that I have
>> looked at are positive; I see kernel compilation time improved by up to
>> 10%, which I expect to improve further once I add in the arm64 "contiguous
>> bit". Memory consumption is somewhere between 1% less and 2% more,
>> depending on how it's measured. More on perf and memory below.
>>
>> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one
>> minor conflict resolution). I have a tree at [4].
>>
>> [1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@xxxxxxx/
>> [2] https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@xxxxxxx/
>> [3] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@xxxxxxxxxxxxx/
>> [4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2
>>
>> Approach
>> ========
>>
>> There are 4 fault paths that have been modified:
>> - write fault on unallocated address: do_anonymous_page()
>> - write fault on zero page: wp_page_copy()
>> - write fault on non-exclusive CoW page: wp_page_copy()
>> - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
>>
>> In the first 2 cases, we will determine the preferred order folio to
>> allocate, limited by a max order (currently order-4; see below), VMA and
>> PMD bounds, and the state of neighboring PTEs. In the 3rd case, we aim to
>> allocate the same order folio as the source, subject to constraints that
>> may arise if the source has been mremapped or partially munmapped. And in
>> the 4th case, we reuse as much of the folio as we can, subject to the same
>> mremap/munmap constraints.
>>
>> If allocation of our preferred folio order fails, we gracefully fall back
>> to lower orders all the way to 0.
>>
>> Note that none of this affects the behavior of traditional PMD-sized THP.
>> If we take a fault in an MADV_HUGEPAGE region, you still get PMD-sized
>> mappings.
>>
>> Open Questions
>> ==============
>>
>> How to Move Forwards
>> --------------------
>>
>> While the series is a small-ish code change, it represents a big shift in
>> the way things are done.
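To make the "gracefully fall back" behaviour described under Approach above a
bit more concrete: it amounts to roughly the loop below. This is a simplified
sketch only - the helper name is made up and the exact GFP flags are
illustrative; in the series itself this is (I believe) what
try_vma_alloc_movable_folio() in patch 3 boils down to.

static struct folio *try_alloc_anon_folio(struct vm_area_struct *vma,
                                          unsigned long addr, int order)
{
        struct folio *folio;

        /* Try the preferred order first, falling back one order at a time. */
        for (; order > 0; order--) {
                /* Don't stall in reclaim/compaction for the large orders. */
                folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_NORETRY,
                                        order, vma,
                                        ALIGN_DOWN(addr, PAGE_SIZE << order),
                                        false);
                if (folio)
                        return folio;
        }

        /* Order-0 is the existing behaviour; allow reclaim as usual here. */
        return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr, false);
}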
>> So I'd appreciate any help in scaling up performance testing, review and
>> general advice on how best to guide a change like this into the kernel.
>>
>> Folio Allocation Order Policy
>> -----------------------------
>>
>> The current code is hardcoded to use a maximum order of 4. This was chosen
>> for a few reasons:
>> - From the SW performance perspective, I see a knee around here where
>>   increasing it doesn't lead to much more performance gain.
>> - Intuitively I assume that higher orders become increasingly difficult to
>>   allocate.
>> - From the HW performance perspective, arm64's HPA works on order-2 blocks
>>   and "the contiguous bit" works on order-4 for 4KB base pages (although
>>   it's order-7 for 16KB and order-5 for 64KB), so there is no HW benefit to
>>   going any higher.
>>
>> I suggest that ultimately setting the max order should be left to the
>> architecture. arm64 would take advantage of this and set it to the order
>> required for the contiguous bit for the configured base page size.
>>
>> However, I also have a (mild) concern about increased memory consumption.
>> If an app has a pathological fault pattern (e.g. sparsely touches memory
>> every 64KB) we would end up allocating 16x as much memory as we used to.
>> One potential approach I see here is to track fault addresses per-VMA, and
>> increase a per-VMA max allocation order for consecutive faults that extend
>> a contiguous range, and decrement when discontiguous.
>> Alternatively/additionally, we could use the VMA size as an indicator. I'd
>> be interested in your thoughts/opinions.
>>
>> Deferred Split Queue Lock Contention
>> ------------------------------------
>>
>> The results below show that we are spending a much greater proportion of
>> time in the kernel when doing a kernel compile using 160 CPUs vs 8 CPUs.
>>
>> I think this is (at least partially) related to contention on the deferred
>> split queue lock. This is a per-memcg spinlock, which means a single
>> spinlock shared among all 160 CPUs. I've solved part of the problem with
>> the last patch in the series (which cuts down the need to take the lock),
>> but at folio free time (free_transhuge_page()), the lock is still taken and
>> I think this could be a problem. Now that most anonymous pages are large
>> folios, this lock is taken a lot more.
>>
>> I think we could probably avoid taking the lock unless !list_empty(), but I
>> haven't convinced myself it's definitely safe, so haven't applied it yet.
>>
>> Roadmap
>> =======
>>
>> Beyond scaling up perf testing, I'm planning to enable use of the
>> "contiguous bit" on arm64 to validate predictions about HW speedups.
>>
>> I also think there are some opportunities with madvise to split folios to
>> non-0 orders, which might improve performance in some cases. madvise is
>> also mistaking exclusive large folios for non-exclusive ones at the moment
>> (due to the "small pages" mapcount scheme), so that needs to be fixed so
>> that MADV_FREE correctly frees the folio.
>>
>> Results
>> =======
>>
>> Performance
>> -----------
>>
>> Test: Kernel Compilation, on Ampere Altra (160 CPU machine), with 8 jobs
>> and with 160 jobs. First run discarded, next 3 runs averaged. Git repo
>> cleaned before each run.
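A quick interjection on the deferred split point above, since it is easier to
discuss with code in front of us: the unlocked !list_empty() check I'm
describing for free_transhuge_page() would look roughly like the below.
Sketch only, not applied, written from memory of the current code; the open
question is whether the re-check under the lock is really sufficient against
a concurrent add to the deferred list.

void free_transhuge_page(struct page *page)
{
        struct folio *folio = page_folio(page);
        struct deferred_split *ds_queue = get_deferred_split_queue(folio);
        unsigned long flags;

        /*
         * Common case for large anon folios: never deferred-split, so skip
         * the per-memcg split_queue_lock entirely.
         */
        if (!list_empty(&folio->_deferred_list)) {
                spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
                if (!list_empty(&folio->_deferred_list)) {
                        ds_queue->split_queue_len--;
                        list_del(&folio->_deferred_list);
                }
                spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
        }

        free_compound_page(page);
}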
>>
>>   make defconfig && time make -jN Image
>>
>> First with -j8:
>>
>> |           | baseline time  | anonfolio time | percent change |
>> |           | to compile (s) | to compile (s) | SMALLER=better |
>> |-----------|---------------:|---------------:|---------------:|
>> | real-time |          373.0 |          342.8 |          -8.1% |
>> | user-time |         2333.9 |         2275.3 |          -2.5% |
>> | sys-time  |          510.7 |          340.9 |         -33.3% |
>>
>> The above shows an 8.1% improvement in real time execution, and a 33.3%
>> saving in kernel execution. The next 2 tables show a breakdown of the
>> cycles spent in the kernel for the 8 job config:
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (cycles) | (cycles)  | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | data abort           |     683B |      316B |         -53.8% |
>> | instruction abort    |      93B |       76B |         -18.4% |
>> | syscall              |     887B |      767B |         -13.6% |
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (cycles) | (cycles)  | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | arm64_sys_openat     |     194B |      188B |          -3.3% |
>> | arm64_sys_exit_group |     192B |      124B |         -35.7% |
>> | arm64_sys_read       |     124B |      108B |         -12.7% |
>> | arm64_sys_execve     |      75B |       67B |         -11.0% |
>> | arm64_sys_mmap       |      51B |       50B |          -3.0% |
>> | arm64_sys_mprotect   |      15B |       13B |         -12.0% |
>> | arm64_sys_write      |      43B |       42B |          -2.9% |
>> | arm64_sys_munmap     |      15B |       12B |         -17.0% |
>> | arm64_sys_newfstatat |      46B |       41B |          -9.7% |
>> | arm64_sys_clone      |      26B |       24B |         -10.0% |
>>
>> And now with -j160:
>>
>> |           | baseline time  | anonfolio time | percent change |
>> |           | to compile (s) | to compile (s) | SMALLER=better |
>> |-----------|---------------:|---------------:|---------------:|
>> | real-time |           53.7 |           48.2 |         -10.2% |
>> | user-time |         2705.8 |         2842.1 |           5.0% |
>> | sys-time  |         1370.4 |         1064.3 |         -22.3% |
>>
>> The above shows a 10.2% improvement in real time execution. But ~3x more
>> time is spent in the kernel than for the -j8 config. I think this is
>> related to the lock contention issue I highlighted above, but haven't
>> bottomed it out yet. It's also not yet clear to me why user-time increases
>> by 5%.
>>
>> I've also run all the will-it-scale microbenchmarks for a single task,
>> using the process mode. Results for multiple runs on the same kernel are
>> noisy - I see ~5% fluctuation. So I'm just calling out tests with results
>> that show a >5% improvement or a <-5% regression. Results are the average
>> of 3 runs. Only 2 tests are regressed:
>>
>> | benchmark            | baseline | anonfolio | percent change |
>> |                      | ops/s    | ops/s     | BIGGER=better  |
>> |----------------------|---------:|----------:|---------------:|
>> | context_switch1.csv  |   328744 |    351150 |           6.8% |
>> | malloc1.csv          |    96214 |     50890 |         -47.1% |
>> | mmap1.csv            |   410253 |    375746 |          -8.4% |
>> | page_fault1.csv      |   624061 |   3185678 |         410.5% |
>> | page_fault2.csv      |   416483 |    557448 |          33.8% |
>> | page_fault3.csv      |   724566 |   1152726 |          59.1% |
>> | read1.csv            |  1806908 |   1905752 |           5.5% |
>> | read2.csv            |   587722 |   1942062 |         230.4% |
>> | tlb_flush1.csv       |   143910 |    152097 |           5.7% |
>> | tlb_flush2.csv       |   266763 |    322320 |          20.8% |
>>
>> I believe malloc1 is an unrealistic test, since it does malloc/free for a
>> 128M object in a loop and never touches the allocated memory. I think the
>> malloc implementation is maintaining a header just before the allocated
>> object, which causes a single page fault. Previously that page fault
>> allocated 1 page. Now it is allocating 16 pages. This cost would be repaid
>> if the test code wrote to the allocated object. Alternatively, the folio
>> allocation order policy described above would also solve this.
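For anyone not familiar with the testcase, my understanding is that the core
of will-it-scale's malloc1 is essentially the below (paraphrased from memory;
see the will-it-scale sources for the real thing), which is why the only
memory that ever gets faulted in is the allocator's own bookkeeping just
before the object:

#include <stdlib.h>

#define OBJ_SIZE        (128UL * 1024 * 1024)   /* the 128M object */

/* Allocate and immediately free, without ever writing to the object. */
void testcase(unsigned long *iterations)
{
        while (1) {
                void *obj = malloc(OBJ_SIZE);

                free(obj);
                (*iterations)++;
        }
}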
>>
>> It is not clear to me why mmap1 has slowed down. This remains a todo.
>>
>> Memory
>> ------
>>
>> I measured memory consumption while doing a kernel compile with 8 jobs on
>> a system limited to 4GB RAM. I polled /proc/meminfo every 0.5 seconds
>> during the workload, then calculated "memory used" high and low watermarks
>> using both MemFree and MemAvailable. If there is a better way of measuring
>> system memory consumption, please let me know!
>>
>> mem-used = 4GB - /proc/meminfo:MemFree
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (MB)     | (MB)      | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | mem-used-low         |      825 |       842 |           2.1% |
>> | mem-used-high        |     2697 |      2672 |          -0.9% |
>>
>> mem-used = 4GB - /proc/meminfo:MemAvailable
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (MB)     | (MB)      | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | mem-used-low         |      518 |       530 |           2.3% |
>> | mem-used-high        |     1522 |      1537 |           1.0% |
>>
>> For the high watermark, the methods disagree; we are either saving 1% or
>> using 1% more. For the low watermark, both methods agree that we are using
>> about 2% more. I plan to investigate whether the proposed folio allocation
>> order policy can reduce this to zero.
>
> Besides the memory consumption, the large folio could impact the tail
> latency of page allocation also (extra zeroing memory, more operations on
> the slow path of page allocation).

I agree; this series could be thought of as trading latency for throughput.
There are a couple of mitigations to try to limit the extra latency; on the
Altra at least, we are taking advantage of the CPU's streaming mode for the
memory zeroing - it's >2x faster to zero a block of 64K than it is to zero
16x 4K. And currently I'm using __GFP_NORETRY for high order folio
allocations, which I understood to mean we shouldn't suffer high allocation
latency due to reclaim, etc. Although, I know we are discussing whether this
is correct in the thread for patch 3.

> Again, my understanding is the tail latency of page allocation is more
> important to anonymous page than page cache page because anonymous page is
> allocated/freed more frequently.

I had assumed that any serious application that needs to guarantee no latency
due to page faults would preallocate and lock all memory during init (roughly
the pattern sketched at the end of this mail)? Perhaps that's wishful
thinking?

>
>
> Regards
> Yin, Fengwei
>
>>
>> Thanks for making it this far!
>> Ryan
>>
>>
>> Ryan Roberts (17):
>>   mm: Expose clear_huge_page() unconditionally
>>   mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
>>   mm: Introduce try_vma_alloc_movable_folio()
>>   mm: Implement folio_add_new_anon_rmap_range()
>>   mm: Routines to determine max anon folio allocation order
>>   mm: Allocate large folios for anonymous memory
>>   mm: Allow deferred splitting of arbitrary large anon folios
>>   mm: Implement folio_move_anon_rmap_range()
>>   mm: Update wp_page_reuse() to operate on range of pages
>>   mm: Reuse large folios for anonymous memory
>>   mm: Split __wp_page_copy_user() into 2 variants
>>   mm: ptep_clear_flush_range_notify() macro for batch operation
>>   mm: Implement folio_remove_rmap_range()
>>   mm: Copy large folios for anonymous memory
>>   mm: Convert zero page to large folios on write
>>   mm: mmap: Align unhinted maps to highest anon folio order
>>   mm: Batch-zap large anonymous folio PTE mappings
>>
>>  arch/alpha/include/asm/page.h   |   5 +-
>>  arch/arm64/include/asm/page.h   |   3 +-
>>  arch/arm64/mm/fault.c           |   7 +-
>>  arch/ia64/include/asm/page.h    |   5 +-
>>  arch/m68k/include/asm/page_no.h |   7 +-
>>  arch/s390/include/asm/page.h    |   5 +-
>>  arch/x86/include/asm/page.h     |   5 +-
>>  include/linux/highmem.h         |  23 +-
>>  include/linux/mm.h              |   8 +-
>>  include/linux/mmu_notifier.h    |  31 ++
>>  include/linux/rmap.h            |   6 +
>>  mm/memory.c                     | 877 ++++++++++++++++++++++++++++----
>>  mm/mmap.c                       |   4 +-
>>  mm/rmap.c                       | 147 +++++-
>>  14 files changed, 1000 insertions(+), 133 deletions(-)
>>
>> --
>> 2.25.1
>>
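For completeness, the init-time pattern I was referring to above (a
latency-sensitive application preallocating and locking its memory up front,
so no anon faults - and hence no folio allocation latency - occur on the hot
path) is roughly the below. Illustrative only; the pool name and sizes are
made up.

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

static char *pool;

int init_memory(size_t pool_size)
{
        /* Pin current and future mappings so they are never paged out. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE))
                return -1;

        pool = malloc(pool_size);
        if (!pool)
                return -1;

        /* Touch every page so it is populated before real-time work starts. */
        memset(pool, 0, pool_size);
        return 0;
}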