Mateusz Guzik <mjguzik@xxxxxxxxx> writes:

> On Wed, Aug 30, 2023 at 11:49:49AM -0700, Ankur Arora wrote:
>> This series adds a multi-page clearing primitive, clear_pages(),
>> which enables more effective use of x86 string instructions by
>> advertising the real region-size to be cleared.
>>
>> Region-size can be used as a hint by uarchs to optimize the
>> clearing.
>>
>> Also add allow_resched() which marks a code-section as allowing
>> rescheduling in the irqentry_exit path. This allows clear_pages()
>> to get by without having to call cond_resched() periodically.
>> (preempt_model_full() already handles this via
>> irqentry_exit_cond_resched(), so we handle this similarly for
>> preempt_model_none() and preempt_model_voluntary().)
>>
>> Performance
>> ==
>>
>> With this, demand-fault performance gets a decent increase:
>>
>>  *Milan*      mm/clear_huge_page   x86/clear_huge_page   change
>>                     (GB/s)               (GB/s)
>>
>>  pg-sz=2MB          14.55                19.29            +32.5%
>>  pg-sz=1GB          19.34                49.60           +156.4%
>>
>> Milan (and some other AMD Zen uarchs tested) take advantage of the
>> hint to elide cacheline allocation for pg-sz=1GB. The cut-off for
>> this optimization seems to be at around region-size > LLC-size so
>> the pg-sz=2MB load still allocates cachelines.
>>
>
> Have you benchmarked clzero? It is an AMD-specific instruction issuing
> non-temporal stores. It is definitely something to try out for 1G pages.

Thanks for the suggestion. Been a little while, but see the numbers here:
https://lore.kernel.org/linux-mm/20220606203725.1313715-15-ankur.a.arora@xxxxxxxxxx/

> One would think rep stosq has to be at least not worse since the CPU is
> explicitly told what to do and is free to optimize it however it sees
> fit, but the rep prefix has a long history of underperforming.

I agree that historically REP variants have been all over the place. But,
if you look at the numbers, REP; STOS and CLZERO are pretty close, at
least for the current generation of AMD uarchs.

Now, current uarch performance is no guarantee for future uarchs, but if
the kernel uses REP; STOS in performance paths, then hopefully it will
also show up in internal CPU regression benchmarks, which might mean that
the high performance persists.

That said, I think using CLZERO/MOVNT is a good idea -- though as a
fallback option, or where it is better to send an explicit hint while,
say, clearing a 2MB region.

Thanks
Ankur

> I'm not saying it is going to be better, but that this should be tested,
> albeit one can easily argue this can be done at a later date.
>
>
> I would do it myself but my access to AMD CPUs is limited.
>
>> *Icelakex*   mm/clear_huge_page   x86/clear_huge_page   change
>>                     (GB/s)               (GB/s)
>>
>>  pg-sz=2MB           9.19                12.94            +40.8%
>>  pg-sz=1GB           9.36                12.97            +38.5%
>>
>> Icelakex sees a decent improvement in performance, but for both
>> region-sizes it continues to allocate cachelines.
>>
>>
>> Negative: there is a downside to clearing in larger chunks: the
>> current approach clears page-at-a-time, narrowing towards
>> the faulting subpage. This has better cache characteristics for
>> some sequential access workloads where subpages near the faulting
>> page have a greater likelihood of access.
>>
>> I'm not sure if there are real cases which care about this workload,
>> but one example is the vm-scalability/case-anon-w-seq-hugetlb test.
>> This test starts a process for each online CPU, with each process
>> writing sequentially to its set of hugepages.
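(As an aside, for the "narrowing towards the faulting subpage" behaviour
mentioned above: below is a simplified sketch of the idea. The helper name
and loop shape are illustrative only; the real ordering lives in
clear_huge_page()/process_huge_page() in mm/memory.c.

/*
 * Illustrative sketch (not the mm/memory.c code): clear subpages from
 * the ends of the huge page towards the faulting subpage, so that the
 * subpage the user actually touched is cleared last and is still
 * cache-hot when the fault returns.
 */
static void clear_huge_page_narrowing(void *addr, unsigned long fault_idx,
				      unsigned long nr_subpages)
{
	unsigned long l = 0, r = nr_subpages - 1;

	while (l < fault_idx || r > fault_idx) {
		if (r > fault_idx)
			clear_page(addr + r-- * PAGE_SIZE);
		if (l < fault_idx)
			clear_page(addr + l++ * PAGE_SIZE);
		cond_resched();
	}
	/* faulting subpage last */
	clear_page(addr + fault_idx * PAGE_SIZE);
}

The straight clear_pages() path trades that locality for one long
REP; STOS over the whole extent, which is where the utime regression
below comes from.)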
>>
>> The bottleneck here is the memory pipe, and so the improvement in
>> stime is limited; and because the clearing is less cache-optimal
>> now, utime suffers from worse user cache misses.
>>
>>  *Icelakex*                  mm/clear_huge_page   x86/clear_huge_page   change
>>  (tasks=128, mem=4GB/task)
>>
>>  stime                         286.8 +- 3.6%        243.9 +- 4.1%       -14.9%
>>  utime                         497.7 +- 4.1%        553.5 +- 2.0%       +11.2%
>>  wall-clock                      6.9 +- 2.8%          7.0 +- 1.4%       + 1.4%
>>
>>
>>  *Milan*                     mm/clear_huge_page   x86/clear_huge_page   change
>>  (mem=1GB/task, tasks=512)
>>
>>  stime                         501.3 +- 1.4%        498.0 +- 0.9%        -0.5%
>>  utime                         298.7 +- 1.1%        335.0 +- 2.2%       +12.1%
>>  wall-clock                      3.5 +- 2.8%          3.8 +- 2.6%        +8.5%
>>
>> The same test performs better if we have a smaller number of processes,
>> since there is more backend BW available, and thus the improved stime
>> compensates for the worse utime.
>>
>> This could be improved by using more circuitous chunking (somewhat
>> like this:
>> https://lore.kernel.org/lkml/20220606203725.1313715-1-ankur.a.arora@xxxxxxxxxx/).
>> But I'm not sure if it is worth doing. Opinions?
>>
>> Patches
>> ==
>>
>> Patches 1, 2, 3:
>>  "mm/clear_huge_page: allow arch override for clear_huge_page()",
>>  "mm/huge_page: separate clear_huge_page() and copy_huge_page()",
>>  "mm/huge_page: cleanup clear_/copy_subpage()"
>> are minor. The first one allows clear_huge_page() to have an
>> arch-specific version and the other two are mechanical cleanup
>> patches.
>>
>> Patches 4, 5, 6:
>>  "x86/clear_page: extend clear_page*() for multi-page clearing",
>>  "x86/clear_page: add clear_pages()",
>>  "x86/clear_huge_page: multi-page clearing"
>> define the x86-specific clear_pages() and clear_huge_pages().
>>
>> Patches 7, 8:
>>  "sched: define TIF_ALLOW_RESCHED",
>>  "irqentry: define irqentry_exit_allow_resched()"
>> define allow_resched() to demarcate preemptible sections.
>>
>> This gets used in patch 9:
>>  "x86/clear_huge_page: make clear_contig_region() preemptible".
>>
>> Changelog:
>>
>> v2:
>>  - Addressed review comments from peterz, tglx.
>>  - Removed clear_user_pages(), and CONFIG_X86_32:clear_pages()
>>  - General code cleanup
>>
>> Also at:
>>  github.com/terminus/linux clear-pages.v2
>>
>> Comments appreciated!
>>
>> Ankur Arora (9):
>>   mm/clear_huge_page: allow arch override for clear_huge_page()
>>   mm/huge_page: separate clear_huge_page() and copy_huge_page()
>>   mm/huge_page: cleanup clear_/copy_subpage()
>>   x86/clear_page: extend clear_page*() for multi-page clearing
>>   x86/clear_page: add clear_pages()
>>   x86/clear_huge_page: multi-page clearing
>>   sched: define TIF_ALLOW_RESCHED
>>   irqentry: define irqentry_exit_allow_resched()
>>   x86/clear_huge_page: make clear_contig_region() preemptible
>>
>>  arch/x86/include/asm/page_64.h     |  27 +++--
>>  arch/x86/include/asm/thread_info.h |   2 +
>>  arch/x86/lib/clear_page_64.S       |  52 ++++++---
>>  arch/x86/mm/hugetlbpage.c          |  59 ++++++++++
>>  include/linux/entry-common.h       |  13 +++
>>  include/linux/sched.h              |  30 +++++
>>  kernel/entry/common.c              |  13 ++-
>>  kernel/sched/core.c                |  32 ++---
>>  mm/memory.c                        | 181 +++++++++++++++++------------
>>  9 files changed, 297 insertions(+), 112 deletions(-)
>>
>> --
>> 2.31.1
>>

--
ankur
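P.S. For anyone skimming the series, roughly how the cover letter describes
the pieces fitting together. This is only a sketch based on the patch
descriptions above: clear_contig_region()'s exact signature and the
"disallow" counterpart are guesses, not the code from the patches.

/*
 * Sketch (assumed shapes, not the actual patches):
 *
 *  - clear_pages() clears a multi-page extent with a single REP; STOS,
 *    exposing the full region-size to the CPU as a hint.
 *  - allow_resched() marks the section so that, under
 *    preempt_model_none()/preempt_model_voluntary(), irqentry_exit can
 *    reschedule us mid-clear instead of the clearing loop calling
 *    cond_resched() per page.
 */
static void clear_contig_region(void *addr, unsigned long npages)
{
	allow_resched();		/* "sched: define TIF_ALLOW_RESCHED" */
	clear_pages(addr, npages);	/* single long-running clear */
	disallow_resched();		/* assumed counterpart; name is a guess */
}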