This series adds a multi-page clearing primitive, clear_pages(), which enables more effective use of x86 string instructions by advertising the real region-size to be cleared. Region-size can be used as a hint by uarchs to optimize the clearing. Also add allow_resched() which marks a code-section as allowing rescheduling in the irqentry_exit path. This allows clear_pages() to get by without having to call cond_sched() periodically. (preempt_model_full() already handles this via irqentry_exit_cond_resched(), so we handle this similarly for preempt_model_none() and preempt_model_voluntary().) Performance == With this demand fault performance gets a decent increase: *Milan* mm/clear_huge_page x86/clear_huge_page change (GB/s) (GB/s) pg-sz=2MB 14.55 19.29 +32.5% pg-sz=1GB 19.34 49.60 +156.4% Milan (and some other AMD Zen uarchs tested) take advantage of the hint to elide cacheline allocation for pg-sz=1GB. The cut-off for this optimization seems to be at around region-size > LLC-size so the pg-sz=2MB load still allocates cachelines. *Icelakex* mm/clear_huge_page x86/clear_huge_page change (GB/s) (GB/s) pg-sz=2MB 9.19 12.94 +40.8% pg-sz=1GB 9.36 12.97 +38.5% Icelakex sees a decent improvement in performance but for both region-sizes does continue to allocate cachelines. Negative: there is, a downside to clearing in larger chunks: the current approach clears page-at-a-time, narrowing towards the faulting subpage. This has better cache characteristics for some sequential access workloads where subpages near the faulting page have a greater likelihood of access. I'm not sure if there are real cases which care about this workload but one example is the vm-scalability/case-anon-w-seq-hugetlb test. This test starts a process for each online CPU, with each process writing sequentially to its set of hugepages. The bottleneck here is the memory pipe and so the improvement in stime is limited, and because the clearing is less cache-optimal now, utime suffers from worse user cache misses. *Icelakex* mm/clear_huge_page x86/clear_huge_page change (tasks=128, mem=4GB/task) stime 286.8 +- 3.6% 243.9 +- 4.1% -14.9% utime 497.7 +- 4.1% 553.5 +- 2.0% +11.2% wall-clock 6.9 +- 2.8% 7.0 +- 1.4% + 1.4% *Milan* mm/clear_huge_page x86/clear_huge_page change (mem=1GB/task, tasks=512) stime 501.3 +- 1.4% 498.0 +- 0.9% -0.5% utime 298.7 +- 1.1% 335.0 +- 2.2% +12.1% wall-clock 3.5 +- 2.8% 3.8 +- 2.6% +8.5% The same test performs better if we have a smaller number of processes, since there is more backend BW available, and thus the improved stime compensates for the worse utime. This could be improved by using more circuitous chunking (somewhat like this: https://lore.kernel.org/lkml/20220606203725.1313715-1-ankur.a.arora@xxxxxxxxxx/). But I'm not sure if it is worth doing. Opinions? Patches == Patch 1, 2, 3: "mm/clear_huge_page: allow arch override for clear_huge_page()", "mm/huge_page: separate clear_huge_page() and copy_huge_page()", "mm/huge_page: cleanup clear_/copy_subpage()" are minor. The first one allows clear_huge_page() to have an arch specific version and the other two are mechanical cleanup patches. Patches 3, 4, 5: "x86/clear_page: extend clear_page*() for multi-page clearing", "x86/clear_page: add clear_pages()", "x86/clear_huge_page: multi-page clearing" define the x86 specific clear_pages() and clear_huge_pages(). Patches 6, 7, 8: "sched: define TIF_ALLOW_RESCHED" "irqentry: define irqentry_exit_allow_resched()" which defines allow_resched() to demarcate preemptible sections. This gets used in patch 9: "x86/clear_huge_page: make clear_contig_region() preemptible". Changelog: v2: - Addressed review comments from peterz, tglx. - Removed clear_user_pages(), and CONFIG_X86_32:clear_pages() - General code cleanup Also at: github.com/terminus/linux clear-pages.v2 Comments appreciated! Ankur Arora (9): mm/clear_huge_page: allow arch override for clear_huge_page() mm/huge_page: separate clear_huge_page() and copy_huge_page() mm/huge_page: cleanup clear_/copy_subpage() x86/clear_page: extend clear_page*() for multi-page clearing x86/clear_page: add clear_pages() x86/clear_huge_page: multi-page clearing sched: define TIF_ALLOW_RESCHED irqentry: define irqentry_exit_allow_resched() x86/clear_huge_page: make clear_contig_region() preemptible arch/x86/include/asm/page_64.h | 27 +++-- arch/x86/include/asm/thread_info.h | 2 + arch/x86/lib/clear_page_64.S | 52 ++++++--- arch/x86/mm/hugetlbpage.c | 59 ++++++++++ include/linux/entry-common.h | 13 +++ include/linux/sched.h | 30 +++++ kernel/entry/common.c | 13 ++- kernel/sched/core.c | 32 ++--- mm/memory.c | 181 +++++++++++++++++------------ 9 files changed, 297 insertions(+), 112 deletions(-) -- 2.31.1