The patchset proposes one of the enhancements to NUMA VMA scanning
suggested by Mel. This is a continuation of [3].

The existing mechanism derives the scan period from per-thread stats.
Process Adaptive autoNUMA [1] proposed gathering NUMA fault stats at the
per-process level to capture application behaviour better.

During that discussion, Mel proposed several ideas to enhance current
NUMA balancing. One of the suggestions was:

  Track what threads access a VMA. The suggestion was to use an unsigned
  long pid_mask and use the lower bits to tag approximately what threads
  access a VMA. Skip VMAs that did not trap a fault. This would be
  approximate because of PID collisions but would reduce scanning of
  areas the thread is not interested in.

The above suggestion intends not to penalize threads that have no
interest in the VMA, thus reducing scanning overhead.

V3 changes are mostly based on PeterZ's comments (details below in
changes).

Summary of patchset:

Current patchset implements:
1. Delay the VMA scanning logic for newly created VMAs so that the
   additional overhead of scanning is not incurred for short-lived tasks
   (implementation by Mel).
2. Store the information of tasks accessing a VMA in 2 windows. It is
   regularly cleared at (4*sysctl_numa_balancing_scan_delay) intervals.
   This interval was derived from experimenting (suggested by PeterZ) to
   balance between frequent clearing and obsolete access data.
3. hash_32 is used to encode the index of the task accessing the VMA.
4. The VMA's access information is used to skip scanning for tasks
   that have not accessed the VMA.

Things to ponder over:
==========================================
- Improvement to the logic that clears accessing PIDs (discussed in
  detail in patch3 itself; done in this patchset by implementing a
  2-window history).
- The current scan period is not changed in the patchset, so we do see
  frequent attempts to scan. Relaxing the scan period dynamically could
  improve results further.

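For readers who want a concrete picture of the access tracking described
in the summary (PID hashing, two windows, skip decision), here is a
minimal userspace sketch. It is not the code from the patches: the
struct and function names (vma_access_sketch, hash32_sketch,
record_access, task_accessed_vma, rotate_windows) are hypothetical, a
64-bit mask is assumed, and the convention that slot [1] is the current
window and slot [0] the previous one is an assumption made for
illustration. The multiplier is the one used by the kernel's hash_32().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GOLDEN_RATIO_32 0x61C88647u /* multiplier used by the kernel's hash_32() */
#define PID_BITS 6                  /* assumption: 64-bit mask, so 6 index bits */

/* Hypothetical per-VMA state: two windows of "who touched this VMA" bits. */
struct vma_access_sketch {
	uint64_t access_pids[2]; /* assumed: [0] = previous window, [1] = current */
};

/* Multiplicative hash: keep the top 'bits' bits of the 32-bit product. */
static unsigned int hash32_sketch(uint32_t val, unsigned int bits)
{
	assert(bits > 0 && bits <= 32);
	return (uint32_t)(val * GOLDEN_RATIO_32) >> (32 - bits);
}

/* On a NUMA hint fault, tag the faulting task in the current window. */
static void record_access(struct vma_access_sketch *st, uint32_t pid)
{
	st->access_pids[1] |= UINT64_C(1) << hash32_sketch(pid, PID_BITS);
}

/* Scan-time check: did this task (or a colliding PID) fault here recently? */
static bool task_accessed_vma(const struct vma_access_sketch *st, uint32_t pid)
{
	uint64_t pids = st->access_pids[0] | st->access_pids[1];

	return (pids >> hash32_sketch(pid, PID_BITS)) & 1;
}

/* Periodic reset: the current window becomes the previous one. */
static void rotate_windows(struct vma_access_sketch *st)
{
	st->access_pids[0] = st->access_pids[1];
	st->access_pids[1] = 0;
}
```

In this sketch a task would scan a VMA only when task_accessed_vma()
returns true, and rotate_windows() would run once per reset interval.
Two PIDs that hash to the same bit cause extra scanning, never a missed
access, and an access survives one rotation before it is forgotten.
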
[1] sched/numa: Process Adaptive autoNUMA
 Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@xxxxxxx/T/

[2] RFC V1
 Link: https://lore.kernel.org/all/cover.1673610485.git.raghavendra.kt@xxxxxxx/

[3] V2
 Link: https://lore.kernel.org/lkml/cover.1675159422.git.raghavendra.kt@xxxxxxx/

Changes since V2:
Patch1:
 - Renaming of structure, macro to function
 - Add explanation to heuristics
 - Adding more details from result (PeterZ)
Patch2:
 - Usage of test and set bit (PeterZ)
 - Move storing access PID info to numa_migrate_prep()
 - Add a note on fairness among tasks allowed to scan (PeterZ)
Patch3:
 - Maintain two windows of access PID information
   (PeterZ supported the implementation and gave the idea to extend to N
   if needed)
Patch4:
 - Apply hash_32 function to track VMA accessing PIDs (PeterZ)

Changes since RFC V1:
 - Include Mel's vma scan delay patch
 - Change the accessing pid store logic (Thanks Mel)
 - Fencing structure / code to NUMA_BALANCING (David, Mel)
 - Adding clearing access PID logic (Mel)
 - Descriptive change log (Mike Rapoport)

Results:
Summary: Huge autonuma cost reduction seen in mmtest. Kernbench and
dbench improve by around 5%, and mmtest autonuma shows a huge (80%+)
reduction in system time.

kernbench
=============
                           6.1.0-base            6.1.0-patched
Amean user-256      22437.65 (   0.00%)    22622.16 *  -0.82%*
Amean syst-256       9290.30 (   0.00%)     8763.85 *   5.67%*
Amean elsp-256        159.36 (   0.00%)      157.44 *   1.20%*

Duration User         67322.16     67876.18
Duration System       27884.89     26306.28
Duration Elapsed        498.95       494.42

Ops NUMA alloc hit                  1738904367.00   1738882062.00
Ops NUMA alloc local                1738904104.00   1738881490.00
Ops NUMA base-page range updates        440526.00       272095.00
Ops NUMA PTE updates                    440526.00       272095.00
Ops NUMA hint faults                    109109.00        55630.00
Ops NUMA hint local faults %              5474.00          196.00
Ops NUMA hint local percent                  5.02            0.35
Ops NUMA pages migrated                 103400.00        55434.00
Ops AutoNUMA cost                          550.59          281.11

autonumabench
===============
                                       6.1.0-base          6.1.0-patched
Amean syst-NUMA01                  252.55 (   0.00%)     27.71 *  89.03%*
Amean syst-NUMA01_THREADLOCAL        0.20 (   0.00%)      0.23 * -12.77%*
Amean syst-NUMA02                    0.91 (   0.00%)      0.76 *  16.22%*
Amean syst-NUMA02_SMT                0.67 (   0.00%)      0.67 *  -1.07%*
Amean elsp-NUMA01                  269.93 (   0.00%)    309.44 * -14.64%*
Amean elsp-NUMA01_THREADLOCAL        1.05 (   0.00%)      1.07 *  -1.36%*
Amean elsp-NUMA02                    3.26 (   0.00%)      3.29 *  -0.79%*
Amean elsp-NUMA02_SMT                3.73 (   0.00%)      3.52 *   5.64%*

Duration User        318683.69    330084.06
Duration System        1780.77       206.14
Duration Elapsed       1954.30      2233.06

Ops NUMA alloc hit                    62237331.00     49179090.00
Ops NUMA alloc local                  62235222.00     49177092.00
Ops NUMA base-page range updates      85303091.00        29242.00
Ops NUMA PTE updates                  85303091.00        29242.00
Ops NUMA hint faults                  87457481.00         8302.00
Ops NUMA hint local faults %          66665145.00         6064.00
Ops NUMA hint local percent                 76.23           73.04
Ops NUMA pages migrated                9348511.00         2232.00
Ops AutoNUMA cost                       438062.15           41.76

dbench
========
dbench -t 90 <nproc>

Throughput
#clients    base               patched            %improvement
1            842.655 MB/sec     922.305 MB/sec    9.45
16           5062.82 MB/sec     5079.85 MB/sec    0.34
64           9408.81 MB/sec     9980.89 MB/sec    6.08
256          7076.59 MB/sec     7590.76 MB/sec    7.26

Mel Gorman (1):
  sched/numa: Apply the scan delay to every new vma

Raghavendra K T (3):
  sched/numa: Enhance vma scanning logic
  sched/numa: implement access PID reset logic
  sched/numa: Use hash_32 to mix up PIDs accessing VMA

 include/linux/mm.h       | 30 +++++++++++++++++++++
 include/linux/mm_types.h |  9 +++++++
 kernel/fork.c            |  2 ++
 kernel/sched/fair.c      | 57 ++++++++++++++++++++++++++++++++++++++++
 mm/memory.c              |  3 +++
 5 files changed, 101 insertions(+)

--
2.34.1