On Tue, Sep 19, 2023 at 11:28:30AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 29, 2023 at 11:36:08AM +0530, Raghavendra K T wrote:
> 
> > Peter Zijlstra (1):
> >   sched/numa: Increase tasks' access history
> > 
> > Raghavendra K T (5):
> >   sched/numa: Move up the access pid reset logic
> >   sched/numa: Add disjoint vma unconditional scan logic
> >   sched/numa: Remove unconditional scan logic using mm numa_scan_seq
> >   sched/numa: Allow recently accessed VMAs to be scanned
> >   sched/numa: Allow scanning of shared VMAs
> > 
> >  include/linux/mm.h       |  12 +++--
> >  include/linux/mm_types.h |   5 +-
> >  kernel/sched/fair.c      | 109 ++++++++++++++++++++++++++++++++-------
> >  3 files changed, 102 insertions(+), 24 deletions(-)
> 
> So I don't immediately see anything horrible with this. Mel, do you
> have a few cycles to go over this as well?

I've been trying my best to find the necessary time and it's still on
my radar for this week. Preliminary results don't look great for the
first part of the series up to the patch "sched/numa: Add disjoint vma
unconditional scan logic", even though other reports indicate the
performance may be fixed up later in the series. For example,
autonumabench:

                                       6.5.0-rc6              6.5.0-rc6
                             sched-pidclear-v1r5   sched-forcescan-v1r5
Min       syst-NUMA02      1.94 (   0.00%)        1.38 (  28.87%)
Min       elsp-NUMA02     12.67 (   0.00%)       21.02 ( -65.90%)
Amean     syst-NUMA02      2.35 (   0.00%)        1.86 (  21.13%)
Amean     elsp-NUMA02     12.93 (   0.00%)       21.69 * -67.76%*
Stddev    syst-NUMA02      0.54 (   0.00%)        0.90 ( -67.67%)
Stddev    elsp-NUMA02      0.18 (   0.00%)        0.44 (-144.19%)
CoeffVar  syst-NUMA02     22.82 (   0.00%)       48.50 (-112.58%)
CoeffVar  elsp-NUMA02      1.38 (   0.00%)        2.01 ( -45.56%)
Max       syst-NUMA02      3.15 (   0.00%)        3.89 ( -23.49%)
Max       elsp-NUMA02     13.16 (   0.00%)       22.36 ( -69.91%)
BAmean-50 syst-NUMA02      2.01 (   0.00%)        1.45 (  27.69%)
BAmean-50 elsp-NUMA02     12.77 (   0.00%)       21.34 ( -67.04%)
BAmean-95 syst-NUMA02      2.22 (   0.00%)        1.52 (  31.68%)
BAmean-95 elsp-NUMA02     12.89 (   0.00%)       21.58 ( -67.39%)
BAmean-99 syst-NUMA02      2.22 (   0.00%)        1.52 (  31.68%)
BAmean-99 elsp-NUMA02     12.89 (   0.00%)       21.58 ( -67.39%)

                            6.5.0-rc6            6.5.0-rc6
                  sched-pidclear-v1r5 sched-forcescan-v1r5
Duration User         5702.00             10264.25
Duration System         17.02                13.59
Duration Elapsed        92.57               156.30

Similar results were seen across multiple machines. It's not
universally bad, but the NUMA02 tests appear to suffer quite badly
and, while not realistic, they are somewhat relevant because numa02 is
likely an "adverse workload" for the logic that skips VMAs based on
PID accesses.

For the rest of the series, the changelogs lack detail on why the
changes help. Patch 4's changelog says very little, and patch 6
stating that "VMAs being accessed by more than two tasks are critical"
is not much help either -- why are they critical? They are obviously
shared VMAs, so it may be that they need to be identified and
interleaved quickly, but maybe not. Is the critical shared VMA a large
malloc'd area split into per-thread sections, or something that is
MAP_SHARED? The changelog doesn't say, so I have to guess. There are
also a number of magic values with limited explanation (e.g. why
NR_ACCESS_PID_HIST == 4 and SHARED_VMA_THRESH == 3?), the numab fields
are not documented and the changelogs lack supporting data. I suspect
that patches 3-6 may be dealing with regressions introduced by patch 2,
particularly for NUMA02, but I'm not certain as I didn't dedicate the
necessary test time to prove it, and it's the type of information that
should be in the changelog.
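For anyone not immediately familiar with the code being modified, the
PID-based skip check in question looks roughly like the following. This
is a simplified sketch paraphrased from kernel/sched/fair.c around
v6.5, not a verbatim copy, so treat the details loosely:

/*
 * Sketch of the per-VMA skip test used by the NUMA balancing scanner
 * (paraphrased, not verbatim). A task only scans a VMA if a hash of
 * its PID is set in the VMA's recent-access PID windows, once the
 * initial unconditional scanning period has passed.
 */
static bool vma_is_accessed(struct vm_area_struct *vma)
{
        unsigned long pids;

        /* Scan everything unconditionally for the first two passes. */
        if (READ_ONCE(current->mm->numa_scan_seq) < 2)
                return true;

        /* Combine the current and previous PID access windows. */
        pids = vma->numab_state->access_pids[0] |
               vma->numab_state->access_pids[1];
        return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
}

The consequence is that a VMA whose hinting faults were recorded from
only a subset of PIDs is skipped by every other task, which as far as I
can tell is what "sched/numa: Add disjoint vma unconditional scan
logic" is trying to compensate for.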
While there is nothing wrong as such with later patches in a series
fixing up problems introduced by earlier ones, it's very hard to reason
about how patches 3-6 behave in every case or to be certain that the
various parameters make sense. That could cause difficulties later in
terms of maintenance.

My initial thinking was "there should be a standalone series that deals
*only* with scanning VMAs that had no fault activity and were skipped
due to PID hashing". These VMAs are important because the lack of fault
activity can be self-reinforcing: a VMA that is skipped is not scanned,
so it takes no hinting faults, so it continues to be skipped. The
series is incomplete and without changelogs, but I pushed it anyway to

  https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/ sched-numabselective-v1r5

The first two patches simply improve the documentation of what is going
on. Patch 3 adds a tracepoint for figuring out why VMAs were or were
not skipped. Patch 4 handles a corner case so that the scan of a VMA is
completed once it has started, regardless of which task is doing the
scanning. The last patch scans VMAs that have seen no fault activity
once the active VMAs have been scanned. It has its weaknesses: it may
be overly simplistic, and it forces all VMAs to be scanned on every
sequence, which is wasteful. It also hurts NUMA02 performance, although
not as badly as "sched/numa: Add disjoint vma unconditional scan
logic". On the plus side, it is easier to reason about, it solves only
one problem, and any patch or modification on top would have to justify
each change individually.

-- 
Mel Gorman
SUSE Labs