Re: [RFC PATCH V1 0/6] sched/numa: Enhance disjoint VMA scanning

Raghavendra K T <raghavendra.kt@xxxxxxx> · Wed, 20 Sep 2023 16:12:45 +0530

On 9/19/2023 9:52 PM, Mel Gorman wrote:
On Tue, Sep 19, 2023 at 11:28:30AM +0200, Peter Zijlstra wrote:
On Tue, Aug 29, 2023 at 11:36:08AM +0530, Raghavendra K T wrote:

Peter Zijlstra (1):
   sched/numa: Increase tasks' access history

Raghavendra K T (5):
   sched/numa: Move up the access pid reset logic
   sched/numa: Add disjoint vma unconditional scan logic
   sched/numa: Remove unconditional scan logic using mm numa_scan_seq
   sched/numa: Allow recently accessed VMAs to be scanned
   sched/numa: Allow scanning of shared VMAs

  include/linux/mm.h       |  12 +++--
  include/linux/mm_types.h |   5 +-
  kernel/sched/fair.c      | 109 ++++++++++++++++++++++++++++++++-------
  3 files changed, 102 insertions(+), 24 deletions(-)

So I don't immediately see anything horrible with this. Mel, do you have
a few cycles to go over this as well?

I've been trying my best to find the necessary time and it's still on my
radar for this week. 

Hello Mel,
Thank you very much for your time, for the detailed look, and for your
patches.

In summary, I will start with your patchset.
Link: https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/ sched-numabselective-v1r5
and see if there are any cumulative benefits from my patches (3-6) on top
of them.

I am trying to give some details for your questions below. Please skip if it is
too long.

Preliminary results don't look great for the first part
of the series up to the patch "sched/numa: Add disjoint vma unconditional
scan logic" even though other reports indicate the performance may be
fixed up later in the series. For example

autonumabench
                                    6.5.0-rc6              6.5.0-rc6
                          sched-pidclear-v1r5   sched-forcescan-v1r5
Min       syst-NUMA02        1.94 (   0.00%)        1.38 (  28.87%)
Min       elsp-NUMA02       12.67 (   0.00%)       21.02 ( -65.90%)
Amean     syst-NUMA02        2.35 (   0.00%)        1.86 (  21.13%)
Amean     elsp-NUMA02       12.93 (   0.00%)       21.69 * -67.76%*
Stddev    syst-NUMA02        0.54 (   0.00%)        0.90 ( -67.67%)
Stddev    elsp-NUMA02        0.18 (   0.00%)        0.44 (-144.19%)
CoeffVar  syst-NUMA02       22.82 (   0.00%)       48.50 (-112.58%)
CoeffVar  elsp-NUMA02        1.38 (   0.00%)        2.01 ( -45.56%)
Max       syst-NUMA02        3.15 (   0.00%)        3.89 ( -23.49%)
Max       elsp-NUMA02       13.16 (   0.00%)       22.36 ( -69.91%)
BAmean-50 syst-NUMA02        2.01 (   0.00%)        1.45 (  27.69%)
BAmean-50 elsp-NUMA02       12.77 (   0.00%)       21.34 ( -67.04%)
BAmean-95 syst-NUMA02        2.22 (   0.00%)        1.52 (  31.68%)
BAmean-95 elsp-NUMA02       12.89 (   0.00%)       21.58 ( -67.39%)
BAmean-99 syst-NUMA02        2.22 (   0.00%)        1.52 (  31.68%)
BAmean-99 elsp-NUMA02       12.89 (   0.00%)       21.58 ( -67.39%)

                             6.5.0-rc6              6.5.0-rc6
                   sched-pidclear-v1r5   sched-forcescan-v1r5
Duration User                  5702.00               10264.25
Duration System                  17.02                  13.59
Duration Elapsed                 92.57                 156.30

Similar results seen across multiple machines. It's not universally bad
but the NUMA02 tests appear to suffer quite badly and while not realistic,
they are somewhat relevant because numa02 is likely an "adverse workload"
for the logic that skips VMAs based on PID accesses.

For the rest of the series, the changelogs lacked detail on why those
changes helped. Patch 4's changelog lacks detail and patch 6 stating
"VMAs being accessed by more than two tasks are critical" is not helpful
either -- e.g. why are they critical?

Agreed. For patches 5 and 6 (scanning recently accessed VMAs and shared
VMAs) there was a brief rationale in the cover letter, but perhaps it was
not enough.

More background:
I had used trace prints to understand VMA sizes, PID hashes, the success
percentage of is_vma_accessed(), and how many tasks typically access a
VMA, for some of the workloads.
(vma_size here is in KB)

E.g.,
           <...>-1451602 [116] ...1. 39195.488591: vma_fault: vma=ffff8bcab42ad7b8 pid=1451602 hash=40, success=1
           <...>-1451481 [210] ..... 39196.948390: sched_numascan: comm=numa01 pid=1451481 vma = ffff8bc9228637b8 access_hist=4200000cfe66727 hashval = 26 bitmap_wt = 22, vma_size = 3153924 success = 1
           <...>-1451570 [052] ...1. 39196.948725: vma_fault: vma=ffff8bc9228637b8 pid=1451570 hash=25, success=1
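
To make the fields in the trace above easier to read, here is a minimal userspace sketch. It assumes the PID hash is the kernel's 32-bit multiplicative hash (hash_32) reduced to 6 bits, and that bitmap_wt is the number of set bits in the printed access_hist word. The helper names pid_slot(), access_weight() and slot_is_set() are made up for illustration and are not taken from the patches.

```c
#include <stdint.h>

/* Same constant the kernel uses for its 32-bit multiplicative hash. */
#define GOLDEN_RATIO_32 0x61C88647u

/* Assumption: a PID is hashed into one of 64 slots (6 bits), one per bit. */
static unsigned int pid_slot(uint32_t pid)
{
	return (uint32_t)(pid * GOLDEN_RATIO_32) >> (32 - 6);
}

/* Number of set bits in an access-history word (bitmap_wt in the trace). */
static unsigned int access_weight(uint64_t access_hist)
{
	unsigned int n = 0;

	/* Clear the lowest set bit on each iteration. */
	for (; access_hist; access_hist &= access_hist - 1)
		n++;
	return n;
}

/* Is the bit for a given PID slot set in the access-history word? */
static int slot_is_set(uint64_t access_hist, unsigned int slot)
{
	return (int)((access_hist >> slot) & 1u);
}
```

Under these assumptions, pid_slot() gives 40, 26 and 25 for PIDs 1451602, 1451481 and 1451570, and access_weight(0x4200000cfe66727) gives 22, which agrees with the hash, hashval and bitmap_wt values printed above.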

1) For very large VMAs we may incur a delay in scanning the whole VMA,
because we scan only in 256MB chunks and filter out tasks which have not
touched them. So the idea was to speed up the scanning.

2) A similar rationale applies to recently accessed VMAs, i.e., not to
delay scanning of very recently accessed (hot) VMAs.

[ I did not explore using young page info, mm walk etc as I thought it
may be expensive ].
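
As a rough worked example of point 1), the sketch below counts how many scan passes a VMA needs if each pass covers at most one 256MB chunk. That one-chunk-per-pass model is a simplifying assumption, and scan_passes() is a hypothetical helper, not code from the series.

```c
#include <stdint.h>

/*
 * Scan passes needed to cover a VMA of vma_kb kilobytes when each pass
 * covers at most chunk_kb kilobytes (assumed: one chunk per pass).
 */
static uint64_t scan_passes(uint64_t vma_kb, uint64_t chunk_kb)
{
	return (vma_kb + chunk_kb - 1) / chunk_kb;	/* round up */
}
```

For the numa01 VMA in the trace (vma_size = 3153924 KB, about 3GB), scan_passes(3153924, 256 * 1024) is 13, so a task that is filtered out on some of those passes delays full coverage of the VMA even further.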

They are obviously shared VMAs and
therefore it may be the case that they need to be identified and interleaved
quickly

Yes. That was mostly the idea, as mentioned above.

but maybe not. Is the shared VMA that is critical a large malloc'd
area split into per-thread sections or something that is MAP_SHARED? The
changelog doesn't say so I have to guess. There are also a bunch of
magic variables with limited explanation (e.g. why NR_ACCESS_PID_HIST==4
and SHARED_VMA_THRESH=3?),

Those thresholds were the result of multiple experiments I did
(SHARED_VMA_THRESH = 3, 4, ...; NR_ACCESS_PID_HIST = 3, 4, etc.).

One thing I did not look at is whether I should reduce the PID_RESET
interval (because we are maintaining more history now).

the numab fields are not documented

Agree, I should have done better earlier.

and the
changelogs lack supporting data. I suspect that patches 3-6 may be dealing
with regressions introduced by patch 2, particularly for NUMA02, but I'm

TBH, I did not really intend to worsen numa02 and then improve it later.
This is the data I had for the full patchset.

autonumabench
                             base                   patched
Min       syst-NUMA02        0.99 (   0.00%)        0.99 (   0.00%)
Min       elsp-NUMA02        3.04 (   0.00%)        3.04 (   0.00%)
Amean     syst-NUMA02        1.06 (   0.00%)        1.05 *   1.08%*
Amean     elsp-NUMA02        3.80 (   0.00%)        3.39 *  10.68%*
Stddev    syst-NUMA02        0.10 (   0.00%)        0.07 (  24.57%)
Stddev    elsp-NUMA02        0.73 (   0.00%)        0.34 (  52.86%)
CoeffVar  syst-NUMA02        9.04 (   0.00%)        6.89 (  23.75%)
CoeffVar  elsp-NUMA02       19.25 (   0.00%)       10.16 (  47.22%)
Max       syst-NUMA02        1.27 (   0.00%)        1.21 (   4.72%)
Max       elsp-NUMA02        4.91 (   0.00%)        4.04 (  17.72%)
BAmean-50 syst-NUMA02        1.00 (   0.00%)        1.01 (  -0.66%)
BAmean-50 elsp-NUMA02        3.21 (   0.00%)        3.12 (   2.60%)
BAmean-95 syst-NUMA02        1.03 (   0.00%)        1.02 (   0.32%)
BAmean-95 elsp-NUMA02        3.61 (   0.00%)        3.28 (   9.09%)
BAmean-99 syst-NUMA02        1.03 (   0.00%)        1.02 (   0.32%)
BAmean-99 elsp-NUMA02        3.61 (   0.00%)        3.28 (   9.09%)

Duration User        1555.24     1377.57
Duration System         8.10        7.99
Duration Elapsed       30.86       26.49

But then I saw the results from the kernel test robot, which compared
individual patches:

commit:
  2f88c8e802 ("(tip/sched/core) sched/eevdf/doc: Modify the documented knob to base_slice_ns as well")
  2a806eab1c ("sched/numa: Move up the access pid reset logic")
  1ef5cbb92b ("sched/numa: Add disjoint vma unconditional scan logic")
  68cfe9439a ("sched/numa: Allow scanning of shared VMAs")

2f88c8e802c8b128 2a806eab1c2e1c9f0ae39dc0307 1ef5cbb92bdb320c5eb9fdee1a8 68cfe9439a1baa642e05883fa64
---------------- --------------------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \          |                \
    271.01            +0.8%     273.24            -0.7%     269.00           -26.4%     199.49 ±  3%  autonuma-benchmark.numa01.seconds
     76.28            +0.2%      76.44           -11.7%      67.36 ±  6%     -46.9%      40.49 ±  5%  autonuma-benchmark.numa01_THREAD_ALLOC.seconds
      8.11            -0.9%       8.04            -0.7%       8.05            -0.1%       8.10        autonuma-benchmark.numa02.seconds
      1425            +0.7%       1434            -3.1%       1381           -30.1%     996.02 ±  2%  autonuma-benchmark.time.elapsed_time

I do see some negligible overhead from the first patch, but the second
patch still gave some improvement.

My observation with the patchset was an increase in system time
because of the additional scanning we re-introduced, but this
was still 2x better than where we started without the numascan enhancements.

not certain as I didn't dedicate the necessary test time to prove that
and it's the type of information that should be in the changelog. While
there is nothing wrong with that as such, it's very hard to imagine how
patches 3-6 work in every case and be certain that the various parameters
make sense. That could cause difficulties later in terms of maintenance.

Agree regarding maintenance.

My initial thinking was "There should be a standalone series that deals
*only* with scanning VMAs that had no fault activity and skipped due to
PID hashing". These are important because there may be no fault activity
because there is no scan activity which is due to no fault activity. The
series is incomplete and without changelogs but I pushed it anyway to

Agreed.

https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/ sched-numabselective-v1r5

Thanks. The patches are simple to start with (1-4), with a force scan in
patch 5. I will experiment with these.

The first two patches simply improve the documentation on what is going
on, patch 3 adds a tracepoint for figuring out why VMAs were skipped or
not skipped. Patch 4 handles a corner case to complete the scan of a VMA
once it has started regardless of what task is doing the scanning. The
last patch scans VMAs that have seen no fault activity once active VMAs
have been scanned.
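
The circular dependency described above (no faults because there is no scanning, and no scanning because there are no faults) can be illustrated with a toy userspace model. This is only a sketch under stated assumptions: struct toy_vma, should_scan() and record_fault() are hypothetical names, the force_idle flag stands in for a forced scan of fault-free VMAs, and the model ignores any other conditions under which the kernel scans a VMA.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy VMA: one bit per hashed PID slot that has faulted in this VMA. */
struct toy_vma {
	uint64_t pids_active;
};

/* PID-based filter: has the task in this slot faulted in the VMA? */
static bool accessed_by(const struct toy_vma *vma, unsigned int slot)
{
	return (vma->pids_active >> slot) & 1u;
}

/*
 * Decide whether to scan. With force_idle false, a VMA that has never
 * faulted is never scanned, so it can never take a hinting fault either.
 * With force_idle true, fault-free VMAs are scanned anyway.
 */
static bool should_scan(const struct toy_vma *vma, unsigned int slot,
			bool force_idle)
{
	if (accessed_by(vma, slot))
		return true;
	return force_idle && vma->pids_active == 0;
}

/* Model a hinting fault after a scan: record the faulting task's slot. */
static void record_fault(struct toy_vma *vma, unsigned int slot)
{
	vma->pids_active |= (uint64_t)1 << slot;
}
```

In this model a VMA with an empty pids_active mask is skipped forever unless force_idle is set. After one forced scan and a subsequent record_fault(), the ordinary PID filter takes over for that task.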

It has its weaknesses because it may be overly simplistic and it forces
all VMAs to be scanned on every sequence which is wasteful. It also hurts
NUMA02 performance, although not as badly as "sched/numa: Add disjoint
vma unconditional scan logic". On the plus side, it is easier to reason
about, it solves only one problem in the series and any patch on top or
modification should justify each change individually.

Is there anything else you have in mind that I should look into apart from
the above (rebasing onto your patches and experimenting with my patches 3-6
for any cumulative improvements)?

Thanks and Regards
- Raghu