Patch "sched/numa: Complete scanning of inactive VMAs when there is no alternative" has been added to the 6.6-stable tree

Sasha Levin <sashal@xxxxxxxxxx> · Mon, 30 Sep 2024 20:00:43 -0400

This is a note to let you know that I've just added the patch titled

    sched/numa: Complete scanning of inactive VMAs when there is no alternative

to the 6.6-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     sched-numa-complete-scanning-of-inactive-vmas-when-t.patch
and it can be found in the queue-6.6 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit a5145866409098262ff88a4fba9602b1c2435ad9
Author: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
Date:   Tue Oct 10 09:31:43 2023 +0100

    sched/numa: Complete scanning of inactive VMAs when there is no alternative
    
    [ Upstream commit f169c62ff7cd1acf8bac8ae17bfeafa307d9e6fa ]
    
    VMAs are skipped if there is no recent fault activity but this represents
    a chicken-and-egg problem as there may be no fault activity if the PTEs
    are never updated to trap NUMA hints. There is an indirect reliance on
    scanning to be forced early in the lifetime of a task but this may fail
    to detect changes in phase behaviour. Force inactive VMAs to be scanned
    when all other eligible VMAs have been updated within the same scan
    sequence.
    
    Test results in general look good with some changes in performance, both
    negative and positive, depending on whether the additional scanning and
    faulting was beneficial or not to the workload. The autonuma benchmark
    workload NUMA01_THREADLOCAL was picked for closer examination. The workload
    creates two processes with numerous threads and thread-local storage that
    is zero-filled in a loop. It exercises the corner case where unrelated
    threads may skip VMAs that are thread-local to another thread and still
    has some VMAs that are inactive while the workload executes.
    
    The VMA skipping activity frequency with and without the patch:
    
            6.6.0-rc2-sched-numabtrace-v1
            =============================
                649 reason=scan_delay
              9,094 reason=unsuitable
             48,915 reason=shared_ro
            143,919 reason=inaccessible
            193,050 reason=pid_inactive
    
            6.6.0-rc2-sched-numabselective-v1
            =================================
                146 reason=seq_completed
                622 reason=ignore_pid_inactive
    
                624 reason=scan_delay
              6,570 reason=unsuitable
             16,101 reason=shared_ro
             27,608 reason=inaccessible
             41,939 reason=pid_inactive
    
    Note that with the patch applied, the PID activity is ignored
    (ignore_pid_inactive) to ensure a VMA with some activity is completely
    scanned. In addition, a small number of VMAs are scanned when no other
    eligible VMA is available during a single scan window (seq_completed).
    The number of times a VMA is skipped due to no PID activity from the
    scanning task (pid_inactive) drops dramatically. It is expected that
    this will increase the number of PTEs updated for NUMA hinting faults
    as well as hinting faults but these represent PTEs that would otherwise
    have been missed. The tradeoff is scan+fault overhead versus improving
    locality due to migration.
    
    On a 2-socket Cascade Lake test machine, the time to complete the
    workload is as follows:
    
                                                     6.6.0-rc2              6.6.0-rc2
                                           sched-numabtrace-v1 sched-numabselective-v1
      Min       elsp-NUMA01_THREADLOCAL      174.22 (   0.00%)      117.64 (  32.48%)
      Amean     elsp-NUMA01_THREADLOCAL      175.68 (   0.00%)      123.34 *  29.79%*
      Stddev    elsp-NUMA01_THREADLOCAL        1.20 (   0.00%)        4.06 (-238.20%)
      CoeffVar  elsp-NUMA01_THREADLOCAL        0.68 (   0.00%)        3.29 (-381.70%)
      Max       elsp-NUMA01_THREADLOCAL      177.18 (   0.00%)      128.03 (  27.74%)
    
    The time to complete the workload is reduced by almost 30%:
    
                                 6.6.0-rc2               6.6.0-rc2
                       sched-numabtrace-v1 sched-numabselective-v1
      Duration User               91201.80                63506.64
      Duration System              2015.53                 1819.78
      Duration Elapsed             1234.77                  868.37
    
    In this specific case, system CPU time did not increase, but this is
    not universally true.
    
    From vmstat, the NUMA scanning and fault activity is as follows:
    
                                                 6.6.0-rc2               6.6.0-rc2
                                       sched-numabtrace-v1 sched-numabselective-v1
      Ops NUMA base-page range updates            64272.00             26374386.00
      Ops NUMA PTE updates                        36624.00                55538.00
      Ops NUMA PMD updates                           54.00                51404.00
      Ops NUMA hint faults                        15504.00                75786.00
      Ops NUMA hint local faults %                14860.00                56763.00
      Ops NUMA hint local percent                    95.85                   74.90
      Ops NUMA pages migrated                      1629.00              6469222.00
    
    The numbers of PTE updates and hint faults are both dramatically
    increased. While this is superficially unfortunate, it represents
    ranges that were simply skipped without the patch. As a result
    of the scanning and hinting faults, many more pages were also
    migrated but as the time to completion is reduced, the overhead
    is offset by the gain.
    
    Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
    Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>
    Tested-by: Raghavendra K T <raghavendra.kt@xxxxxxx>
    Link: https://lore.kernel.org/r/20231010083143.19593-7-mgorman@xxxxxxxxxxxxxxxxxxx
    Stable-dep-of: f22cde4371f3 ("sched/numa: Fix the vma scan starving issue")
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
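
For readers following the control flow, the two-phase scan described in the
changelog can be approximated in userspace. This is only a minimal sketch
under stated assumptions, not the kernel code: struct toy_vma and
toy_scan_pass() are hypothetical names, the "accessed" flag stands in for
vma_is_accessed(), and the page budget, numa_scan_offset handling, locking
and tracepoints are all omitted. Only prev_scan_seq and the skip/force
logic mirror the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA and its NUMA balancing state. */
struct toy_vma {
	bool accessed;		/* stand-in for vma_is_accessed() */
	int prev_scan_seq;	/* sequence in which the VMA was last fully scanned */
	int times_scanned;	/* bookkeeping for this sketch only */
};

/*
 * Run one invocation of the toy scanner over vmas[0..n) for scan sequence
 * scan_seq. Returns the number of VMAs scanned by this invocation.
 */
static int toy_scan_pass(struct toy_vma *vmas, size_t n, int scan_seq)
{
	bool pids_skipped = false;	/* a VMA was skipped for PID inactivity */
	bool pids_forced = false;	/* second pass: ignore PID inactivity */
	int scanned = 0;
	size_t i;

retry:
	for (i = 0; i < n; i++) {
		struct toy_vma *vma = &vmas[i];

		/* Already completely scanned in this sequence (seq_completed). */
		if (vma->prev_scan_seq == scan_seq)
			continue;

		/* Skip inactive VMAs unless scanning is being forced. */
		if (!pids_forced && !vma->accessed) {
			pids_skipped = true;
			continue;
		}

		/* "Scan" the VMA and mark it complete for this sequence. */
		vma->prev_scan_seq = scan_seq;
		vma->times_scanned++;
		scanned++;

		/* Force at most one inactive VMA per invocation. */
		if (pids_forced)
			break;
	}

	/* End of list reached and something was skipped: force one scan. */
	if (i == n && !pids_forced && pids_skipped) {
		pids_forced = true;
		goto retry;
	}

	return scanned;
}
```

With one active and two inactive VMAs in the same sequence, the first call
in this sketch scans the active VMA plus one forced inactive VMA, the
second call forces the remaining inactive VMA, and a third call scans
nothing because every VMA already carries the current sequence number.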

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 80d9d1b7685c6..43c19d85dfe7f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -575,6 +575,12 @@ struct vma_numab_state {
 	 * every VMA_PID_RESET_PERIOD jiffies:
 	 */
 	unsigned long pids_active[2];
+
+	/*
+	 * MM scan sequence ID when the VMA was last completely scanned.
+	 * A VMA is not eligible for scanning if prev_scan_seq == numa_scan_seq
+	 */
+	int prev_scan_seq;
 };
 
 /*
diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index 7dcc0bdfddbbf..b69afb8630db4 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -22,6 +22,7 @@ enum numa_vmaskip_reason {
 	NUMAB_SKIP_SCAN_DELAY,
 	NUMAB_SKIP_PID_INACTIVE,
 	NUMAB_SKIP_IGNORE_PID,
+	NUMAB_SKIP_SEQ_COMPLETED,
 };
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 27b51c81b1067..010ba1b7cb0ea 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -671,7 +671,8 @@ DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
 	EM( NUMAB_SKIP_INACCESSIBLE,		"inaccessible" )	\
 	EM( NUMAB_SKIP_SCAN_DELAY,		"scan_delay" )	\
 	EM( NUMAB_SKIP_PID_INACTIVE,		"pid_inactive" )	\
-	EMe(NUMAB_SKIP_IGNORE_PID,		"ignore_pid_inactive" )
+	EM( NUMAB_SKIP_IGNORE_PID,		"ignore_pid_inactive" )		\
+	EMe(NUMAB_SKIP_SEQ_COMPLETED,		"seq_completed" )
 
 /* Redefine for export. */
 #undef EM
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03eb1cab320d8..0af2be3ee849e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3233,6 +3233,8 @@ static void task_numa_work(struct callback_head *work)
 	unsigned long nr_pte_updates = 0;
 	long pages, virtpages;
 	struct vma_iterator vmi;
+	bool vma_pids_skipped;
+	bool vma_pids_forced = false;
 
 	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
 
@@ -3275,7 +3277,6 @@ static void task_numa_work(struct callback_head *work)
 	 */
 	p->node_stamp += 2 * TICK_NSEC;
 
-	start = mm->numa_scan_offset;
 	pages = sysctl_numa_balancing_scan_size;
 	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
 	virtpages = pages * 8;	   /* Scan up to this much virtual space */
@@ -3285,6 +3286,16 @@ static void task_numa_work(struct callback_head *work)
 
 	if (!mmap_read_trylock(mm))
 		return;
+
+	/*
+	 * VMAs are skipped if the current PID has not trapped a fault within
+	 * the VMA recently. Allow scanning to be forced if there is no
+	 * suitable VMA remaining.
+	 */
+	vma_pids_skipped = false;
+
+retry_pids:
+	start = mm->numa_scan_offset;
 	vma_iter_init(&vmi, mm, start);
 	vma = vma_next(&vmi);
 	if (!vma) {
@@ -3335,6 +3346,13 @@ static void task_numa_work(struct callback_head *work)
 			/* Reset happens after 4 times scan delay of scan start */
 			vma->numab_state->pids_active_reset =  vma->numab_state->next_scan +
 				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+
+			/*
+			 * Ensure prev_scan_seq does not match numa_scan_seq,
+			 * to prevent VMAs being skipped prematurely on the
+			 * first scan:
+			 */
+			 vma->numab_state->prev_scan_seq = mm->numa_scan_seq - 1;
 		}
 
 		/*
@@ -3356,8 +3374,19 @@ static void task_numa_work(struct callback_head *work)
 			vma->numab_state->pids_active[1] = 0;
 		}
 
-		/* Do not scan the VMA if task has not accessed */
-		if (!vma_is_accessed(mm, vma)) {
+		/* Do not rescan VMAs twice within the same sequence. */
+		if (vma->numab_state->prev_scan_seq == mm->numa_scan_seq) {
+			mm->numa_scan_offset = vma->vm_end;
+			trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SEQ_COMPLETED);
+			continue;
+		}
+
+		/*
+		 * Do not scan the VMA if task has not accessed it, unless no other
+		 * VMA candidate exists.
+		 */
+		if (!vma_pids_forced && !vma_is_accessed(mm, vma)) {
+			vma_pids_skipped = true;
 			trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_PID_INACTIVE);
 			continue;
 		}
@@ -3386,8 +3415,28 @@ static void task_numa_work(struct callback_head *work)
 
 			cond_resched();
 		} while (end != vma->vm_end);
+
+		/* VMA scan is complete, do not scan until next sequence. */
+		vma->numab_state->prev_scan_seq = mm->numa_scan_seq;
+
+		/*
+		 * Only force scan within one VMA at a time, to limit the
+		 * cost of scanning a potentially uninteresting VMA.
+		 */
+		if (vma_pids_forced)
+			break;
 	} for_each_vma(vmi, vma);
 
+	/*
+	 * If no VMAs are remaining and VMAs were skipped due to the PID
+	 * not accessing the VMA previously, then force a scan to ensure
+	 * forward progress:
+	 */
+	if (!vma && !vma_pids_forced && vma_pids_skipped) {
+		vma_pids_forced = true;
+		goto retry_pids;
+	}
+
 out:
 	/*
 	 * It is possible to reach the end of the VMA list but the last few