Re: [PATCH 6/6] sched/numa: Complete scanning of inactive VMAs when there is no alternative

On 10/10/2023 2:01 PM, Mel Gorman wrote:
VMAs are skipped if there is no recent fault activity but this represents
a chicken-and-egg problem as there may be no fault activity if the PTEs
are never updated to trap NUMA hints. There is an indirect reliance on
scanning to be forced early in the lifetime of a task but this may fail
to detect changes in phase behaviour. Force inactive VMAs to be scanned
when all other eligible VMAs have been updated within the same scan
sequence.

Test results in general look good with some changes in performance, both
negative and positive, depending on whether the additional scanning and
faulting was beneficial or not to the workload. The autonuma benchmark
workload NUMA01_THREADLOCAL was picked for closer examination. The workload
creates two processes with numerous threads and thread-local storage that
is zero-filled in a loop. It exercises the corner case where unrelated
threads may skip VMAs that are thread-local to another thread and still
has some VMAs that are inactive while the workload executes.
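
(For concreteness, each NUMA01_THREADLOCAL thread is roughly a loop of
the shape below. This is a paraphrase of the autonuma benchmark, not
its exact code; NR_THREADS, SZ and ITERS are illustrative values.

	#include <pthread.h>
	#include <stdlib.h>
	#include <string.h>

	#define NR_THREADS	32
	#define SZ		(64UL << 20)	/* per-thread region */
	#define ITERS		100

	static void *worker(void *arg)
	{
		char *buf = malloc(SZ);	/* thread-local storage */

		(void)arg;
		for (int i = 0; i < ITERS; i++)
			memset(buf, 0, SZ);	/* zero-fill in a loop */

		free(buf);
		return NULL;
	}

	int main(void)
	{
		pthread_t t[NR_THREADS];

		for (int i = 0; i < NR_THREADS; i++)
			pthread_create(&t[i], NULL, worker, NULL);
		for (int i = 0; i < NR_THREADS; i++)
			pthread_join(t[i], NULL);
		return 0;
	}

Because each buf is only ever touched by its owning thread, the other
threads in the process see those VMAs as having no recent fault
activity from their PID.)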

The VMA skipping activity frequency with and without the patch is as
follows:

6.6.0-rc2-sched-numabtrace-v1
     649 reason=scan_delay
    9094 reason=unsuitable
   48915 reason=shared_ro
  143919 reason=inaccessible
  193050 reason=pid_inactive

6.6.0-rc2-sched-numabselective-v1
     146 reason=seq_completed
     622 reason=ignore_pid_inactive
     624 reason=scan_delay
    6570 reason=unsuitable
   16101 reason=shared_ro
   27608 reason=inaccessible
   41939 reason=pid_inactive

Note that with the patch applied, the PID activity is ignored
(ignore_pid_inactive) to ensure a VMA with some activity is completely
scanned. In addition, a small number of VMAs are scanned when no other
eligible VMA is available during a single scan window (seq_completed).
The number of times a VMA is skipped due to no PID activity from the
scanning task (pid_inactive) drops dramatically. It is expected that
this will increase both the number of PTEs updated for NUMA hinting
and the number of hinting faults, but these represent PTEs that would
otherwise have been missed. The tradeoff is scan+fault overhead versus
improved locality due to migration.
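
(For anyone mapping the reason strings back to code: the checks that
emit seq_completed and pid_inactive live in the fair.c hunks that are
not quoted below. Schematically, inside the VMA walk of
task_numa_work(), the patch does something like the following; this is
a paraphrase, not the exact hunk:

	/* Do not rescan VMAs twice within the same sequence. */
	if (vma->numab_state->prev_scan_seq == mm->numa_scan_seq) {
		mm->numa_scan_offset = vma->vm_end;
		trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SEQ_COMPLETED);
		continue;
	}

	/* Skip inactive VMAs unless a previous pass found nothing else. */
	if (!vma_pids_forced && !vma_is_accessed(mm, vma)) {
		vma_pids_skipped = true;
		trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_PID_INACTIVE);
		continue;
	})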

On a 2-socket Cascade Lake test machine, the time to complete the
workload is as follows:

                                                6.6.0-rc2              6.6.0-rc2
                                      sched-numabtrace-v1 sched-numabselective-v1
Min       elsp-NUMA01_THREADLOCAL      174.22 (   0.00%)      117.64 (  32.48%)
Amean     elsp-NUMA01_THREADLOCAL      175.68 (   0.00%)      123.34 *  29.79%*
Stddev    elsp-NUMA01_THREADLOCAL        1.20 (   0.00%)        4.06 (-238.20%)
CoeffVar  elsp-NUMA01_THREADLOCAL        0.68 (   0.00%)        3.29 (-381.70%)
Max       elsp-NUMA01_THREADLOCAL      177.18 (   0.00%)      128.03 (  27.74%)

The time to complete the workload is reduced by almost 30%:

                              6.6.0-rc2               6.6.0-rc2
                    sched-numabtrace-v1 sched-numabselective-v1
Duration User       91201.80    63506.64
Duration System      2015.53     1819.78
Duration Elapsed     1234.77      868.37

In this specific case, system CPU time was not increased, but that is
not universally true.

From vmstat, the NUMA scanning and fault activity is as follows:

                                       6.6.0-rc2      6.6.0-rc2
                             sched-numabtrace-v1 sched-numabselective-v1
Ops NUMA base-page range updates       64272.00    26374386.00
Ops NUMA PTE updates                   36624.00       55538.00
Ops NUMA PMD updates                      54.00       51404.00
Ops NUMA hint faults                   15504.00       75786.00
Ops NUMA hint local faults %           14860.00       56763.00
Ops NUMA hint local percent               95.85          74.90
Ops NUMA pages migrated                 1629.00     6469222.00
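
(As a consistency check on those figures, assuming 4K base pages and
2M PMDs, i.e. 512 base pages per PMD on this x86_64 machine: the
base-page range updates should equal PTE updates + 512 * PMD updates,
and indeed

	36624 + 512 *    54 =    64272
	55538 + 512 * 51404 = 26374386

for the two kernels respectively.)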

Both the number of PTE updates and the number of hint faults are
dramatically increased. While this is superficially unfortunate, it
represents ranges that were simply skipped without the patch. As a
result of the scanning and hinting faults, many more pages were also
migrated, but as the time to completion is reduced, the overhead is
offset by the gain.

Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
---
  include/linux/mm_types.h             |  6 +++
  include/linux/sched/numa_balancing.h |  1 +
  include/trace/events/sched.h         |  3 +-
  kernel/sched/fair.c                  | 55 ++++++++++++++++++++++++++--
  4 files changed, 61 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8cb1dec3e358..a123c1a58617 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -578,6 +578,12 @@ struct vma_numab_state {
  						 * VMA_PID_RESET_PERIOD
  						 * jiffies.
  						 */
+	int prev_scan_seq;			/* MM scan sequence ID when
+						 * the VMA was last completely
+						 * scanned. A VMA is not
+						 * eligible for scanning if
+						 * prev_scan_seq == numa_scan_seq
+						 */
  };

/*
diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index 7dcc0bdfddbb..b69afb8630db 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -22,6 +22,7 @@ enum numa_vmaskip_reason {
  	NUMAB_SKIP_SCAN_DELAY,
  	NUMAB_SKIP_PID_INACTIVE,
  	NUMAB_SKIP_IGNORE_PID,
+	NUMAB_SKIP_SEQ_COMPLETED,
  };

#ifdef CONFIG_NUMA_BALANCING
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 27b51c81b106..010ba1b7cb0e 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -671,7 +671,8 @@ DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
  	EM( NUMAB_SKIP_INACCESSIBLE,		"inaccessible" )	\
  	EM( NUMAB_SKIP_SCAN_DELAY,		"scan_delay" )	\
  	EM( NUMAB_SKIP_PID_INACTIVE,		"pid_inactive" )	\
-	EMe(NUMAB_SKIP_IGNORE_PID,		"ignore_pid_inactive" )
+	EM( NUMAB_SKIP_IGNORE_PID,		"ignore_pid_inactive" )		\
+	EMe(NUMAB_SKIP_SEQ_COMPLETED,		"seq_completed" )

/* Redefine for export. */
  #undef EM
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 150f01948ec6..72ef60f394ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3175,6 +3175,8 @@ static void task_numa_work(struct callback_head *work)
  	unsigned long nr_pte_updates = 0;
  	long pages, virtpages;
  	struct vma_iterator vmi;
+	bool vma_pids_skipped;
+	bool vma_pids_forced = false;
	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));

@@ -3217,7 +3219,6 @@ static void task_numa_work(struct callback_head *work)
  	 */
  	p->node_stamp += 2 * TICK_NSEC;

-	start = mm->numa_scan_offset;
  	pages = sysctl_numa_balancing_scan_size;
  	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
  	virtpages = pages * 8;	   /* Scan up to this much virtual space */
@@ -3227,6 +3228,16 @@ static void task_numa_work(struct callback_head *work)

	if (!mmap_read_trylock(mm))
  		return;
+
+	/*
+	 * VMAs are skipped if the current PID has not trapped a fault within
+	 * the VMA recently. Allow scanning to be forced if there is no
+	 * suitable VMA remaining.
+	 */
+	vma_pids_skipped = false;
+
+retry_pids:
+	start = mm->numa_scan_offset;
  	vma_iter_init(&vmi, mm, start);
  	vma = vma_next(&vmi);
  	if (!vma) {
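
For anyone reading along: the retry_pids label above is taken near the
bottom of this function, in a hunk not quoted here. From my reading of
the full patch, the retry is triggered roughly like this:

	/* After the VMA walk completes without scanning anything: */
	if (!vma_pids_forced && vma_pids_skipped) {
		vma_pids_forced = true;
		goto retry_pids;
	}

so at most one forced extra pass over the address space is made per
task_numa_work() invocation.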
@@ -3277,6 +3288,13 @@ static void task_numa_work(struct callback_head *work)
  			/* Reset happens after 4 times scan delay of scan start */
  			vma->numab_state->pids_active_reset =  vma->numab_state->next_scan +
  				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+
+			/*
+			 * Ensure prev_scan_seq does not match numa_scan_seq
+			 * to prevent VMAs being skipped prematurely on the
+			 * first scan.
+			 */
+			 vma->numab_state->prev_scan_seq = mm->numa_scan_seq - 1;

nit:
Perhaps vma->numab_state->prev_scan_seq = -1 would also have worked,
but it does not matter.

  		}



