On 2/3/2023 4:45 PM, Peter Zijlstra wrote:
On Wed, Feb 01, 2023 at 01:32:21PM +0530, Raghavendra K T wrote:
During NUMA scanning, make sure only the relevant VMAs of the
tasks are scanned.
Before:
All the tasks of a process participate in scanning a VMA
even if they never access that VMA in its lifespan.
Now:
Except for the first few unconditional scans, a task that does
not touch a VMA (excluding false positives caused by PID collisions)
no longer scans that VMA.
Logic used:
1) 6 bits of the PID are used to mark an active bit in the vma numab
status during a fault, to remember the PIDs accessing the vma. (Thanks Mel)
2) Subsequently, in the scan path, vma scanning is skipped if the current
PID has not accessed the vma.
3) The first two times, we allow an unconditional scan to preserve the
earlier scanning behaviour.
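
The three steps above can be modelled in user space roughly as follows. This is only a sketch under stated assumptions: `struct vma_model`, `MODEL_BITS`, `model_set_active_pid_bit()` and `model_vma_is_accessed()` are hypothetical names, a 64-bit mask stands in for the per-VMA state, and PID and scan sequence are passed as plain arguments instead of being read from `current`.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: a 64-bit mask, so pid % 64 keeps the low 6 bits of the PID. */
#define MODEL_BITS 64

/* Hypothetical stand-in for the per-VMA NUMA balancing state. */
struct vma_model {
	unsigned long long accessing_pids;	/* one bit per (pid % 64) bucket */
};

/* Step 1, fault path: remember that this PID touched the VMA. */
static void model_set_active_pid_bit(struct vma_model *vma, int pid)
{
	assert(pid >= 0);	/* a negative value would make the shift invalid */
	vma->accessing_pids |= 1ULL << (pid % MODEL_BITS);
}

/*
 * Steps 2 and 3, scan path: the first two scan sequences are always
 * allowed; afterwards the VMA is scanned only if the bucket of this
 * PID has been marked by an earlier fault.
 */
static bool model_vma_is_accessed(const struct vma_model *vma, int pid,
				  int scan_seq)
{
	assert(pid >= 0);
	if (scan_seq < 2)
		return true;
	return (vma->accessing_pids >> (pid % MODEL_BITS)) & 1ULL;
}
```

With this model, marking PID 1000 lets PID 1000 scan the VMA at any sequence, PID 1001 is skipped once the sequence reaches 2, and PID 1064 is a false positive because 1064 % 64 equals 1000 % 64.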
Acknowledgement to Bharata B Rao <bharata@xxxxxxx> for initial patch
to store pid information.
Suggested-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
Signed-off-by: Raghavendra K T <raghavendra.kt@xxxxxxx>
---
include/linux/mm.h | 14 ++++++++++++++
include/linux/mm_types.h | 1 +
kernel/sched/fair.c | 15 +++++++++++++++
mm/huge_memory.c | 1 +
mm/memory.c | 1 +
5 files changed, 32 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 74d9df1d8982..489422942482 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1381,6 +1381,16 @@ static inline int xchg_page_access_time(struct page *page, int time)
last_time = page_cpupid_xchg_last(page, time >> PAGE_ACCESS_TIME_BUCKETS);
return last_time << PAGE_ACCESS_TIME_BUCKETS;
}
+
+static inline void vma_set_active_pid_bit(struct vm_area_struct *vma)
+{
+ unsigned int active_pid_bit;
+
+ if (vma->numab) {
+ active_pid_bit = current->pid % BITS_PER_LONG;
+ vma->numab->accessing_pids |= 1UL << active_pid_bit;
+ }
+}
Perhaps:
	if (vma->numab)
		__set_bit(current->pid % BITS_PER_LONG, &vma->numab->pids);
?
Or maybe even:
	bit = current->pid % BITS_PER_LONG;
	if (vma->numab && !__test_bit(bit, &vma->numab->pids))
		__set_bit(bit, &vma->numab->pids);
Sure, I will use one of the above.
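
For reference, the test-before-set variant can be sketched in user space as below. The likely benefit is an assumption here and is not stated in the thread: once a thread's bit is set, its later faults only read the word and stop writing to memory that other threads also read. `model_mark_pid_once()` and `MODEL_PID_BITS` are hypothetical names, and the code is non-atomic, like the `__set_bit()` form suggested above.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PID_BITS 64	/* assumption: 64-bit mask */

/*
 * Set the bucket bit for pid only if it is still clear.
 * Returns true when a store was issued, false when the bit was
 * already set and the word was left untouched.
 */
static bool model_mark_pid_once(unsigned long long *pids, int pid)
{
	unsigned long long mask;

	assert(pid >= 0);
	mask = 1ULL << (pid % MODEL_PID_BITS);

	if (*pids & mask)
		return false;	/* read-only path for repeat faults */

	*pids |= mask;
	return true;
}
```

In this model the first fault from a given PID bucket performs the store and every later fault from the same bucket returns without modifying the mask.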
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 060b241ce3c5..3505ae57c07c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2916,6 +2916,18 @@ static void reset_ptenuma_scan(struct task_struct *p)
p->mm->numa_scan_offset = 0;
}
+static bool vma_is_accessed(struct vm_area_struct *vma)
+{
+ unsigned int active_pid_bit;
+
/*
* Tell us why 2....
*/
Agreed. The intent is to allow unconditional scans the first two
times so that the task/page relation can be built. I will experiment to
see whether we further need to allow two full passes of "multi-stage
node selection" (=4), to take care of early migration.
The only doubt I have is that numa_scan_seq is per mm: will that create
corner cases, or do we need a separate per-VMA count for when a new
VMA is created?
+ if (READ_ONCE(current->mm->numa_scan_seq) < 2)
+ return true;
+
+ active_pid_bit = current->pid % BITS_PER_LONG;
+
+ return vma->numab->accessing_pids & (1UL << active_pid_bit);
return __test_bit(current->pid % BITS_PER_LONG, &vma->numab->pids)
+}
+
/*
* The expensive part of numa migration is done from task_work context.
* Triggered from task_tick_numa().
@@ -3032,6 +3044,9 @@ static void task_numa_work(struct callback_head *work)
if (mm->numa_scan_seq && time_before(jiffies, vma->numab->next_scan))
continue;
/*
* tell us more...
*/
Sure. This is the core of the whole logic, where we want to confine
the VMA scan mostly to the PIDs of interest.
+ if (!vma_is_accessed(vma))
+ continue;
+
do {
start = max(start, vma->vm_start);
end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
This feels wrong. Specifically, we track numa_scan_offset per mm; now, if
we divide the threads into two disjoint groups, each only using their
own set of vmas (in fact quite common for workloads with proper data
partitioning), it is possible to consistently sample one set of threads
and thus not scan the other set of vmas.
It seems somewhat unlikely, but not impossible to create significant
unfairness.
Agreed, but that is the reason why we want to allow the first few
unconditional scans. Or am I missing something?
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 811d19b5c4f6..d908aa95f3c3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1485,6 +1485,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
bool was_writable = pmd_savedwrite(oldpmd);
int flags = 0;
+ vma_set_active_pid_bit(vma);
vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
spin_unlock(vmf->ptl);
diff --git a/mm/memory.c b/mm/memory.c
index 8c8420934d60..2ec3045cb8b3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4718,6 +4718,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
bool was_writable = pte_savedwrite(vmf->orig_pte);
int flags = 0;
+ vma_set_active_pid_bit(vma);
/*
* The "pte" at this point cannot be used safely without
* validation through pte_unmap_same(). It's of NUMA type but
Urghh... do_*numa_page() is two near identical functions.. is there
really no sane way to de-duplicate at least some of that?
Agreed. I will explore that and take it up as a separate TODO.
Also, is this placement right? You're marking the thread even before we
know there's even a page there. I would expect this somewhere around
where we track lastpid.
Good point. I will check this again.
Maybe numa_migrate_prep() ?
Yes, there was no need to record the accessing PID as early as above.