On 16-May-23 2:49 PM, Raghavendra K T wrote:
> With the numa scan enhancements [1], only the threads which had previously
> accessed a VMA are allowed to scan it.
>
> While this significantly reduced system time overhead, there were corner
> cases which genuinely need some relaxation. For e.g.,
>
> 1) Concern raised by PeterZ: if there are N disjoint sets of VMAs
> belonging to tasks, the unfairness in allowing these threads to scan could
> potentially amplify the side effect of some of the VMAs being left
> unscanned.
>
> 2) The LKP numa01 benchmark regression reported below.
>
> Currently this is handled by allowing the first two scans unconditionally,
> as indicated by mm->numa_scan_seq. This is imprecise since for some
> benchmarks VMA scanning might itself start at numa_scan_seq > 2.
>
> Solution:
> Allow unconditional scanning of a task's VMAs depending on the VMA size.
> This is achieved by maintaining a per-VMA scan counter, where
>
>   f(allowed_to_scan) = f(scan_counter < vma_size / scan_size)
>
> Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic")
> regression.
>
> Result:
> numa01_THREAD_ALLOC result on 6.4.0-rc1 (which has the numascan enhancement):
>
>                          base-numascan   base          base+fix
> real                     1m3.025s        1m24.163s     1m3.551s
> user                     213m44.232s     251m3.638s    219m55.662s
> sys                      6m26.598s       0m13.056s     2m35.767s
>
> numa_hit                 5478165         4395752       4907431
> numa_local               5478103         4395366       4907044
> numa_other               62              386           387
> numa_pte_updates         1989274         11606         1265014
> numa_hint_faults         1756059         515           1135804
> numa_hint_faults_local   971500          486           558076
> numa_pages_migrated      784211          29            577728
>
> Summary: The regression in base is recovered by allowing scanning as required.
>
> [1] https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@xxxxxxx/T/#t
>
> Reported-by: Aithal Srikanth <sraithal@xxxxxxx>
> Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> Closes: https://lore.kernel.org/lkml/db995c11-08ba-9abf-812f-01407f70a5d4@xxxxxxx/T/
> Signed-off-by: Raghavendra K T <raghavendra.kt@xxxxxxx>
> ---
>  include/linux/mm_types.h |  1 +
>  kernel/sched/fair.c      | 41 ++++++++++++++++++++++++++++++++--------
>  2 files changed, 34 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 306a3d1a0fa6..992e460a713e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -479,6 +479,7 @@ struct vma_numab_state {
>  	unsigned long next_scan;
>  	unsigned long next_pid_reset;
>  	unsigned long access_pids[2];
> +	unsigned int scan_counter;
>  };
>
>  /*
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 373ff5f55884..2c3e17e7fc2f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2931,20 +2931,34 @@ static void reset_ptenuma_scan(struct task_struct *p)
>  static bool vma_is_accessed(struct vm_area_struct *vma)
>  {
>  	unsigned long pids;
> +	unsigned int vma_size;
> +	unsigned int scan_threshold;
> +	unsigned int scan_size;
> +
> +	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
> +
> +	if (test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids))
> +		return true;
> +
> +	scan_size = READ_ONCE(sysctl_numa_balancing_scan_size);
> +	/* vma size in MB */
> +	vma_size = (vma->vm_end - vma->vm_start) >> 20;
> +
> +	/* Total scans needed to cover VMA */
> +	scan_threshold = (vma_size / scan_size);
> +
>  	/*
> -	 * Allow unconditional access first two times, so that all the (pages)
> -	 * of VMAs get prot_none fault introduced irrespective of accesses.
> +	 * Allow the scanning of half of disjoint set's VMA to induce
> +	 * prot_none fault irrespective of accesses.
>  	 * This is also done to avoid any side effect of task scanning
>  	 * amplifying the unfairness of disjoint set of VMAs' access.
>  	 */
> -	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
> -		return true;
> -
> -	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
> -	return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
> +	scan_threshold = 1 + (scan_threshold >> 1);
> +	return (READ_ONCE(vma->numab_state->scan_counter) <= scan_threshold);
>  }
>
> -#define VMA_PID_RESET_PERIOD	(4 * sysctl_numa_balancing_scan_delay)
> +#define VMA_PID_RESET_PERIOD			(4 * sysctl_numa_balancing_scan_delay)
> +#define DISJOINT_VMA_SCAN_RENEW_THRESH	16
>
>  /*
>   * The expensive part of numa migration is done from task_work context.
> @@ -3058,6 +3072,8 @@ static void task_numa_work(struct callback_head *work)
>  			/* Reset happens after 4 times scan delay of scan start */
>  			vma->numab_state->next_pid_reset = vma->numab_state->next_scan +
>  				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
> +
> +			WRITE_ONCE(vma->numab_state->scan_counter, 0);
>  		}
>
>  		/*
> @@ -3068,6 +3084,13 @@ static void task_numa_work(struct callback_head *work)
>  				vma->numab_state->next_scan))
>  			continue;
>
> +		/*
> +		 * For long running tasks, renew the disjoint vma scanning
> +		 * periodically.
> +		 */
> +		if (mm->numa_scan_seq && !(mm->numa_scan_seq % DISJOINT_VMA_SCAN_RENEW_THRESH))

Don't you need a READ_ONCE() accessor for mm->numa_scan_seq?

Regards,
Bharata.
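For illustration, a minimal sketch of what that hunk could look like with the suggested accessor; the local variable name and the elided if-body are assumptions, not taken from the posted patch:

		/*
		 * Sketch only: read numa_scan_seq once so the zero check and
		 * the modulo test operate on the same snapshot of the counter.
		 */
		unsigned int scan_seq = READ_ONCE(mm->numa_scan_seq);

		/*
		 * For long running tasks, renew the disjoint vma scanning
		 * periodically.
		 */
		if (scan_seq && !(scan_seq % DISJOINT_VMA_SCAN_RENEW_THRESH)) {
			/* ... per-VMA renewal as in the patch (body not shown in the quoted hunk) ... */
		}

In practice the declaration would be hoisted to the top of the enclosing block to keep with kernel coding style.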