On Tue, Jun 21, 2022 at 06:22:07PM +0100, Aaron Tomlin wrote:
> Hi Frederic and Marcelo,
>
> I have incorporated an idea from Marcelo's patch [1] where a CPU-specific
> variable is used to indicate if a vmstat differential/or imbalance is
> present for a given CPU. So, at the appropriate time, vmstat processing can
> be initiated. The hope is that this particular approach is "cheaper" when
> compared to need_update() - used currently; in the context of nohz_full and
> the scheduling-clock tick being stopped, we would now with this patch,
> check if a CPU-specific vmstat imbalance is present before exiting
> user-mode (see tick_nohz_user_enter_prepare()).
>
> This trivial test program [2] was used to determine the somewhat impact
> under vanilla and with the proposed changes; mlock(2) and munlock(2) was
> used solely to modify vmstat item 'NR_MLOCK'. The following is an average
> count of CPU-cycles across the aforementioned system calls and the idle
> loop, respectively. I believe these results are negligible:
>
>   Modified                       |  Vanilla
>                                  |
>                                  |
>   cycles per syscall: 7399       |  cycles per syscall: 4150
>   cycles per idle loop: 141048   |  cycles per idle loop: 144730
>                                  |
>
>
> Any feedback would be appreciated. Thanks.
>
> Changes since v3 [3]:
>
>  - Used EXPORT_SYMBOL() on tick_nohz_user_enter_prepare()
>  - Replaced need_update()
>  - Introduced CPU-specific variable namely 'vmstat_dirty'
>    and mark_vmstat_dirty()
>
> [1]: https://lore.kernel.org/lkml/20220204173554.763888172@fedora.localdomain/
> [2]: https://pastebin.com/8AtzSAuK
> [3]: https://lore.kernel.org/lkml/20220422193647.3808657-1-atomlin@xxxxxxxxxx/
>
> --
>
> In the context of the idle task and an adaptive-tick mode/or a nohz_full
> CPU, quiet_vmstat() can be called: before stopping the idle tick,
> entering an idle state and on exit. In particular, for the latter case,
> when the idle task is required to reschedule, the idle tick can remain
> stopped and the timer expiration time endless i.e., KTIME_MAX.
> Now, indeed before a nohz_full CPU enters an idle state, CPU-specific
> vmstat counters should be processed to ensure the respective values have
> been reset and folded into the zone specific 'vm_stat[]'. That being said,
> it can only occur when: the idle tick was previously stopped, and
> reprogramming of the timer is not required.
>
> A customer provided some evidence which indicates that the idle tick was
> stopped; albeit, CPU-specific vmstat counters still remained populated.
> Thus one can only assume quiet_vmstat() was not invoked on return to the
> idle loop.
>
> If I understand correctly, I suspect this divergence might erroneously
> prevent a reclaim attempt by kswapd. If the number of zone specific free
> pages are below their per-cpu drift value then
> zone_page_state_snapshot() is used to compute a more accurate view of
> the aforementioned statistic. Thus any task blocked on the NUMA node
> specific pfmemalloc_wait queue will be unable to make significant
> progress via direct reclaim unless it is killed after being woken up by
> kswapd (see throttle_direct_reclaim()).
>
> Consider the following theoretical scenario:
>
>  1. CPU Y migrated running task A to CPU X that was
>     in an idle state i.e. waiting for an IRQ - not
>     polling; marked the current task on CPU X to
>     need/or require a reschedule i.e., set
>     TIF_NEED_RESCHED and invoked a reschedule IPI to
>     CPU X (see sched_move_task())
>
>  2. CPU X acknowledged the reschedule IPI from CPU Y;
>     generic idle loop code noticed the
>     TIF_NEED_RESCHED flag against the idle task and
>     attempts to exit of the loop and calls the main
>     scheduler function i.e. __schedule().
>
>     Since the idle tick was previously stopped no
>     scheduling-clock tick would occur.
>     So, no deferred timers would be handled
>
>  3. Post transition to kernel execution Task A
>     running on CPU Y, indirectly released a few pages
>     (e.g.
>     see __free_one_page()); CPU Y's
>     'vm_stat_diff[NR_FREE_PAGES]' was updated and zone
>     specific 'vm_stat[]' update was deferred as per the
>     CPU-specific stat threshold
>
>  4. Task A does invoke exit(2) and the kernel does
>     remove the task from the run-queue; the idle task
>     was selected to execute next since there are no
>     other runnable tasks assigned to the given CPU
>     (see pick_next_task() and pick_next_task_idle())
>
>  5. On return to the idle loop since the idle tick
>     was already stopped and can remain so (see [1]
>     below) e.g. no pending soft IRQs, no attempt is
>     made to zero and fold CPU Y's vmstat counters
>     since reprogramming of the scheduling-clock tick
>     is not required/or needed (see [2])
>
>   ...
>   do_idle
>   {
>
>     __current_set_polling()
>     tick_nohz_idle_enter()
>
>     while (!need_resched()) {
>
>       local_irq_disable()
>
>       ...
>
>       /* No polling or broadcast event */
>       cpuidle_idle_call()
>       {
>
>         if (cpuidle_not_available(drv, dev)) {
>           tick_nohz_idle_stop_tick()
>             __tick_nohz_idle_stop_tick(this_cpu_ptr(&tick_cpu_sched))
>             {
>               int cpu = smp_processor_id()
>
>               if (ts->timer_expires_base)
>                 expires = ts->timer_expires
>               else if (can_stop_idle_tick(cpu, ts))
>     (1) ------->  expires = tick_nohz_next_event(ts, cpu)
>               else
>                 return
>
>               ts->idle_calls++
>
>               if (expires > 0LL) {
>
>                 tick_nohz_stop_tick(ts, cpu)
>                 {
>
>                   if (ts->tick_stopped && (expires == ts->next_tick)) {
>     (2) ------->    if (tick == KTIME_MAX || ts->next_tick ==
>                         hrtimer_get_expires(&ts->sched_timer))
>                       return
>                   }
>                   ...
>                 }
>
> So the idea of with this patch is to ensure refresh_cpu_vm_stats(false) is
> called, when it is appropriate, on return to the idle loop when the idle
> tick was previously stopped too. Additionally, in the context of
> nohz_full, when the scheduling-tick is stopped and before exiting
> to user-mode, ensure no CPU-specific vmstat differentials remain.
>
> Signed-off-by: Aaron Tomlin <atomlin@xxxxxxxxxx>
> ---
>  include/linux/tick.h     |  9 ++------
>  kernel/time/tick-sched.c | 19 ++++++++++++++++-
>  mm/vmstat.c              | 46 +++++++++++++---------------------------
>  3 files changed, 35 insertions(+), 39 deletions(-)
>
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index bfd571f18cfd..4c576c9ca0a2 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -11,7 +11,6 @@
>  #include <linux/context_tracking_state.h>
>  #include <linux/cpumask.h>
>  #include <linux/sched.h>
> -#include <linux/rcupdate.h>
>
>  #ifdef CONFIG_GENERIC_CLOCKEVENTS
>  extern void __init tick_init(void);
> @@ -123,6 +122,8 @@ enum tick_dep_bits {
>  #define TICK_DEP_MASK_RCU		(1 << TICK_DEP_BIT_RCU)
>  #define TICK_DEP_MASK_RCU_EXP		(1 << TICK_DEP_BIT_RCU_EXP)
>
> +void tick_nohz_user_enter_prepare(void);
> +
>  #ifdef CONFIG_NO_HZ_COMMON
>  extern bool tick_nohz_enabled;
>  extern bool tick_nohz_tick_stopped(void);
> @@ -305,10 +306,4 @@ static inline void tick_nohz_task_switch(void)
>  		__tick_nohz_task_switch();
>  }
>
> -static inline void tick_nohz_user_enter_prepare(void)
> -{
> -	if (tick_nohz_full_cpu(smp_processor_id()))
> -		rcu_nocb_flush_deferred_wakeup();
> -}
> -
>  #endif
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index d257721c68b8..4cdd71cc292f 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -26,6 +26,7 @@
>  #include <linux/posix-timers.h>
>  #include <linux/context_tracking.h>
>  #include <linux/mm.h>
> +#include <linux/rcupdate.h>
>
>  #include <asm/irq_regs.h>
>
> @@ -43,6 +44,20 @@ struct tick_sched *tick_get_tick_sched(int cpu)
>  	return &per_cpu(tick_cpu_sched, cpu);
>  }
>
> +void tick_nohz_user_enter_prepare(void)
> +{
> +	struct tick_sched *ts;
> +
> +	if (tick_nohz_full_cpu(smp_processor_id())) {
> +		ts = this_cpu_ptr(&tick_cpu_sched);
> +
> +		if (ts->tick_stopped)
> +			quiet_vmstat();
> +		rcu_nocb_flush_deferred_wakeup();
> +	}
> +}
> +EXPORT_SYMBOL(tick_nohz_user_enter_prepare);
> +
>  #if defined(CONFIG_NO_HZ_COMMON) || defined(CONFIG_HIGH_RES_TIMERS)
>  /*
>   * The time, when the last jiffy update happened. Write access must hold
> @@ -891,6 +906,9 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
>  		ts->do_timer_last = 0;
>  	}
>
> +	/* Attempt to fold when the idle tick is stopped or not */
> +	quiet_vmstat();
> +
>  	/* Skip reprogram of event if its not changed */
>  	if (ts->tick_stopped && (expires == ts->next_tick)) {
>  		/* Sanity check: make sure clockevent is actually programmed */
> @@ -912,7 +930,6 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
>  	 */
>  	if (!ts->tick_stopped) {
>  		calc_load_nohz_start();
> -		quiet_vmstat();
>
>  		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
>  		ts->tick_stopped = 1;
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index b75b1a64b54c..7bfcafafe8f7 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -195,6 +195,12 @@ void fold_vm_numa_events(void)
>  #endif
>
>  #ifdef CONFIG_SMP
> +static DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);
> +
> +static inline void mark_vmstat_dirty(void)
> +{
> +	this_cpu_write(vmstat_dirty, true);
> +}
>
>  int calculate_pressure_threshold(struct zone *zone)
>  {
> @@ -367,6 +373,7 @@ void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
>  		x = 0;
>  	}
>  	__this_cpu_write(*p, x);
> +	mark_vmstat_dirty();
>
>  	if (IS_ENABLED(CONFIG_PREEMPT_RT))
>  		preempt_enable();
> @@ -405,6 +412,7 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
>  		x = 0;
>  	}
>  	__this_cpu_write(*p, x);
> +	mark_vmstat_dirty();
>
>  	if (IS_ENABLED(CONFIG_PREEMPT_RT))
>  		preempt_enable();
> @@ -603,6 +611,7 @@ static inline void mod_zone_state(struct zone *zone,
>
>  	if (z)
>  		zone_page_state_add(z, zone, item);
> +	mark_vmstat_dirty();
>  }
>
>  void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
> @@ -671,6 +680,7 @@ static inline void mod_node_state(struct pglist_data *pgdat,
>
>  	if (z)
>  		node_page_state_add(z, pgdat, item);
> +	mark_vmstat_dirty();
>  }
>
>  void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
> @@ -1866,6 +1876,7 @@ int sysctl_stat_interval __read_mostly = HZ;
>  static void refresh_vm_stats(struct work_struct *work)
>  {
>  	refresh_cpu_vm_stats(true);
> +	this_cpu_write(vmstat_dirty, false);
>  }
>
>  int vmstat_refresh(struct ctl_table *table, int write,
> @@ -1930,6 +1941,7 @@ int vmstat_refresh(struct ctl_table *table, int write,
>  static void vmstat_update(struct work_struct *w)
>  {
>  	if (refresh_cpu_vm_stats(true)) {
> +		this_cpu_write(vmstat_dirty, false);
>  		/*
>  		 * Counters were updated so we expect more updates
>  		 * to occur in the future. Keep on running the
> @@ -1941,35 +1953,6 @@ static void vmstat_update(struct work_struct *w)
>  	}
>  }
>
> -/*
> - * Check if the diffs for a certain cpu indicate that
> - * an update is needed.
> - */
> -static bool need_update(int cpu)
> -{
> -	pg_data_t *last_pgdat = NULL;
> -	struct zone *zone;
> -
> -	for_each_populated_zone(zone) {
> -		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
> -		struct per_cpu_nodestat *n;
> -
> -		/*
> -		 * The fast way of checking if there are any vmstat diffs.
> -		 */
> -		if (memchr_inv(pzstats->vm_stat_diff, 0, sizeof(pzstats->vm_stat_diff)))
> -			return true;
> -
> -		if (last_pgdat == zone->zone_pgdat)
> -			continue;
> -		last_pgdat = zone->zone_pgdat;
> -		n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
> -		if (memchr_inv(n->vm_node_stat_diff, 0, sizeof(n->vm_node_stat_diff)))
> -			return true;
> -	}
> -	return false;
> -}
> -
>  /*
>   * Switch off vmstat processing and then fold all the remaining differentials
>   * until the diffs stay at zero. The function is used by NOHZ and can only be
> @@ -1983,7 +1966,7 @@ void quiet_vmstat(void)
>  	if (!delayed_work_pending(this_cpu_ptr(&vmstat_work)))
>  		return;
>
> -	if (!need_update(smp_processor_id()))
> +	if (!__this_cpu_read(vmstat_dirty))
>  		return;
>
>  	/*
> @@ -1993,6 +1976,7 @@ void quiet_vmstat(void)
>  	 * vmstat_shepherd will take care about that for us.
>  	 */
>  	refresh_cpu_vm_stats(false);
> +	__this_cpu_write(vmstat_dirty, false);
>  }
>
>  /*
> @@ -2014,7 +1998,7 @@ static void vmstat_shepherd(struct work_struct *w)
>  	for_each_online_cpu(cpu) {
>  		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
>
> -		if (!delayed_work_pending(dw) && need_update(cpu))
> +		if (!delayed_work_pending(dw) && per_cpu(vmstat_dirty, cpu))
>  			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
>
>  		cond_resched();
> --
> 2.34.3
>

With attached test program and:

# dmesg | grep isol
[    0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-tip root=/dev/mapper/fedora-root ro resume=/dev/mapper/fedora-swap rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap rhgb quiet skew_tick=1 isolcpus=managed_irq,domain,5,6,7 intel_pstate=disable nosoftlockup tsc=nowatchdog nohz=on nohz_full=5,6,7 rcu_nocbs=5,6,7 irqaffinity=0,1,2,3,4 console=ttyS0,115200 nomodeset
[    0.022862] Kernel command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-tip root=/dev/mapper/fedora-root ro resume=/dev/mapper/fedora-swap rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap rhgb quiet skew_tick=1 isolcpus=managed_irq,domain,5,6,7 intel_pstate=disable nosoftlockup tsc=nowatchdog nohz=on nohz_full=5,6,7 rcu_nocbs=5,6,7 irqaffinity=0,1,2,3,4 console=ttyS0,115200 nomodeset
[    0.041232] Kernel/User page tables isolation: enabled

# taskset -c 5 ./test-vmstat-overhead {syscall,idle}

I get:

                        Vanilla   -v4

cycles per syscall:     7173      6825
cycles per syscall:     7557      6744

cycles per idle loop:   154187    152788
cycles per idle loop:   154398    151593
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>

typedef unsigned long long cycles_t;
typedef unsigned long long usecs_t;
typedef unsigned long long u64;

#ifdef __x86_64__
#define DECLARE_ARGS(val, low, high)	unsigned long low, high
#define EAX_EDX_VAL(val, low, high)	((low) | ((u64)(high) << 32))
#define EAX_EDX_ARGS(val, low, high)	"a" (low), "d" (high)
#define EAX_EDX_RET(val, low, high)	"=a" (low), "=d" (high)
#else
#define DECLARE_ARGS(val, low, high)	unsigned long long val
#define EAX_EDX_VAL(val, low, high)	(val)
#define EAX_EDX_ARGS(val, low, high)	"A" (val)
#define EAX_EDX_RET(val, low, high)	"=A" (val)
#endif

static inline unsigned long long __rdtscll(void)
{
	DECLARE_ARGS(val, low, high);

	/* cpuid serializes; it also overwrites ebx and ecx, so mark them clobbered */
	asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high) : : "ebx", "ecx");

	return EAX_EDX_VAL(val, low, high);
}

#define rdtscll(val) do { (val) = __rdtscll(); } while (0)

#define NRSYSCALLS 30000
#define NRSLEEPS 100000

int main(int argc, char *argv[])
{
	cycles_t a, b, cycles;
	int i, syscall = 0;
	void *page = malloc(4096);

	if (mlock(page, 4096))
		perror("mlock");

	if (munlock(page, 4096))
		perror("munlock");

	if (argc != 2) {
		printf("usage: %s {idle,syscall}\n", argv[0]);
		exit(1);
	}

	rdtscll(a);

	if (strncmp("idle", argv[1], 4) == 0)
		syscall = 0;
	else if (strncmp("syscall", argv[1], 7) == 0)
		syscall = 1;
	else {
		printf("usage: %s {idle,syscall}\n", argv[0]);
		exit(1);
	}

	if (syscall == 1) {
		/* each iteration issues two system calls that touch NR_MLOCK */
		for (i = 0; i < NRSYSCALLS; i++) {
			if (mlock(page, 4096))
				perror("mlock");

			if (munlock(page, 4096))
				perror("munlock");
		}
	} else {
		/* short sleeps push the CPU through the idle loop */
		for (i = 0; i < NRSLEEPS; i++)
			usleep(10);
	}

	rdtscll(b);

	cycles = b - a;

	/* cycles is 64-bit, so print with %llu rather than %d */
	if (syscall == 1)
		printf("cycles per syscall: %llu\n", cycles / (NRSYSCALLS * 2));
	else
		printf("cycles per idle loop: %llu\n", cycles / NRSLEEPS);

	return 0;
}