On Fri, Sep 09, 2022 at 02:12:24PM +0200, Frederic Weisbecker wrote:
> On Wed, Aug 17, 2022 at 04:13:48PM -0300, Marcelo Tosatti wrote:
> > From: Aaron Tomlin <atomlin@xxxxxxxxxx>
> >
> > In the context of the idle task and an adaptive-tick mode/nohz_full
> > CPU, quiet_vmstat() can be called before stopping the idle tick,
> > when entering an idle state, and on exit. In particular, for the
> > latter case, when the idle task is required to reschedule, the idle
> > tick can remain stopped
>
> Since quiet_vmstat() is only called when ts->tick_stopped = false, this
> can only happen if the idle loop did not enter into dynticks idle mode
> but the exiting idle task eventually stops the tick
> (tick_nohz_idle_update_tick()).
>
> This can happen for example if we enter the idle loop with a timer
> callback pending in one jiffy, then once that timer fires, which wakes
> up a task, we exit the idle loop and then tick_nohz_idle_update_tick()
> doesn't see any timer callback pending left and the tick can be
> stopped.
>
> Or am I missing something?

For the scenario where we re-enter idle without calling quiet_vmstat():

CPU-0					CPU-1

0) vmstat_shepherd notices it is
   necessary to queue vmstat work
   on a remote CPU, queues the
   deferrable timer into the timer
   wheel and calls
   trigger_dyntick_cpu()
   (target_cpu == cpu-1).

					1) Stop the tick
					   (get_next_timer_interrupt() will
					   not take deferrable timers into
					   account), call quiet_vmstat(),
					   which keeps the vmstat work
					   (the vmstat_update() function)
					   queued.

					2) Idle
					3) Idle exit
					4) Run thread on CPU, some
					   activity marks vmstat dirty
					5) Idle
					6) Goto 3

At 5, since the tick is already stopped, the deferrable timer for the
delayed work item will not execute, and vmstat_shepherd will consider
the work still pending and therefore skip re-queueing it:

static void vmstat_shepherd(struct work_struct *w)
{
        int cpu;

        cpus_read_lock();
        /* Check processors whose vmstat worker threads have been disabled */
        for_each_online_cpu(cpu) {
                struct delayed_work *dw = &per_cpu(vmstat_work, cpu);

                if (!delayed_work_pending(dw) && need_update(cpu))
                        queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);

                cond_resched();
        }
        cpus_read_unlock();

        schedule_delayed_work(&shepherd,
                round_jiffies_relative(sysctl_stat_interval));
}

As far as I can tell, the counters dirtied at step 4 are then never
folded.
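To make the failure mode concrete, here is a throwaway userspace model
of steps 0-6 (all names and types here are mine, not kernel code): once
the work is marked pending while its deferrable timer can no longer
fire, every shepherd pass skips the CPU and the counters stay dirty.

#include <stdbool.h>
#include <stdio.h>

struct cpu_state {
        bool work_pending;      /* models delayed_work_pending(&vmstat_work) */
        bool timer_can_fire;    /* false once the idle tick is stopped */
        bool counters_dirty;    /* models need_update(cpu) */
};

/* Models the per-CPU body of the vmstat_shepherd() loop quoted above. */
static void shepherd_pass(struct cpu_state *c)
{
        if (!c->work_pending && c->counters_dirty)
                c->work_pending = true;         /* queue_delayed_work_on() */
}

/* Models the deferrable timer: it only fires while the tick runs. */
static void timer_tick(struct cpu_state *c)
{
        if (c->work_pending && c->timer_can_fire) {
                c->counters_dirty = false;      /* vmstat_update() folds */
                c->work_pending = false;
        }
}

int main(void)
{
        /* State after steps 0-4: work queued, tick stopped, new dirt. */
        struct cpu_state c = {
                .work_pending   = true,
                .timer_can_fire = false,
                .counters_dirty = true,
        };

        for (int pass = 0; pass < 3; pass++) {  /* steps 5-6, repeated */
                shepherd_pass(&c);      /* skipped: work already pending */
                timer_tick(&c);         /* never fires: deferrable, no tick */
        }
        printf("counters still dirty: %s\n", c.counters_dirty ? "yes" : "no");
        return 0;
}

Compiled standalone this prints "counters still dirty: yes", i.e. the
fold never happens.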
> > and the timer expiration time set to KTIME_MAX i.e., effectively
> > endless. Now, indeed before a nohz_full CPU enters an idle state,
> > CPU-specific vmstat counters should be processed to ensure the
> > respective values have been reset and folded into the zone specific
> > 'vm_stat[]'. That being said, it can only occur when: the idle tick
> > was previously stopped, and reprogramming of the timer is not
> > required.
> >
> > A customer provided some evidence which indicates that the idle
> > tick was stopped; albeit, CPU-specific vmstat counters still
> > remained populated. Thus one can only assume quiet_vmstat() was not
> > invoked on return to the idle loop.
> >
> > If I understand correctly, I suspect this divergence might
> > erroneously prevent a reclaim attempt by kswapd. If the number of
> > zone specific free pages is below the per-cpu drift value then
> > zone_page_state_snapshot() is used to compute a more accurate view
> > of the aforementioned statistic. Thus any task blocked on the NUMA
> > node specific pfmemalloc_wait queue will be unable to make
> > significant progress via direct reclaim unless it is killed after
> > being woken up by kswapd (see throttle_direct_reclaim()).
> >
> > Consider the following theoretical scenario:
> >
> > 1. CPU Y migrated running task A to CPU X that was
> >    in an idle state i.e. waiting for an IRQ - not
> >    polling; marked the current task on CPU X to
> >    need a reschedule i.e., set TIF_NEED_RESCHED
> >    and invoked a reschedule IPI to CPU X
> >    (see sched_move_task())
>
> CPU Y is nohz_full right?
>
> > 2. CPU X acknowledged the reschedule IPI from CPU Y;
> >    generic idle loop code noticed the
> >    TIF_NEED_RESCHED flag against the idle task and
> >    attempts to exit the loop and calls the main
> >    scheduler function i.e. __schedule().
> >
> >    Since the idle tick was previously stopped no
> >    scheduling-clock tick would occur.
> >    So, no deferred timers would be handled
> >
> > 3. Post transition to kernel execution Task A
> >    running on CPU Y, indirectly released a few pages
> >    (e.g. see __free_one_page()); CPU Y's
> >    'vm_stat_diff[NR_FREE_PAGES]' was updated and zone
> >    specific 'vm_stat[]' update was deferred as per the
> >    CPU-specific stat threshold
> >
> > 4. Task A does invoke exit(2) and the kernel does
> >    remove the task from the run-queue; the idle task
> >    was selected to execute next since there are no
> >    other runnable tasks assigned to the given CPU
> >    (see pick_next_task() and pick_next_task_idle())
>
> This happens on CPU X, right?
>
> > 5. On return to the idle loop since the idle tick
> >    was already stopped and can remain so (see [1]
> >    below) e.g. no pending soft IRQs, no attempt is
> >    made to zero and fold CPU Y's vmstat counters
> >    since reprogramming of the scheduling-clock tick
> >    is not required (see [2])
>
> And now back to CPU Y, confused...

Aaron, can you explain the diagram above? AFAIU the problem can also be
understood with the simpler explanation above.
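As background for the kswapd part of the changelog, a userspace sketch
(names are mine, loosely paraphrasing zone_page_state() versus
zone_page_state_snapshot(), not the kernel implementation) of how
unfolded per-CPU deltas make the cheap global counter diverge from the
accurate per-CPU summation:

#include <stdio.h>

#define NR_CPUS 4

static long vm_stat_free;               /* global NR_FREE_PAGES counter */
static long vm_stat_diff[NR_CPUS];      /* unfolded per-CPU deltas */

static long page_state(void)            /* models zone_page_state() */
{
        return vm_stat_free;
}

static long page_state_snapshot(void)   /* models zone_page_state_snapshot() */
{
        long x = vm_stat_free;

        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                x += vm_stat_diff[cpu];
        return x < 0 ? 0 : x;
}

int main(void)
{
        vm_stat_free = 128;
        vm_stat_diff[1] = 32;   /* pages freed on CPU 1, never folded */

        printf("cheap view: %ld pages, accurate view: %ld pages\n",
               page_state(), page_state_snapshot());
        return 0;
}

Here the cheap view reads 128 pages while the accurate view reads 160,
which is why watermark checks close to the drift value fall back to
the expensive snapshot.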
> [...]
> > Index: linux-2.6/kernel/time/tick-sched.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/time/tick-sched.c
> > +++ linux-2.6/kernel/time/tick-sched.c
> > @@ -26,6 +26,7 @@
> >  #include <linux/posix-timers.h>
> >  #include <linux/context_tracking.h>
> >  #include <linux/mm.h>
> > +#include <linux/rcupdate.h>
> >
> >  #include <asm/irq_regs.h>
> >
> > @@ -519,6 +520,20 @@ void __tick_nohz_task_switch(void)
> >  	}
> >  }
> >
> > +void __tick_nohz_user_enter_prepare(void)
> > +{
> > +	struct tick_sched *ts;
> > +
> > +	if (tick_nohz_full_cpu(smp_processor_id())) {
> > +		ts = this_cpu_ptr(&tick_cpu_sched);
> > +
> > +		if (ts->tick_stopped)
> > +			quiet_vmstat();
>
> Wasn't it supposed to be part of the quiescing in task isolation
> mode?

Not requiring application changes seems useful (so if we can drop the
need for task isolation ioctls from userspace, that is better).

> Because currently vmstat is a deferrable timer but that deferrability
> may not apply to nohz_full anymore (outside idle). And quiet_vmstat()
> doesn't cancel the timer so you'll still get the disturbance.
>
> See this patch: https://lore.kernel.org/lkml/20220725104356.GA2950296@lothringen/

Right. But I think the nohz_full applications prefer not to be
interrupted with the vmstat_work in the first place.

> > +		rcu_nocb_flush_deferred_wakeup();
> > +	}
> > +}
> > +EXPORT_SYMBOL_GPL(__tick_nohz_user_enter_prepare);
> > +
> >  /* Get the boot-time nohz CPU list from the kernel parameters. */
> >  void __init tick_nohz_full_setup(cpumask_var_t cpumask)
> >  {
> > @@ -890,6 +905,9 @@ static void tick_nohz_stop_tick(struct t
> >  		ts->do_timer_last = 0;
> >  	}
> >
> > +	/* Attempt to fold when the idle tick is stopped or not */
> > +	quiet_vmstat();
> > +
> >  	/* Skip reprogram of event if its not changed */
> >  	if (ts->tick_stopped && (expires == ts->next_tick)) {
> >  		/* Sanity check: make sure clockevent is actually programmed */
>
> But that chunk looks good.
>
> Thanks.

Do you see any issue with syncing the vmstat on return to userspace as
well, if nohz_full and the tick is disabled? Note the syscall
entry/exit numbers: the overhead is minimal.

> > @@ -911,7 +929,6 @@ static void tick_nohz_stop_tick(struct t
> >  	 */
> >  	if (!ts->tick_stopped) {
> >  		calc_load_nohz_start();
> > -		quiet_vmstat();
> >
> >  		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
> >  		ts->tick_stopped = 1;
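For completeness, a minimal userspace model (names are mine, not the
kernel code) contrasting the old transition-only placement of
quiet_vmstat() with the new unconditional call in
tick_nohz_stop_tick():

#include <stdbool.h>
#include <stdio.h>

static bool tick_stopped;
static bool vmstat_dirty;

static void quiet_vmstat(void)          /* models the real fold */
{
        vmstat_dirty = false;
}

static void stop_tick_old(void)         /* fold only on the transition */
{
        if (!tick_stopped) {
                quiet_vmstat();
                tick_stopped = true;
        }
}

static void stop_tick_new(void)         /* fold on every entry */
{
        quiet_vmstat();
        tick_stopped = true;
}

static bool run(void (*stop_tick)(void))
{
        tick_stopped = false;
        vmstat_dirty = false;

        stop_tick();            /* first idle entry */
        vmstat_dirty = true;    /* wakeup: task activity dirties vmstat */
        stop_tick();            /* idle re-entry, tick still stopped */
        return vmstat_dirty;
}

int main(void)
{
        printf("old: dirty after idle re-entry: %s\n",
               run(stop_tick_old) ? "yes" : "no");
        printf("new: dirty after idle re-entry: %s\n",
               run(stop_tick_new) ? "yes" : "no");
        return 0;
}

The old variant reports the counters still dirty after an idle
re-entry with the tick already stopped; the new variant folds them.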