Re: [PATCH v7 2/3] tick/sched: Ensure quiet_vmstat() is called when the idle tick was stopped too

On Fri, Sep 09, 2022 at 02:12:24PM +0200, Frederic Weisbecker wrote:
> On Wed, Aug 17, 2022 at 04:13:48PM -0300, Marcelo Tosatti wrote:
> > From: Aaron Tomlin <atomlin@xxxxxxxxxx>
> > 
> > In the context of the idle task and an adaptive-tick mode/or a nohz_full
> > CPU, quiet_vmstat() can be called: before stopping the idle tick,
> > entering an idle state and on exit. In particular, for the latter case,
> > when the idle task is required to reschedule, the idle tick can remain
> > stopped
> 
> Since quiet_vmstat() is only called when ts->tick_stopped = false, this
> can only happen if the idle loop did not enter into dynticks idle mode
> but the exiting idle task eventually stops the tick
> (tick_nohz_idle_update_tick()).
> 
> This can happen for example if we enter the idle loop with a timer callback
> pending in one jiffies, then once that timer fires, which wakes up a task,
> we exit the idle loop and then tick_nohz_idle_update_tick() doesn't see any
> timer callback pending left and the tick can be stopped.
> 
> Or am I missing something?

For the scenario where we re-enter idle without calling quiet_vmstat:


CPU-0			CPU-1

0) vmstat_shepherd notices it is necessary to queue vmstat work
to a remote CPU, queues a deferrable timer into the timer wheel, and calls
trigger_dyntick_cpu (target_cpu == CPU-1).

			1) Stops the tick (get_next_timer_interrupt will not take deferrable
			   timers into account) and calls quiet_vmstat, which leaves the vmstat
			   work (the vmstat_update function) queued.
			2) Idle
			3) Idle exit
			4) Run thread on CPU, some activity marks vmstat dirty
			5) Idle
			6) Goto 3

At 5, since the tick is already stopped, the deferrable
timer for the delayed work item will not execute,
and vmstat_shepherd will consider the work still pending,
so it never requeues it:

static void vmstat_shepherd(struct work_struct *w)
{
        int cpu;

        cpus_read_lock();
        /* Check processors whose vmstat worker threads have been disabled */
        for_each_online_cpu(cpu) {
                struct delayed_work *dw = &per_cpu(vmstat_work, cpu);

                if (!delayed_work_pending(dw) && need_update(cpu))
                        queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);

                cond_resched();
        }
        cpus_read_unlock();

        schedule_delayed_work(&shepherd,
                round_jiffies_relative(sysctl_stat_interval));
}

As far as I can tell...

> > and the timer expiration time endless i.e., KTIME_MAX. Now,
> > indeed before a nohz_full CPU enters an idle state, CPU-specific vmstat
> > counters should be processed to ensure the respective values have been
> > reset and folded into the zone specific 'vm_stat[]'. That being said, it
> > can only occur when: the idle tick was previously stopped, and
> > reprogramming of the timer is not required.
> > 
> > A customer provided some evidence which indicates that the idle tick was
> > stopped; albeit, CPU-specific vmstat counters still remained populated.
> > Thus one can only assume quiet_vmstat() was not invoked on return to the
> > idle loop.
> > 
> > If I understand correctly, I suspect this divergence might erroneously
> > prevent a reclaim attempt by kswapd. If the number of zone specific free
> > pages are below their per-cpu drift value then
> > zone_page_state_snapshot() is used to compute a more accurate view of
> > the aforementioned statistic.  Thus any task blocked on the NUMA node
> > specific pfmemalloc_wait queue will be unable to make significant
> > progress via direct reclaim unless it is killed after being woken up by
> > kswapd (see throttle_direct_reclaim()).
> > 
> > Consider the following theoretical scenario:
> > 
> >         1.      CPU Y migrated running task A to CPU X that was
> >                 in an idle state i.e. waiting for an IRQ - not
> >                 polling; marked the current task on CPU X to
> >                 need/or require a reschedule i.e., set
> >                 TIF_NEED_RESCHED and invoked a reschedule IPI to
> >                 CPU X (see sched_move_task())
> 
> CPU Y is nohz_full right?
> 
> > 
> >         2.      CPU X acknowledged the reschedule IPI from CPU Y;
> >                 generic idle loop code noticed the
> >                 TIF_NEED_RESCHED flag against the idle task and
> >                 attempts to exit of the loop and calls the main
> >                 scheduler function i.e. __schedule().
> > 
> >                 Since the idle tick was previously stopped no
> >                 scheduling-clock tick would occur.
> >                 So, no deferred timers would be handled
> > 
> >         3.      Post transition to kernel execution Task A
> >                 running on CPU Y, indirectly released a few pages
> >                 (e.g. see __free_one_page()); CPU Y's
> >                 'vm_stat_diff[NR_FREE_PAGES]' was updated and zone
> >                 specific 'vm_stat[]' update was deferred as per the
> >                 CPU-specific stat threshold
> > 
> >         4.      Task A does invoke exit(2) and the kernel does
> >                 remove the task from the run-queue; the idle task
> >                 was selected to execute next since there are no
> >                 other runnable tasks assigned to the given CPU
> >                 (see pick_next_task() and pick_next_task_idle())
> 
> This happens on CPU X, right?
> 
> > 
> >         5.      On return to the idle loop since the idle tick
> >                 was already stopped and can remain so (see [1]
> >                 below) e.g. no pending soft IRQs, no attempt is
> >                 made to zero and fold CPU Y's vmstat counters
> >                 since reprogramming of the scheduling-clock tick
> >                 is not required/or needed (see [2])
> 
> And now back to CPU Y, confused...

Aaron, can you explain the diagram above? 

AFAIU the problem can also be understood with the simpler
explanation above.

> [...]
> > Index: linux-2.6/kernel/time/tick-sched.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/time/tick-sched.c
> > +++ linux-2.6/kernel/time/tick-sched.c
> > @@ -26,6 +26,7 @@
> >  #include <linux/posix-timers.h>
> >  #include <linux/context_tracking.h>
> >  #include <linux/mm.h>
> > +#include <linux/rcupdate.h>
> >  
> >  #include <asm/irq_regs.h>
> >  
> > @@ -519,6 +520,20 @@ void __tick_nohz_task_switch(void)
> >  	}
> >  }
> >  
> > +void __tick_nohz_user_enter_prepare(void)
> > +{
> > +	struct tick_sched *ts;
> > +
> > +	if (tick_nohz_full_cpu(smp_processor_id())) {
> > +		ts = this_cpu_ptr(&tick_cpu_sched);
> > +
> > +		if (ts->tick_stopped)
> > +			quiet_vmstat();
> 
> Wasn't it supposed to be part of the quiescing in task isolation
> mode?

Not requiring application changes seems useful (so if we can drop
the need for task isolation ioctls from userspace, that is better).

> Because currently vmstat is a deferrable timer but that deferrability
> may not apply to nohz_full anymore (outside idle). And quiet_vmstat()
> doesn't cancel the timer so you'll still get the disturbance.
> 
> See this patch: https://lore.kernel.org/lkml/20220725104356.GA2950296@lothringen/

Right.

But I think nohz_full applications would prefer not to be interrupted
by the vmstat_work in the first place.

> > +		rcu_nocb_flush_deferred_wakeup();
> > +	}
> > +}
> >
> > +EXPORT_SYMBOL_GPL(__tick_nohz_user_enter_prepare);
> > +
> >  /* Get the boot-time nohz CPU list from the kernel parameters. */
> >  void __init tick_nohz_full_setup(cpumask_var_t cpumask)
> >  {
> > @@ -890,6 +905,9 @@ static void tick_nohz_stop_tick(struct t
> >  		ts->do_timer_last = 0;
> >  	}
> >  
> > +	/* Attempt to fold when the idle tick is stopped or not */
> > +	quiet_vmstat();
> > +
> >  	/* Skip reprogram of event if its not changed */
> >  	if (ts->tick_stopped && (expires == ts->next_tick)) {
> >  		/* Sanity check: make sure clockevent is actually programmed */
> 
> But that chunk looks good.
> 
> Thanks.

Do you see any issue with syncing vmstat on return to userspace as
well, when the CPU is nohz_full and the tick is disabled?

Note the syscall entry/exit numbers: the overhead is minimal.

> > @@ -911,7 +929,6 @@ static void tick_nohz_stop_tick(struct t
> >  	 */
> >  	if (!ts->tick_stopped) {
> >  		calc_load_nohz_start();
> > -		quiet_vmstat();
> >  
> >  		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
> >  		ts->tick_stopped = 1;
> > 
> > 