On Fri 2022-09-09 16:35 -0300, Marcelo Tosatti wrote:
> For the scenario where we re-enter idle without calling quiet_vmstat:
>
> CPU-0                                   CPU-1
>
> 0) vmstat_shepherd notices it is necessary to queue vmstat work
>    to a remote CPU, queues the deferrable timer into the timer
>    wheel, and calls trigger_dyntick_cpu (target_cpu == cpu-1).
>
> 1) Stop the tick (get_next_timer_interrupt will not take
>    deferrable timers into account), call quiet_vmstat, which
>    keeps the vmstat work (the vmstat_update function) queued.
> 2) Idle
> 3) Idle exit
> 4) Run thread on CPU; some activity marks vmstat dirty
> 5) Idle
> 6) Goto 3
>
> At 5, since the tick is already stopped, the deferrable timer for
> the delayed work item will not execute, and vmstat_shepherd will
> consider the work still pending and therefore skip re-queueing it:
>
> static void vmstat_shepherd(struct work_struct *w)
> {
>         int cpu;
>
>         cpus_read_lock();
>         /* Check processors whose vmstat worker threads have been disabled */
>         for_each_online_cpu(cpu) {
>                 struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
>
>                 if (!delayed_work_pending(dw) && need_update(cpu))
>                         queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
>
>                 cond_resched();
>         }
>         cpus_read_unlock();
>
>         schedule_delayed_work(&shepherd,
>                 round_jiffies_relative(sysctl_stat_interval));
> }
>
> As far as I can tell...

Hi Marcelo,

Yes, I agree with the scenario above.

> > > Consider the following theoretical scenario:
> > >
> > > 1. CPU Y migrated running task A to CPU X that was
> > >    in an idle state i.e. waiting for an IRQ - not
> > >    polling; marked the current task on CPU X as
> > >    needing a reschedule i.e., set TIF_NEED_RESCHED
> > >    and invoked a reschedule IPI to CPU X (see
> > >    sched_move_task())
> >
> > CPU Y is nohz_full, right?
> >
> > > 2. CPU X acknowledged the reschedule IPI from CPU Y;
> > >    generic idle loop code noticed the TIF_NEED_RESCHED
> > >    flag against the idle task and attempted to exit
> > >    the loop and call the main scheduler function
> > >    i.e. __schedule().
> > >
> > >    Since the idle tick was previously stopped, no
> > >    scheduling-clock tick would occur; so, no deferred
> > >    timers would be handled
> > >
> > > 3. Post transition to kernel execution, task A running
> > >    on CPU Y indirectly released a few pages (e.g. see
> > >    __free_one_page()); CPU Y's
> > >    'vm_stat_diff[NR_FREE_PAGES]' was updated and the
> > >    zone-specific 'vm_stat[]' update was deferred as per
> > >    the CPU-specific stat threshold
> > >
> > > 4. Task A invoked exit(2) and the kernel removed the
> > >    task from the run-queue; the idle task was selected
> > >    to execute next since there were no other runnable
> > >    tasks assigned to the given CPU (see
> > >    pick_next_task() and pick_next_task_idle())
> >
> > This happens on CPU X, right?
> >
> > > 5. On return to the idle loop, since the idle tick was
> > >    already stopped and can remain so (see [1] below)
> > >    e.g. no pending soft IRQs, no attempt is made to
> > >    zero and fold CPU Y's vmstat counters since
> > >    reprogramming of the scheduling-clock tick is not
> > >    required (see [2])
> >
> > And now back to CPU Y, confused...
>
> Aaron, can you explain the diagram above?

Hi Frederic,

Sorry about that. How about the following:

- Note: CPU X is part of 'tick_nohz_full_mask'

1. CPU Y migrated running task A to CPU X, which was in an
   idle state i.e. waiting for an IRQ; marked the current
   task on CPU X as needing a reschedule i.e., set
   TIF_NEED_RESCHED and invoked a reschedule IPI to CPU X
   (see sched_move_task())

2. CPU X acknowledged the reschedule IPI. Generic idle loop
   code noticed the TIF_NEED_RESCHED flag against the idle
   task and attempted to exit the loop and call the main
   scheduler function i.e. __schedule().

   Since the idle tick was previously stopped, no
   scheduling-clock tick would occur; so, no deferred
   timers would be handled

3. Post transition to kernel execution, task A running on
   CPU X indirectly released a few pages (e.g.
   see __free_one_page()); CPU X's
   'vm_stat_diff[NR_FREE_PAGES]' was updated and the
   zone-specific 'vm_stat[]' update was deferred as per the
   CPU-specific stat threshold

4. Task A invoked exit(2) and the kernel removed the task
   from the run-queue; the idle task was selected to
   execute next since there were no other runnable tasks
   assigned to the given CPU (see pick_next_task() and
   pick_next_task_idle())

5. On return to the idle loop, since the idle tick was
   already stopped and can remain so (see [1] below) e.g.
   no pending soft IRQs, no attempt is made to zero and
   fold CPU X's vmstat counters since reprogramming of the
   scheduling-clock tick is not required (see [2])

Kind regards,

-- 
Aaron Tomlin