On Fri 2022-09-09 16:35 -0300, Marcelo Tosatti wrote:
> For the scenario where we re-enter idle without calling quiet_vmstat:
>
> CPU-0                                   CPU-1
>
> 0) vmstat_shepherd notices it is necessary to queue vmstat work
>    to a remote CPU, queues the deferrable timer into the timer
>    wheel, and calls trigger_dyntick_cpu (target_cpu == cpu-1).
>
> 1) Stop the tick (get_next_timer_interrupt will not take
>    deferrable timers into account), call quiet_vmstat, which
>    keeps the vmstat work (the vmstat_update function) queued.
> 2) Idle
> 3) Idle exit
> 4) Run thread on CPU; some activity marks vmstat dirty
> 5) Idle
> 6) Goto 3
>
> At 5, since the tick is already stopped, the deferrable timer for
> the delayed work item will not execute, and vmstat_shepherd will
> consider the work still pending and therefore skip re-queueing it:
>
> static void vmstat_shepherd(struct work_struct *w)
> {
>         int cpu;
>
>         cpus_read_lock();
>         /* Check processors whose vmstat worker threads have been disabled */
>         for_each_online_cpu(cpu) {
>                 struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
>
>                 if (!delayed_work_pending(dw) && need_update(cpu))
>                         queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
>
>                 cond_resched();
>         }
>         cpus_read_unlock();
>
>         schedule_delayed_work(&shepherd,
>                 round_jiffies_relative(sysctl_stat_interval));
> }
>
> As far as I can tell...

Hi Marcelo,

Yes, I agree with the scenario above.

> > > Consider the following theoretical scenario:
> > >
> > > 1. CPU Y migrated running task A to CPU X that was
> > >    in an idle state i.e. waiting for an IRQ - not
> > >    polling; marked the current task on CPU X as
> > >    needing a reschedule i.e., set TIF_NEED_RESCHED
> > >    and invoked a reschedule IPI to CPU X (see
> > >    sched_move_task())
> >
> > CPU Y is nohz_full, right?
> >
> > > 2. CPU X acknowledged the reschedule IPI from CPU Y;
> > >    generic idle loop code noticed the TIF_NEED_RESCHED
> > >    flag against the idle task and attempted to exit
> > >    the loop and call the main scheduler function
> > >    i.e. __schedule().
> > >
> > >    Since the idle tick was previously stopped, no
> > >    scheduling-clock tick would occur; so, no deferred
> > >    timers would be handled
> > >
> > > 3. Post transition to kernel execution, task A running
> > >    on CPU Y indirectly released a few pages (e.g. see
> > >    __free_one_page()); CPU Y's
> > >    'vm_stat_diff[NR_FREE_PAGES]' was updated and the
> > >    zone-specific 'vm_stat[]' update was deferred as per
> > >    the CPU-specific stat threshold
> > >
> > > 4. Task A invoked exit(2) and the kernel removed the
> > >    task from the run-queue; the idle task was selected
> > >    to execute next since there were no other runnable
> > >    tasks assigned to the given CPU (see
> > >    pick_next_task() and pick_next_task_idle())
> >
> > This happens on CPU X, right?
> >
> > > 5. On return to the idle loop, since the idle tick was
> > >    already stopped and can remain so (see [1] below)
> > >    e.g. no pending soft IRQs, no attempt is made to
> > >    zero and fold CPU Y's vmstat counters since
> > >    reprogramming of the scheduling-clock tick is not
> > >    required (see [2])
> >
> > And now back to CPU Y, confused...
>
> Aaron, can you explain the diagram above?

Hi Frederic,

Sorry about that. How about the following:

- Note: CPU X is part of 'tick_nohz_full_mask'

1. CPU Y migrated running task A to CPU X, which was in an
   idle state i.e. waiting for an IRQ; marked the current
   task on CPU X as needing a reschedule i.e., set
   TIF_NEED_RESCHED and invoked a reschedule IPI to CPU X
   (see sched_move_task())

2. CPU X acknowledged the reschedule IPI. Generic idle loop
   code noticed the TIF_NEED_RESCHED flag against the idle
   task and attempted to exit the loop and call the main
   scheduler function i.e. __schedule().

   Since the idle tick was previously stopped, no
   scheduling-clock tick would occur; so, no deferred
   timers would be handled

3. Post transition to kernel execution, task A running on
   CPU X indirectly released a few pages (e.g.
   see __free_one_page()); CPU X's
   'vm_stat_diff[NR_FREE_PAGES]' was updated and the
   zone-specific 'vm_stat[]' update was deferred as per the
   CPU-specific stat threshold

4. Task A invoked exit(2) and the kernel removed the task
   from the run-queue; the idle task was selected to
   execute next since there were no other runnable tasks
   assigned to the given CPU (see pick_next_task() and
   pick_next_task_idle())

5. On return to the idle loop, since the idle tick was
   already stopped and can remain so (see [1] below) e.g.
   no pending soft IRQs, no attempt is made to zero and
   fold CPU X's vmstat counters since reprogramming of the
   scheduling-clock tick is not required (see [2])

Kind regards,

-- 
Aaron Tomlin