On 2024-06-14 at 12:48:37 +0200, Vincent Guittot wrote:
> On Fri, 14 Jun 2024 at 11:28, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Thu, Jun 13, 2024 at 06:15:59PM +0000, K Prateek Nayak wrote:
> > > Effects of call_function_single_prep_ipi()
> > > ==========================================
> > >
> > > To pull a TIF_POLLING thread out of idle to process an IPI, the sender
> > > sets the TIF_NEED_RESCHED bit in the idle task's thread info in
> > > call_function_single_prep_ipi() and avoids sending an actual IPI to the
> > > target. As a result, the scheduler expects a task to be enqueued when
> > > exiting the idle path. This is not the case with non-polling idle states
> > > where the idle CPU exits the non-polling idle state to process the
> > > interrupt, and since need_resched() returns false, soon goes back to
> > > idle again.
> > >
> > > When TIF_NEED_RESCHED flag is set, do_idle() will call schedule_idle(),
> > > a large part of which runs with local IRQ disabled. In case of ipistorm,
> > > when measuring IPI throughput, this large IRQ disabled section delays
> > > processing of IPIs. Further auditing revealed that in absence of any
> > > runnable tasks, pick_next_task_fair(), which is called from the
> > > pick_next_task() fast path, will always call newidle_balance() in this
> > > scenario, further increasing the time spent in the IRQ disabled section.
> > >
> > > Following is the crude visualization of the problem with relevant
> > > functions expanded:
> > > --
> > > CPU0                                                    CPU1
> > > ====                                                    ====
> > >                                                         do_idle() {
> > >                                                                 __current_set_polling();
> > >                                                                 ...
> > >                                                                 monitor(addr);
> > >                                                                 if (!need_resched())
> > >                                                                         mwait() {
> > >                                                                         /* Waiting */
> > > smp_call_function_single(CPU1, func, wait = 1) {                        ...
> > >         ...                                                             ...
> > >         set_nr_if_polling(CPU1) {                                       ...
> > >                 /* Realizes CPU1 is polling */                          ...
> > >                 try_cmpxchg(addr,                                       ...
> > >                             &val,                                       ...
> > >                             val | _TIF_NEED_RESCHED);                   ...
> > >         } /* Does not send an IPI */                                    ...
> > >         ...                                                     } /* mwait exit due to write at addr */
> > >         csd_lock_wait() {                                       }
> > >         /* Waiting */                                           preempt_set_need_resched();
> > >         ...                                                     __current_clr_polling();
> > >         ...                                                     flush_smp_call_function_queue() {
> > >         ...                                                             func();
> > >         } /* End of wait */                                     }
> > > }                                                       schedule_idle() {
> > >                                                                 ...
> > >                                                                 local_irq_disable();
> > > smp_call_function_single(CPU1, func, wait = 1) {                ...
> > >         ...                                                     ...
> > >         arch_send_call_function_single_ipi(CPU1);               ...
> > >                         \                                       ...
> > >                          \                              newidle_balance() {
> > >                           \                                     ...
> > >         /* Delay */                                             ...
> > >                            \                            }
> > >                             \                                   ...
> > >                              \-------------->           local_irq_enable();
> > >                                                          /* Processes the IPI */
> > > --
> > >
> > >
> > > Skipping newidle_balance()
> > > ==========================
> > >
> > > In an earlier attempt to solve the challenge of the long IRQ disabled
> > > section, newidle_balance() was skipped when a CPU waking up from idle
> > > was found to have no runnable tasks, and was transitioning back to
> > > idle [2]. Tim [3] and David [4] had pointed out that newidle_balance()
> > > may be viable for CPUs that are idling with tick enabled, where the
> > > newidle_balance() has the opportunity to pull tasks onto the idle CPU.
> > I don't think we should be relying on this in any way shape or form.
> > NOHZ can kill that tick at any time.
> >
> > Also, semantically, calling newidle from the idle thread is just daft.
> > You're really not newly idle in that case.
> >
> > > Vincent [5] pointed out a case where the idle load kick will fail to
> > > run on an idle CPU since the IPI handler launching the ILB will check
> > > for need_resched(). In such cases, the idle CPU relies on
> > > newidle_balance() to pull tasks towards itself.
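
As an aside, to keep the mechanism being discussed concrete: the reason
need_resched() can already be true on the idle CPU without any task having
been enqueued is the sender-side path described above. Below is my rough
paraphrase of that path. It is only a sketch, not a verbatim copy of
kernel/sched/core.c, and the _sketch suffix is just there to mark it as such:

/*
 * Sender side sketch: if the remote CPU's idle task still has
 * _TIF_POLLING_NRFLAG set, set _TIF_NEED_RESCHED on it instead of sending
 * a real IPI; the store to the monitored thread_info::flags word is what
 * breaks the target out of mwait().
 */
static bool set_nr_if_polling_sketch(struct task_struct *p)
{
	struct thread_info *ti = task_thread_info(p);
	typeof(ti->flags) val = READ_ONCE(ti->flags);

	for (;;) {
		if (!(val & _TIF_POLLING_NRFLAG))
			return false;		/* not polling: send a real IPI */
		if (val & _TIF_NEED_RESCHED)
			return true;		/* someone already woke it */
		if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED))
			return true;		/* woken without an IPI */
	}
}

So the woken CPU sees need_resched() == true and goes through schedule_idle()
even though nothing was enqueued, which is what both the newidle_balance()
case and the ILB-kick case above trip over.
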
> >
> > Is this the need_resched() in _nohz_idle_balance() ? Should we change
> > this to 'need_resched() && (rq->nr_running || rq->ttwu_pending)' or
> > something long those lines?
>
> It's not only this but also in do_idle() as well which exits the loop
> to look for tasks to schedule
>
> >
> > I mean, it's fairly trivial to figure out if there really is going to be
> > work there.
> >
> > > Using an alternate flag instead of NEED_RESCHED to indicate a pending
> > > IPI was suggested as the correct approach to solve this problem on the
> > > same thread.
> >
> > So adding per-arch changes for this seems like something we shouldn't
> > unless there really is no other sane options.
> >
> > That is, I really think we should start with something like the below
> > and then fix any fallout from that.
>
> The main problem is that need_resched becomes somewhat meaningless
> because it doesn't only mean "I need to resched a task" and we have
> to add more tests around even for those not using polling
>
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 0935f9d4bb7b..cfa45338ae97 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5799,7 +5800,7 @@ static inline struct task_struct *
> > __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > {
> > 	const struct sched_class *class;
> > -	struct task_struct *p;
> > +	struct task_struct *p = NULL;
> >
> > 	/*
> > 	 * Optimization: we know that if all tasks are in the fair class we can
> > @@ -5810,9 +5811,11 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > 	if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&
> > 		   rq->nr_running == rq->cfs.h_nr_running)) {
> >
> > -		p = pick_next_task_fair(rq, prev, rf);
> > -		if (unlikely(p == RETRY_TASK))
> > -			goto restart;
> > +		if (rq->nr_running) {
> How do you make the diff between a spurious need_resched() because of
> polling and a cpu becoming idle ? isn't rq->nr_running null in both
> cases ?
> In the later case, we need to call sched_balance_newidle() but not in the former
>
Not sure if I understand correctly: if the goal of smp_call_function_single()
is to kick the idle CPU without forcing it to go through
schedule()->sched_balance_newidle(), can we set _TIF_POLLING_NRFLAG rather
than _TIF_NEED_RESCHED in set_nr_if_polling()? I think writing any value to
the monitored address would wake up the idle CPU, and _TIF_POLLING_NRFLAG
will be cleared once that idle CPU exits the idle loop, so we don't need to
introduce an arch-wide flag. (A rough, untested sketch of what I mean is at
the end of this mail.)

thanks,
Chenyu

> > +			p = pick_next_task_fair(rq, prev, rf);
> > +			if (unlikely(p == RETRY_TASK))
> > +				goto restart;
> > +		}
> >
> > 		/* Assume the next prioritized class is idle_sched_class */
> > 		if (!p) {
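
To make my suggestion above a bit more concrete, here is a rough, untested
sketch of what I have in mind. All names here are made up
(kick_if_polling_sketch() is not an existing kernel function), and this
deliberately ignores how the queued CSD then gets flushed on that path,
which would need checking:

/*
 * Idea: wake a CPU that is in a _TIF_POLLING_NRFLAG idle state by storing
 * to the monitored thread_info::flags word *without* setting
 * _TIF_NEED_RESCHED, so the wakeup does not drag the idle task through
 * schedule_idle()/sched_balance_newidle().
 */
static bool kick_if_polling_sketch(struct task_struct *p)
{
	struct thread_info *ti = task_thread_info(p);
	typeof(ti->flags) val = READ_ONCE(ti->flags);

	for (;;) {
		if (!(val & _TIF_POLLING_NRFLAG))
			return false;		/* not polling idle: send a real IPI */
		/*
		 * Writing back the already-set polling bit does not change
		 * the flags, but the locked cmpxchg is still a store to the
		 * monitored address, which I expect to break mwait() while
		 * leaving need_resched() false.
		 */
		if (try_cmpxchg(&ti->flags, &val, val | _TIF_POLLING_NRFLAG))
			return true;
	}
}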