On Fri, 14 Jun 2024 at 11:28, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Thu, Jun 13, 2024 at 06:15:59PM +0000, K Prateek Nayak wrote:
> > Effects of call_function_single_prep_ipi()
> > ==========================================
> >
> > To pull a TIF_POLLING thread out of idle to process an IPI, the sender
> > sets the TIF_NEED_RESCHED bit in the idle task's thread info in
> > call_function_single_prep_ipi() and avoids sending an actual IPI to the
> > target. As a result, the scheduler expects a task to be enqueued when
> > exiting the idle path. This is not the case with non-polling idle states
> > where the idle CPU exits the non-polling idle state to process the
> > interrupt, and since need_resched() returns false, soon goes back to
> > idle again.
> >
> > When the TIF_NEED_RESCHED flag is set, do_idle() will call
> > schedule_idle(), a large part of which runs with local IRQs disabled.
> > In the case of ipistorm, when measuring IPI throughput, this large
> > IRQ-disabled section delays processing of IPIs. Further auditing
> > revealed that in the absence of any runnable tasks,
> > pick_next_task_fair(), which is called from the pick_next_task()
> > fast path, will always call newidle_balance() in this scenario,
> > further increasing the time spent in the IRQ-disabled section.
> >
> > Following is a crude visualization of the problem with the relevant
> > functions expanded:
> > --
> > CPU0                                                CPU1
> > ====                                                ====
> >                                                     do_idle() {
> >                                                         __current_set_polling();
> >                                                         ...
> >                                                         monitor(addr);
> >                                                         if (!need_resched())
> >                                                             mwait() {
> >                                                                 /* Waiting */
> > smp_call_function_single(CPU1, func, wait = 1) {                ...
> >     ...                                                         ...
> >     set_nr_if_polling(CPU1) {                                   ...
> >         /* Realizes CPU1 is polling */                          ...
> >         try_cmpxchg(addr,                                       ...
> >                     &val,                                       ...
> >                     val | _TIF_NEED_RESCHED);                   ...
> >     } /* Does not send an IPI */                                ...
> >     ...                                                     } /* mwait exit due to write at addr */
> >     csd_lock_wait() {                                   }
> >     /* Waiting */                                       preempt_set_need_resched();
> >     ...                                                 __current_clr_polling();
> >     ...                                                 flush_smp_call_function_queue() {
> >     ...                                                     func();
> >     } /* End of wait */                                 }
> > }                                                       schedule_idle() {
> >                                                             ...
> >                                                             local_irq_disable();
> > smp_call_function_single(CPU1, func, wait = 1) {            ...
> >     ...                                                     ...
> >     arch_send_call_function_single_ipi(CPU1);               ...
> >                             \                               ...
> >                              \                              newidle_balance() {
> >                               \                                 ...
> >                           /* Delay */                           ...
> >                                \                            }
> >                                 \                           ...
> >                                  \--------------> local_irq_enable();
> >                                                           /* Processes the IPI */
> > --
> >
> >
> > Skipping newidle_balance()
> > ==========================
> >
> > In an earlier attempt to solve the challenge of the long IRQ-disabled
> > section, newidle_balance() was skipped when a CPU waking up from idle
> > was found to have no runnable tasks, and was transitioning back to
> > idle [2]. Tim [3] and David [4] pointed out that newidle_balance()
> > may be viable for CPUs that are idling with the tick enabled, where
> > newidle_balance() has the opportunity to pull tasks onto the idle CPU.
>
> I don't think we should be relying on this in any way, shape or form.
> NOHZ can kill that tick at any time.
>
> Also, semantically, calling newidle from the idle thread is just daft.
> You're really not newly idle in that case.
>
> > Vincent [5] pointed out a case where the idle load kick will fail to
> > run on an idle CPU since the IPI handler launching the ILB will check
> > for need_resched(). In such cases, the idle CPU relies on
> > newidle_balance() to pull tasks towards itself.
>
> Is this the need_resched() in _nohz_idle_balance()? Should we change
> this to 'need_resched() && (rq->nr_running || rq->ttwu_pending)' or
> something along those lines?
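
[ For illustration only, a minimal sketch of what such a change could
  look like. The helper rq_has_pending_work() is hypothetical, not a
  mainline function; the idea is to abort the nohz idle balance only
  when NEED_RESCHED reflects actual work on this CPU, rather than a
  bit set by set_nr_if_polling() purely to wake a polling idle CPU.
  Untested: ]

/*
 * Hypothetical helper: does this runqueue have real work pending,
 * i.e. a runnable task or a wakeup queued via the ttwu queue?
 */
static inline bool rq_has_pending_work(struct rq *rq)
{
	return rq->nr_running || READ_ONCE(rq->ttwu_pending);
}

	/* The early abort in _nohz_idle_balance() would then read: */
	if (need_resched() && rq_has_pending_work(this_rq))
		goto abort;
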
It's not only this one: there is also the need_resched() check in
do_idle(), which makes the CPU exit the idle loop to look for tasks
to schedule.

> I mean, it's fairly trivial to figure out if there really is going to
> be work there.
>
> > Using an alternate flag instead of NEED_RESCHED to indicate a pending
> > IPI was suggested as the correct approach to solve this problem on the
> > same thread.
>
> So adding per-arch changes for this seems like something we shouldn't
> do unless there really is no other sane option.
>
> That is, I really think we should start with something like the below
> and then fix any fallout from that.

The main problem is that need_resched() becomes somewhat meaningless,
because it no longer only means "I need to reschedule a task", and we
would have to add more tests around it, even for architectures that
don't use polling.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0935f9d4bb7b..cfa45338ae97 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5799,7 +5800,7 @@ static inline struct task_struct *
>  __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>         const struct sched_class *class;
> -       struct task_struct *p;
> +       struct task_struct *p = NULL;
>
>         /*
>          * Optimization: we know that if all tasks are in the fair class we can
>
> @@ -5810,9 +5811,11 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>         if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&
>                    rq->nr_running == rq->cfs.h_nr_running)) {
>
> -               p = pick_next_task_fair(rq, prev, rf);
> -               if (unlikely(p == RETRY_TASK))
> -                       goto restart;
> +               if (rq->nr_running) {

How do you tell the difference between a spurious need_resched()
caused by polling and a CPU actually becoming idle? Isn't
rq->nr_running zero in both cases? In the latter case we need to call
sched_balance_newidle(), but not in the former.

> +                       p = pick_next_task_fair(rq, prev, rf);
> +                       if (unlikely(p == RETRY_TASK))
> +                               goto restart;
> +               }
>
>                 /* Assume the next prioritized class is idle_sched_class */
>                 if (!p) {
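
[ For context, the overloading under discussion comes from
  set_nr_if_polling(), which looks roughly like the below in recent
  mainline kernel/sched/core.c: the sender sets TIF_NEED_RESCHED on a
  polling idle task instead of sending an IPI, so the bit no longer
  implies that a runnable task is waiting: ]

#ifdef TIF_POLLING_NRFLAG
/*
 * Atomically set TIF_NEED_RESCHED if TIF_POLLING_NRFLAG is set.
 * The write to ti->flags brings the target CPU out of mwait(),
 * so no IPI needs to be sent.
 */
static inline bool set_nr_if_polling(struct task_struct *p)
{
	struct thread_info *ti = task_thread_info(p);
	typeof(ti->flags) val = READ_ONCE(ti->flags);

	do {
		/* Target is not polling: the caller must send a real IPI */
		if (!(val & _TIF_POLLING_NRFLAG))
			return false;
		/* NEED_RESCHED already set: nothing more to do */
		if (val & _TIF_NEED_RESCHED)
			return true;
	} while (!try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED));

	return true;
}
#endif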