On 2024-06-19 at 00:03:30 +0530, K Prateek Nayak wrote: > Hello Chenyu, > > On 6/18/2024 1:19 PM, Chen Yu wrote: > > [..snip..] > > > > > > > > > > > > > Vincent [5] pointed out a case where the idle load kick will fail to > > > > > > > run on an idle CPU since the IPI handler launching the ILB will check > > > > > > > for need_resched(). In such cases, the idle CPU relies on > > > > > > > newidle_balance() to pull tasks towards itself. > > > > > > > > > > > > Is this the need_resched() in _nohz_idle_balance() ? Should we change > > > > > > this to 'need_resched() && (rq->nr_running || rq->ttwu_pending)' or > > > > > > something long those lines? > > > > > > > > > > It's not only this but also in do_idle() as well which exits the loop > > > > > to look for tasks to schedule > > > > > > > > > > > > > > > > > I mean, it's fairly trivial to figure out if there really is going to be > > > > > > work there. > > > > > > > > > > > > > Using an alternate flag instead of NEED_RESCHED to indicate a pending > > > > > > > IPI was suggested as the correct approach to solve this problem on the > > > > > > > same thread. > > > > > > > > > > > > So adding per-arch changes for this seems like something we shouldn't > > > > > > unless there really is no other sane options. > > > > > > > > > > > > That is, I really think we should start with something like the below > > > > > > and then fix any fallout from that. > > > > > > > > > > The main problem is that need_resched becomes somewhat meaningless > > > > > because it doesn't only mean "I need to resched a task" and we have > > > > > to add more tests around even for those not using polling > > > > > > > > > > > > > > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > > > > > index 0935f9d4bb7b..cfa45338ae97 100644 > > > > > > --- a/kernel/sched/core.c > > > > > > +++ b/kernel/sched/core.c > > > > > > @@ -5799,7 +5800,7 @@ static inline struct task_struct * > > > > > > __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) > > > > > > { > > > > > > const struct sched_class *class; > > > > > > - struct task_struct *p; > > > > > > + struct task_struct *p = NULL; > > > > > > > > > > > > /* > > > > > > * Optimization: we know that if all tasks are in the fair class we can > > > > > > @@ -5810,9 +5811,11 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) > > > > > > if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) && > > > > > > rq->nr_running == rq->cfs.h_nr_running)) { > > > > > > > > > > > > - p = pick_next_task_fair(rq, prev, rf); > > > > > > - if (unlikely(p == RETRY_TASK)) > > > > > > - goto restart; > > > > > > + if (rq->nr_running) { > > > > > > > > > > How do you make the diff between a spurious need_resched() because of > > > > > polling and a cpu becoming idle ? isn't rq->nr_running null in both > > > > > cases ? > > > > > In the later case, we need to call sched_balance_newidle() but not in the former > > > > > > > > > > > > > Not sure if I understand correctly, if the goal of smp_call_function_single() is to > > > > kick the idle CPU and do not force it to launch the schedule()->sched_balance_newidle(), > > > > can we set the _TIF_POLLING_NRFLAG rather than _TIF_NEED_RESCHED in set_nr_if_polling()? > > > > I think writing any value to the monitor address would wakeup the idle CPU. And _TIF_POLLING_NRFLAG > > > > will be cleared once that idle CPU exit the idle loop, so we don't introduce arch-wide flag. > > > Although this might work for MWAIT, there is no way for the generic idle > > > path to know if there is a pending interrupt within a TIF_POLLING_NRFLAG > > > section. do_idle() sets TIF_POLLING_NRFLAG and relies on a bunch of > > > need_resched() checks along the way to bail early until finally doing a > > > current_clr_polling_and_test() before handing off to the cpuidle driver > > > in call_cpuidle(). I believe this section will necessarily need the sender > > > to indicate a pending interrupt via TIF_NEED_RESCHED flag to enable the > > > early bail out before going into the cpuidle driver since this case cannot > > > be considered the same as a break from MWAIT. > > > > > > > I see, this is a good point. So you mean with only TIF_POLLING_NRFLAG there is > > possibility that the 'ipi kick CPU out of idle' is lost after the CPU enters > > do_idle() and before finally entering the idle state. While setting _TIF_NEED_RESCHED > > could help the do_idle() loop to detect pending request easier. > > Yup, that is correct. > > > BTW, before the > > commit b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()"), the > > lost of ipi after entering do_idle() and before entering driver idle state > > is also possible, right(the local irq is disabled)? > > From what I understand, the IPI remains pending until the interrupts > are enabled again. Before the optimization, the interrupts would be > disabled all the way until the instruction that is used to put the CPU > to sleep which is what __sti_mwait() and native_safe_halt() does. The > CPU would have received the IPI then and broke out of idle before > Peter's optimization went in. I see, once local irq is enabled, the pending ipi will be served. > There is an elaborate comment on this in > do_idle() function above the call to local_irq_disable(). In commit > edc8fc01f608 ("x86: Fix CPUIDLE_FLAG_IRQ_ENABLE leaking timer > reprogram") Peter describes a case of actually missing the break from > an interrupt as the driver enabled interrupts much earlier than > executing the sleep instruction. > Yup, the commit edc8fc01f608 deals with delay of the timer handling. If a timer queues the callback after local irq enabled and before mwait, the long sleep time after mwait might delay the handling of the callback. > Since the CPU was in TIF_POLLING_NRFLAG state, one could simply get away > by setting TIF_NEED_RESCHED and not sending an actual IPI which the > need_resched() checks in the idle path would catch and the > flush_smp_call_function_queue() on the exit path would have serviced the > call function. > > MWAIT with Interrupt Break extension (CPUID 0x5 ECX[IBE]) can break out > on pending interrupts even if interrupts are disabled which is why > "mwait_idle_with_hints()" now checks "ecx" to choose between "__mwait()" > and "__mwait_sti()". The APM describes the extension to "allows > interrupts to wake MWAIT, even when eFLAGS.IF = 0". (Vol. 3. > "General-Purpose and System Instructions", Chapter 4. "System Instruction > Reference", Section "MWAIT") > > I do hope someone corrects me if I'm wrong :) > You are right, and thanks for the description. thanks, Chenyu