Re: [PATCH v2 00/14] Introducing TIF_NOTIFY_IPI flag

Chen Yu <yu.c.chen@xxxxxxxxx> · Thu, 20 Jun 2024 15:30:36 +0800

On 2024-06-19 at 00:03:30 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 6/18/2024 1:19 PM, Chen Yu wrote:
> > [..snip..]
> > > > > > 
> > > > > > > Vincent [5] pointed out a case where the idle load kick will fail to
> > > > > > > run on an idle CPU since the IPI handler launching the ILB will check
> > > > > > > for need_resched(). In such cases, the idle CPU relies on
> > > > > > > newidle_balance() to pull tasks towards itself.
> > > > > > 
> > > > > > Is this the need_resched() in _nohz_idle_balance() ? Should we change
> > > > > > this to 'need_resched() && (rq->nr_running || rq->ttwu_pending)' or
> > > > > > something long those lines?
> > > > > 
> > > > > It's not only this but also in do_idle() as well which exits the loop
> > > > > to look for tasks to schedule
> > > > > 
> > > > > > 
> > > > > > I mean, it's fairly trivial to figure out if there really is going to be
> > > > > > work there.
> > > > > > 
> > > > > > > Using an alternate flag instead of NEED_RESCHED to indicate a pending
> > > > > > > IPI was suggested as the correct approach to solve this problem on the
> > > > > > > same thread.
> > > > > > 
> > > > > > So adding per-arch changes for this seems like something we shouldn't
> > > > > > unless there really is no other sane options.
> > > > > > 
> > > > > > That is, I really think we should start with something like the below
> > > > > > and then fix any fallout from that.
> > > > > 
> > > > > The main problem is that need_resched becomes somewhat meaningless
> > > > > because it doesn't  only mean "I need to resched a task" and we have
> > > > > to add more tests around even for those not using polling
> > > > > 
> > > > > > 
> > > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > > > index 0935f9d4bb7b..cfa45338ae97 100644
> > > > > > --- a/kernel/sched/core.c
> > > > > > +++ b/kernel/sched/core.c
> > > > > > @@ -5799,7 +5800,7 @@ static inline struct task_struct *
> > > > > >    __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > > > > >    {
> > > > > >           const struct sched_class *class;
> > > > > > -       struct task_struct *p;
> > > > > > +       struct task_struct *p = NULL;
> > > > > > 
> > > > > >           /*
> > > > > >            * Optimization: we know that if all tasks are in the fair class we can
> > > > > > @@ -5810,9 +5811,11 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > > > > >           if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&
> > > > > >                      rq->nr_running == rq->cfs.h_nr_running)) {
> > > > > > 
> > > > > > -               p = pick_next_task_fair(rq, prev, rf);
> > > > > > -               if (unlikely(p == RETRY_TASK))
> > > > > > -                       goto restart;
> > > > > > +               if (rq->nr_running) {
> > > > > 
> > > > > How do you make the diff between a spurious need_resched() because of
> > > > > polling and a cpu becoming idle ? isn't rq->nr_running null in both
> > > > > cases ?
> > > > > In the later case, we need to call sched_balance_newidle() but not in the former
> > > > > 
> > > > 
> > > > Not sure if I understand correctly, if the goal of smp_call_function_single() is to
> > > > kick the idle CPU and do not force it to launch the schedule()->sched_balance_newidle(),
> > > > can we set the _TIF_POLLING_NRFLAG rather than _TIF_NEED_RESCHED in set_nr_if_polling()?
> > > > I think writing any value to the monitor address would wakeup the idle CPU. And _TIF_POLLING_NRFLAG
> > > > will be cleared once that idle CPU exit the idle loop, so we don't introduce arch-wide flag.
> > > Although this might work for MWAIT, there is no way for the generic idle
> > > path to know if there is a pending interrupt within a TIF_POLLING_NRFLAG
> > > section. do_idle() sets TIF_POLLING_NRFLAG and relies on a bunch of
> > > need_resched() checks along the way to bail early until finally doing a
> > > current_clr_polling_and_test() before handing off to the cpuidle driver
> > > in call_cpuidle(). I believe this section will necessarily need the sender
> > > to indicate a pending interrupt via TIF_NEED_RESCHED flag to enable the
> > > early bail out before going into the cpuidle driver since this case cannot
> > > be considered the same as a break from MWAIT.
> > > 
> > 
> > I see, this is a good point. So you mean with only TIF_POLLING_NRFLAG there is
> > possibility that the 'ipi kick CPU out of idle' is lost after the CPU enters
> > do_idle() and before finally entering the idle state. While setting _TIF_NEED_RESCHED
> > could help the do_idle() loop to detect pending request easier.
> 
> Yup, that is correct.
> 
> > BTW, before the
> > commit b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()"), the
> > lost of ipi after entering do_idle() and before entering driver idle state
> > is also possible, right(the local irq is disabled)?
> 
> From what I understand, the IPI remains pending until the interrupts
> are enabled again. Before the optimization, the interrupts would be
> disabled all the way until the instruction that is used to put the CPU
> to sleep which is what __sti_mwait() and native_safe_halt() does. The
> CPU would have received the IPI then and broke out of idle before
> Peter's optimization went in.

I see, once local irq is enabled, the pending ipi will be served.

> There is an elaborate comment on this in
> do_idle() function above the call to local_irq_disable(). In  commit
> edc8fc01f608 ("x86: Fix CPUIDLE_FLAG_IRQ_ENABLE leaking timer
> reprogram") Peter describes a case of actually missing the break from
> an interrupt as the driver enabled interrupts much earlier than
> executing the sleep instruction.
>

Yup, the commit edc8fc01f608 deals with delay of the timer handling. If
a timer queues the callback after local irq enabled and before mwait,
the long sleep time after mwait might delay the handling of the callback.

> Since the CPU was in TIF_POLLING_NRFLAG state, one could simply get away
> by setting TIF_NEED_RESCHED and not sending an actual IPI which the
> need_resched() checks in the idle path would catch and the
> flush_smp_call_function_queue() on the exit path would have serviced the
> call function.
> 
> MWAIT with Interrupt Break extension (CPUID 0x5 ECX[IBE]) can break out
> on pending interrupts even if interrupts are disabled  which is why
> "mwait_idle_with_hints()" now checks "ecx" to choose between "__mwait()"
> and "__mwait_sti()". The APM describes the extension to "allows
> interrupts to wake MWAIT, even when eFLAGS.IF = 0". (Vol. 3.
> "General-Purpose and System Instructions", Chapter 4. "System Instruction
> Reference", Section "MWAIT")
> 
> I do hope someone corrects me if I'm wrong :)
>

You are right, and thanks for the description.

thanks,
Chenyu