On 2024-06-17 at 14:03:41 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 6/14/2024 10:01 PM, Chen Yu wrote:
> > On 2024-06-14 at 12:48:37 +0200, Vincent Guittot wrote:
> > > On Fri, 14 Jun 2024 at 11:28, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > > >
> > > > On Thu, Jun 13, 2024 at 06:15:59PM +0000, K Prateek Nayak wrote:
> > > > > Effects of call_function_single_prep_ipi()
> > > > > ==========================================
> > > > >
> > > > > To pull a TIF_POLLING thread out of idle to process an IPI, the sender
> > > > > sets the TIF_NEED_RESCHED bit in the idle task's thread info in
> > > > > call_function_single_prep_ipi() and avoids sending an actual IPI to the
> > > > > target. As a result, the scheduler expects a task to be enqueued when
> > > > > exiting the idle path. This is not the case with non-polling idle states
> > > > > where the idle CPU exits the non-polling idle state to process the
> > > > > interrupt, and since need_resched() returns false, soon goes back to
> > > > > idle again.
> > > > >
> > > > > When the TIF_NEED_RESCHED flag is set, do_idle() will call
> > > > > schedule_idle(), a large part of which runs with local IRQ disabled.
> > > > > In case of ipistorm, when measuring IPI throughput, this large IRQ
> > > > > disabled section delays processing of IPIs. Further auditing revealed
> > > > > that in the absence of any runnable tasks, pick_next_task_fair(),
> > > > > which is called from the pick_next_task() fast path, will always call
> > > > > newidle_balance() in this scenario, further increasing the time spent
> > > > > in the IRQ disabled section.
> > > > >
> > > > > Following is a crude visualization of the problem with the relevant
> > > > > functions expanded:
> > > > > --
> > > > > CPU0                                                    CPU1
> > > > > ====                                                    ====
> > > > >                                                         do_idle() {
> > > > >                                                                 __current_set_polling();
> > > > >                                                                 ...
> > > > >                                                                 monitor(addr);
> > > > >                                                                 if (!need_resched())
> > > > >                                                                         mwait() {
> > > > >                                                                         /* Waiting */
> > > > > smp_call_function_single(CPU1, func, wait = 1) {                        ...
> > > > >         ...                                                             ...
> > > > >         set_nr_if_polling(CPU1) {                                       ...
> > > > >                 /* Realizes CPU1 is polling */                          ...
> > > > >                 try_cmpxchg(addr,                                       ...
> > > > >                             &val,                                       ...
> > > > >                             val | _TIF_NEED_RESCHED);                   ...
> > > > >         } /* Does not send an IPI */                                    ...
> > > > >         ...                                                     } /* mwait exit due to write at addr */
> > > > >         csd_lock_wait() {                               }
> > > > >         /* Waiting */                                   preempt_set_need_resched();
> > > > >         ...                                             __current_clr_polling();
> > > > >         ...                                             flush_smp_call_function_queue() {
> > > > >         ...                                                     func();
> > > > >         } /* End of wait */                             }
> > > > > }                                                       schedule_idle() {
> > > > >                                                                 ...
> > > > >                                                                 local_irq_disable();
> > > > > smp_call_function_single(CPU1, func, wait = 1) {                ...
> > > > >         ...                                                     ...
> > > > >         arch_send_call_function_single_ipi(CPU1);               ...
> > > > >                                                 \               ...
> > > > >                                                  \              newidle_balance() {
> > > > >                                                   \                     ...
> > > > >                                     /* Delay */    \                    ...
> > > > >                                                     \           }
> > > > >                                                      \          ...
> > > > >                                                       \--------> local_irq_enable();
> > > > >                                                                 /* Processes the IPI */
> > > > > --
> > > > >
> > > > >
> > > > > Skipping newidle_balance()
> > > > > ==========================
> > > > >
> > > > > In an earlier attempt to solve the challenge of the long IRQ disabled
> > > > > section, newidle_balance() was skipped when a CPU waking up from idle
> > > > > was found to have no runnable tasks, and was transitioning back to
> > > > > idle [2]. Tim [3] and David [4] had pointed out that newidle_balance()
> > > > > may be viable for CPUs that are idling with the tick enabled, where
> > > > > newidle_balance() has the opportunity to pull tasks onto the idle CPU.
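As background for the diagram above: the polling kick boils down to
roughly the following (a condensed sketch of set_nr_if_polling() from
kernel/sched/core.c with comments added; illustrative rather than a
verbatim copy):

--
static bool set_nr_if_polling(struct task_struct *p)
{
        struct thread_info *ti = task_thread_info(p);
        typeof(ti->flags) val = READ_ONCE(ti->flags);

        do {
                /* Target is not polling: a real IPI is required */
                if (!(val & _TIF_POLLING_NRFLAG))
                        return false;
                /* Already marked to reschedule: nothing more to do */
                if (val & _TIF_NEED_RESCHED)
                        return true;
                /* This store to ti->flags is what breaks MWAIT out of idle */
        } while (!try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED));

        return true;
}
--

A false return here is what makes the sender fall back to sending an
actual IPI.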
> > > >
> > > > I don't think we should be relying on this in any way, shape, or form.
> > > > NOHZ can kill that tick at any time.
> > > >
> > > > Also, semantically, calling newidle from the idle thread is just daft.
> > > > You're really not newly idle in that case.
> > > >
> > > > > Vincent [5] pointed out a case where the idle load kick will fail to
> > > > > run on an idle CPU since the IPI handler launching the ILB will check
> > > > > for need_resched(). In such cases, the idle CPU relies on
> > > > > newidle_balance() to pull tasks towards itself.
> > > >
> > > > Is this the need_resched() in _nohz_idle_balance()? Should we change
> > > > this to 'need_resched() && (rq->nr_running || rq->ttwu_pending)' or
> > > > something along those lines?
> > >
> > > It's not only this one, but also the check in do_idle() which exits
> > > the loop to look for tasks to schedule.
> > >
> > > >
> > > > I mean, it's fairly trivial to figure out if there really is going to be
> > > > work there.
> > > >
> > > > > Using an alternate flag instead of NEED_RESCHED to indicate a pending
> > > > > IPI was suggested as the correct approach to solve this problem on the
> > > > > same thread.
> > > >
> > > > So adding per-arch changes for this seems like something we shouldn't
> > > > do unless there really are no other sane options.
> > > >
> > > > That is, I really think we should start with something like the below
> > > > and then fix any fallout from that.
> > >
> > > The main problem is that need_resched() becomes somewhat meaningless,
> > > because it no longer only means "I need to resched a task", and we
> > > have to add more tests around it even for those not using polling.
> > >
> > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > index 0935f9d4bb7b..cfa45338ae97 100644
> > > > --- a/kernel/sched/core.c
> > > > +++ b/kernel/sched/core.c
> > > > @@ -5799,7 +5800,7 @@ static inline struct task_struct *
> > > >  __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > > >  {
> > > >          const struct sched_class *class;
> > > > -        struct task_struct *p;
> > > > +        struct task_struct *p = NULL;
> > > >
> > > >          /*
> > > >           * Optimization: we know that if all tasks are in the fair class we can
> > > > @@ -5810,9 +5811,11 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > > >          if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&
> > > >                     rq->nr_running == rq->cfs.h_nr_running)) {
> > > >
> > > > -                p = pick_next_task_fair(rq, prev, rf);
> > > > -                if (unlikely(p == RETRY_TASK))
> > > > -                        goto restart;
> > > > +                if (rq->nr_running) {
> > >
> > > How do you tell the difference between a spurious need_resched()
> > > because of polling and a CPU becoming idle? Isn't rq->nr_running zero
> > > in both cases? In the latter case we need to call
> > > sched_balance_newidle(), but not in the former.
> > >
> >
> > Not sure if I understand correctly: if the goal of
> > smp_call_function_single() is to kick the idle CPU without forcing it
> > to go through schedule()->sched_balance_newidle(), can we set
> > _TIF_POLLING_NRFLAG rather than _TIF_NEED_RESCHED in
> > set_nr_if_polling()? I think writing any value to the monitored
> > address would wake up the idle CPU, and _TIF_POLLING_NRFLAG will be
> > cleared once that idle CPU exits the idle loop, so we don't introduce
> > an arch-wide flag.
>
> Although this might work for MWAIT, there is no way for the generic
> idle path to know if there is a pending interrupt within a
> TIF_POLLING_NRFLAG section.
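For context, the final gate Prateek describes next looks roughly like
this (a condensed sketch of call_cpuidle() from kernel/sched/idle.c;
the comments are mine and some details are elided):

--
static int call_cpuidle(struct cpuidle_driver *drv,
                        struct cpuidle_device *dev, int next_state)
{
        /*
         * Last chance to notice a pending kick: atomically clear
         * TIF_POLLING_NRFLAG and test TIF_NEED_RESCHED. Only a sender
         * that set TIF_NEED_RESCHED (e.g. via set_nr_if_polling()) is
         * visible to the generic code here.
         */
        if (current_clr_polling_and_test()) {
                dev->last_residency_ns = 0;
                local_irq_enable();
                return -EBUSY;
        }

        /* Enter the idle state previously selected by the governor */
        return cpuidle_enter(drv, dev, next_state);
}
--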
> do_idle() sets TIF_POLLING_NRFLAG and relies on a bunch of
> need_resched() checks along the way to bail early until finally doing a
> current_clr_polling_and_test() before handing off to the cpuidle driver
> in call_cpuidle(). I believe this section will necessarily need the
> sender to indicate a pending interrupt via the TIF_NEED_RESCHED flag to
> enable the early bail out before going into the cpuidle driver, since
> this case cannot be considered the same as a break from MWAIT.
>
I see, this is a good point. So you mean that with only
TIF_POLLING_NRFLAG there is a possibility that the 'IPI kicks the CPU
out of idle' event is lost after the CPU enters do_idle() and before it
finally enters the idle state, while setting _TIF_NEED_RESCHED helps the
do_idle() loop detect the pending request more easily. BTW, before
commit b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()"),
losing the IPI after entering do_idle() and before entering the driver's
idle state was also possible, right (since local irq is disabled)?

> On x86, there seems to be a possibility of missing an interrupt if
> someone writes _TIF_POLLING_NRFLAG to thread info between the target
> executing MONITOR and MWAIT. AMD64 Architecture Programmer's Manual
> Volume 3: "General-Purpose and System Instructions", Chapter 4. "System
> Instruction Reference", section "MWAIT" carries the following note in
> the coding requirements:
>
> "MWAIT must be conditionally executed only if the awaited store has not
> already occurred. (This prevents a race condition between the MONITOR
> instruction arming the monitoring hardware and the store intended to
> trigger the monitoring hardware.)"
>
> There exists a similar note in the "Example" section for "MWAIT" in
> Intel 64 and IA-32 Architectures Software Developer's Manual, Vol 2B,
> Chapter 4.3 "Instructions (M-U)".

Thanks for the detailed explanation of this race condition.
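As an aside, the conditional execution the manuals require is visible in
how the x86 idle code re-checks the flags between MONITOR and MWAIT;
roughly (a simplified sketch along the lines of mwait_idle_with_hints()
in arch/x86/include/asm/mwait.h, with barriers and errata workarounds
elided):

--
static __always_inline void mwait_idle_with_hints(unsigned long eax,
                                                  unsigned long ecx)
{
        /* Arm the monitor on this task's thread flags */
        __monitor((void *)&current_thread_info()->flags, 0, 0);

        /*
         * Execute MWAIT only if the awaited store has not already
         * happened; a flag write racing with MONITOR is caught by this
         * re-check instead of being lost.
         */
        if (!need_resched())
                __mwait(eax, ecx);
}
--

thanks,
Chenyu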