Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:

> On Tue, Nov 07, 2023 at 01:57:27PM -0800, Ankur Arora wrote:
>
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1027,13 +1027,13 @@ void wake_up_q(struct wake_q_head *head)
>>  }
>>
>>  /*
>> - * resched_curr - mark rq's current task 'to be rescheduled now'.
>> + * __resched_curr - mark rq's current task 'to be rescheduled'.
>>   *
>> - * On UP this means the setting of the need_resched flag, on SMP it
>> - * might also involve a cross-CPU call to trigger the scheduler on
>> - * the target CPU.
>> + * On UP this means the setting of the need_resched flag, on SMP, for
>> + * eager resched it might also involve a cross-CPU call to trigger
>> + * the scheduler on the target CPU.
>>   */
>> -void resched_curr(struct rq *rq)
>> +void __resched_curr(struct rq *rq, resched_t rs)
>>  {
>>  	struct task_struct *curr = rq->curr;
>>  	int cpu;
>> @@ -1046,17 +1046,77 @@ void resched_curr(struct rq *rq)
>>  	cpu = cpu_of(rq);
>>
>>  	if (cpu == smp_processor_id()) {
>> -		set_tsk_need_resched(curr, RESCHED_eager);
>> -		set_preempt_need_resched();
>> +		set_tsk_need_resched(curr, rs);
>> +		if (rs == RESCHED_eager)
>> +			set_preempt_need_resched();
>>  		return;
>>  	}
>>
>> -	if (set_nr_and_not_polling(curr, RESCHED_eager))
>> -		smp_send_reschedule(cpu);
>> -	else
>> +	if (set_nr_and_not_polling(curr, rs)) {
>> +		if (rs == RESCHED_eager)
>> +			smp_send_reschedule(cpu);
>
> I think you just broke things.
>
> Not all idle threads have POLLING support, in which case you need that
> IPI to wake them up, even if it's LAZY.

Yes, I was concerned about that too. But doesn't this check against
the idle_sched_class in resched_curr() cover that?

>> +	if (IS_ENABLED(CONFIG_PREEMPT) ||
>> +	    (rq->curr->sched_class == &idle_sched_class)) {
>> +		rs = RESCHED_eager;
>> +		goto resched;
>> +	} else if (rs == RESCHED_eager)
>>  		trace_sched_wake_idle_without_ipi(cpu);
>>  }
>
>
>
>>
>> +/*
>> + * resched_curr - mark rq's current task 'to be rescheduled' eagerly
>> + * or lazily according to the current policy.
>> + *
>> + * Always schedule eagerly, if:
>> + *
>> + * - running under full preemption
>> + *
>> + * - idle: when not polling (or if we don't have TIF_POLLING_NRFLAG)
>> + *   force TIF_NEED_RESCHED to be set and send a resched IPI.
>> + *   (the polling case has already set TIF_NEED_RESCHED via
>> + *   set_nr_if_polling()).
>> + *
>> + * - in userspace: run to completion semantics are only for kernel tasks
>> + *
>> + * Otherwise (regardless of priority), run to completion.
>> + */
>> +void resched_curr(struct rq *rq)
>> +{
>> +	resched_t rs = RESCHED_lazy;
>> +	int context;
>> +
>> +	if (IS_ENABLED(CONFIG_PREEMPT) ||
>> +	    (rq->curr->sched_class == &idle_sched_class)) {
>> +		rs = RESCHED_eager;
>> +		goto resched;
>> +	}
>> +
>> +	/*
>> +	 * We might race with the target CPU while checking its ct_state:
>> +	 *
>> +	 * 1. The task might have just entered the kernel, but has not yet
>> +	 * called user_exit(). We will see stale state (CONTEXT_USER) and
>> +	 * send an unnecessary resched-IPI.
>> +	 *
>> +	 * 2. The user task is through with exit_to_user_mode_loop() but has
>> +	 * not yet called user_enter().
>> +	 *
>> +	 * We'll see the thread's state as CONTEXT_KERNEL and will try to
>> +	 * schedule it lazily. There's obviously nothing that will handle
>> +	 * this need-resched bit until the thread enters the kernel next.
>> +	 *
>> +	 * The scheduler will still do tick accounting, but a potentially
>> +	 * higher priority task waited to be scheduled for a user tick,
>> +	 * instead of execution time in the kernel.
>> +	 */
>> +	context = ct_state_cpu(cpu_of(rq));
>> +	if ((context == CONTEXT_USER) ||
>> +	    (context == CONTEXT_GUEST)) {
>> +
>> +		rs = RESCHED_eager;
>> +		goto resched;
>> +	}
>
> Like said, this simply cannot be. You must not rely on the remote CPU
> being in some state or not. Also, it's racy, you could observe USER and
> then it enters KERNEL. Or worse.

We might observe KERNEL and it then enters USER.

I think we would be fine if we observe USER: the resched would be
upgraded to RESCHED_eager and we would send an unnecessary IPI.

But if we observe KERNEL and it enters USER, then we will have set the
need-resched-lazy bit which the thread might not see until its next
entry into the kernel (it might already have left
exit_to_user_mode_loop()).

But, yes, I would like to avoid using ct_state as well. The problem is
that need-resched-lazy only makes sense while the task on the runqueue
is executing in the kernel...

--
ankur
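
For reference, a small user-space model of the eager/lazy decision that
resched_curr() is making above. This is only a sketch: the RESCHED_* and
CONTEXT_* names are borrowed from the series, while struct target and its
fields are invented here to stand in for the rq and ct_state inputs; it
does not model the remote ct_state race itself.

	/*
	 * Standalone model (not the kernel code) of the eager/lazy policy
	 * discussed above.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	enum resched { RESCHED_lazy, RESCHED_eager };
	enum context { CONTEXT_KERNEL, CONTEXT_USER, CONTEXT_GUEST };

	struct target {
		bool full_preempt;	/* stands in for IS_ENABLED(CONFIG_PREEMPT) */
		bool curr_is_idle;	/* stands in for curr being in idle_sched_class */
		enum context ctx;	/* stands in for ct_state_cpu(); racy, as discussed */
	};

	/* Pick the resched flavour the way the resched_curr() comment describes. */
	static enum resched pick_resched(const struct target *t)
	{
		/* Full preemption and the idle task are always rescheduled eagerly. */
		if (t->full_preempt || t->curr_is_idle)
			return RESCHED_eager;

		/*
		 * A task in userspace (or a guest) won't notice the lazy bit
		 * until its next kernel entry, so it is upgraded to eager too.
		 */
		if (t->ctx == CONTEXT_USER || t->ctx == CONTEXT_GUEST)
			return RESCHED_eager;

		/* Otherwise kernel code runs to completion: mark it lazily. */
		return RESCHED_lazy;
	}

	int main(void)
	{
		struct target t = {
			.full_preempt = false,
			.curr_is_idle = false,
			.ctx = CONTEXT_KERNEL,
		};

		printf("kernel, no full preempt -> %s\n",
		       pick_resched(&t) == RESCHED_lazy ? "lazy" : "eager");

		t.ctx = CONTEXT_USER;
		printf("userspace -> %s\n",
		       pick_resched(&t) == RESCHED_lazy ? "lazy" : "eager");

		return 0;
	}

Running it prints "lazy" for the in-kernel case and "eager" for the
userspace case, which is just the policy table from the resched_curr()
comment above.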