Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:

> On Tue, Nov 07, 2023 at 01:57:27PM -0800, Ankur Arora wrote:
>
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1027,13 +1027,13 @@ void wake_up_q(struct wake_q_head *head)
>>  }
>>
>>  /*
>> - * resched_curr - mark rq's current task 'to be rescheduled now'.
>> + * __resched_curr - mark rq's current task 'to be rescheduled'.
>>   *
>> - * On UP this means the setting of the need_resched flag, on SMP it
>> - * might also involve a cross-CPU call to trigger the scheduler on
>> - * the target CPU.
>> + * On UP this means the setting of the need_resched flag, on SMP, for
>> + * eager resched it might also involve a cross-CPU call to trigger
>> + * the scheduler on the target CPU.
>>   */
>> -void resched_curr(struct rq *rq)
>> +void __resched_curr(struct rq *rq, resched_t rs)
>>  {
>>  	struct task_struct *curr = rq->curr;
>>  	int cpu;
>> @@ -1046,17 +1046,77 @@ void resched_curr(struct rq *rq)
>>  	cpu = cpu_of(rq);
>>
>>  	if (cpu == smp_processor_id()) {
>> -		set_tsk_need_resched(curr, RESCHED_eager);
>> -		set_preempt_need_resched();
>> +		set_tsk_need_resched(curr, rs);
>> +		if (rs == RESCHED_eager)
>> +			set_preempt_need_resched();
>>  		return;
>>  	}
>>
>> -	if (set_nr_and_not_polling(curr, RESCHED_eager))
>> -		smp_send_reschedule(cpu);
>> -	else
>> +	if (set_nr_and_not_polling(curr, rs)) {
>> +		if (rs == RESCHED_eager)
>> +			smp_send_reschedule(cpu);
>
> I think you just broke things.
>
> Not all idle threads have POLLING support, in which case you need that
> IPI to wake them up, even if it's LAZY.

Yes, I was concerned about that too. But doesn't this check against
the idle_sched_class in resched_curr() cover that?

>> +	if (IS_ENABLED(CONFIG_PREEMPT) ||
>> +	    (rq->curr->sched_class == &idle_sched_class)) {
>> +		rs = RESCHED_eager;
>> +		goto resched;
>> +	} else if (rs == RESCHED_eager)
>>  		trace_sched_wake_idle_without_ipi(cpu);
>>  }
>
>
>
>>
>> +/*
>> + * resched_curr - mark rq's current task 'to be rescheduled' eagerly
>> + * or lazily according to the current policy.
>> + *
>> + * Always schedule eagerly, if:
>> + *
>> + * - running under full preemption
>> + *
>> + * - idle: when not polling (or if we don't have TIF_POLLING_NRFLAG)
>> + *   force TIF_NEED_RESCHED to be set and send a resched IPI.
>> + *   (the polling case has already set TIF_NEED_RESCHED via
>> + *   set_nr_if_polling()).
>> + *
>> + * - in userspace: run to completion semantics are only for kernel tasks
>> + *
>> + * Otherwise (regardless of priority), run to completion.
>> + */
>> +void resched_curr(struct rq *rq)
>> +{
>> +	resched_t rs = RESCHED_lazy;
>> +	int context;
>> +
>> +	if (IS_ENABLED(CONFIG_PREEMPT) ||
>> +	    (rq->curr->sched_class == &idle_sched_class)) {
>> +		rs = RESCHED_eager;
>> +		goto resched;
>> +	}
>> +
>> +	/*
>> +	 * We might race with the target CPU while checking its ct_state:
>> +	 *
>> +	 * 1. The task might have just entered the kernel, but has not yet
>> +	 * called user_exit(). We will see stale state (CONTEXT_USER) and
>> +	 * send an unnecessary resched-IPI.
>> +	 *
>> +	 * 2. The user task is through with exit_to_user_mode_loop() but has
>> +	 * not yet called user_enter().
>> +	 *
>> +	 * We'll see the thread's state as CONTEXT_KERNEL and will try to
>> +	 * schedule it lazily. There's obviously nothing that will handle
>> +	 * this need-resched bit until the thread enters the kernel next.
>> +	 *
>> +	 * The scheduler will still do tick accounting, but a potentially
>> +	 * higher priority task waited to be scheduled for a user tick,
>> +	 * instead of execution time in the kernel.
>> +	 */
>> +	context = ct_state_cpu(cpu_of(rq));
>> +	if ((context == CONTEXT_USER) ||
>> +	    (context == CONTEXT_GUEST)) {
>> +
>> +		rs = RESCHED_eager;
>> +		goto resched;
>> +	}
>
> Like said, this simply cannot be. You must not rely on the remote CPU
> being in some state or not. Also, it's racy, you could observe USER and
> then it enters KERNEL. Or worse.

We might observe KERNEL and it then enters USER.

I think we would be fine if we observe USER: the resched would be
upgraded to RESCHED_eager and we would send an unnecessary IPI.

But if we observe KERNEL and it enters USER, then we will have set the
need-resched-lazy bit which the thread might not see until its next
entry into the kernel (it might already have left
exit_to_user_mode_loop()).

But, yes, I would like to avoid using ct_state as well. The problem is
that need-resched-lazy only makes sense while the task on the runqueue
is executing in the kernel...

--
ankur
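
For reference, a small user-space model of the eager/lazy decision that
resched_curr() is making above. This is only a sketch: the RESCHED_* and
CONTEXT_* names are borrowed from the series, while struct target and its
fields are invented here to stand in for the rq and ct_state inputs; it
does not model the remote ct_state race itself.

	/*
	 * Standalone model (not the kernel code) of the eager/lazy policy
	 * discussed above.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	enum resched { RESCHED_lazy, RESCHED_eager };
	enum context { CONTEXT_KERNEL, CONTEXT_USER, CONTEXT_GUEST };

	struct target {
		bool full_preempt;	/* stands in for IS_ENABLED(CONFIG_PREEMPT) */
		bool curr_is_idle;	/* stands in for curr being in idle_sched_class */
		enum context ctx;	/* stands in for ct_state_cpu(); racy, as discussed */
	};

	/* Pick the resched flavour the way the resched_curr() comment describes. */
	static enum resched pick_resched(const struct target *t)
	{
		/* Full preemption and the idle task are always rescheduled eagerly. */
		if (t->full_preempt || t->curr_is_idle)
			return RESCHED_eager;

		/*
		 * A task in userspace (or a guest) won't notice the lazy bit
		 * until its next kernel entry, so it is upgraded to eager too.
		 */
		if (t->ctx == CONTEXT_USER || t->ctx == CONTEXT_GUEST)
			return RESCHED_eager;

		/* Otherwise kernel code runs to completion: mark it lazily. */
		return RESCHED_lazy;
	}

	int main(void)
	{
		struct target t = {
			.full_preempt = false,
			.curr_is_idle = false,
			.ctx = CONTEXT_KERNEL,
		};

		printf("kernel, no full preempt -> %s\n",
		       pick_resched(&t) == RESCHED_lazy ? "lazy" : "eager");

		t.ctx = CONTEXT_USER;
		printf("userspace -> %s\n",
		       pick_resched(&t) == RESCHED_lazy ? "lazy" : "eager");

		return 0;
	}

Running it prints "lazy" for the in-kernel case and "eager" for the
userspace case, which is just the policy table from the resched_curr()
comment above.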