Re: [PATCH 14/31] sched_ext: Implement BPF extensible scheduler class

Tejun Heo <tj@xxxxxxxxxx> · Mon, 12 Dec 2022 11:33:12 -1000

Hello,

On Mon, Dec 12, 2022 at 01:53:55PM +0100, Peter Zijlstra wrote:
> Perhaps the saner way to write this is:
> 
> #ifdef CONFIG_SMP
> 	BUG_ON(!sched_class_above(&stop_sched_class, &dl_sched_class));
> #endif
> 	BUG_ON(!sched_class_above(&dl_sched_class, &rt_sched_class));
> 	BUG_ON(!sched_class_above(&rt_sched_class, &fair_sched_class));
> 	BUG_ON(!sched_class_above(&fair_sched_class, &idle_sched_class));
> #ifdef CONFIG_...
> 	BUG_ON(!sched_class_above(&fair_sched_class, &ext_sched_class));
> 	BUG_ON(!sched_class_above(&ext_sched_class, &idle_sched_class));
> #endif

Yeah, this looks way better. Will update.

> > +static inline const struct sched_class *next_active_class(const struct sched_class *class)
> > +{
> > +	class++;
> > +	if (!scx_enabled() && class == &ext_sched_class)
> > +		class++;
> > +	return class;
> > +}
> > +
> > +#define for_active_class_range(class, _from, _to)				\
> > +	for (class = (_from); class != (_to); class = next_active_class(class))
> > +
> > +#define for_each_active_class(class)						\
> > +	for_active_class_range(class, __sched_class_highest, __sched_class_lowest)
> > +
> > +/*
> > + * SCX requires a balance() call before every pick_next_task() call including
> > + * when waking up from idle.
> > + */
> > +#define for_balance_class_range(class, prev_class, end_class)			\
> > +	for_active_class_range(class, (prev_class) > &ext_sched_class ?		\
> > +			       &ext_sched_class : (prev_class), (end_class))
> > +
> 
> This seems quite insane; why not simply make the ext methods effective
> no-ops? Both balance and pick explicitly support that already, no?

Yeah, we can definitely do that. It's just nice to guarantee from the core
code that we aren't calling into a sched class which isn't enabled at the
moment. If you take a look at "[PATCH 20/31] sched_ext: Allow BPF schedulers
to switch all eligible tasks into sched_ext", it's used the other way around
too to elide calling into CFS when it knows that there are no tasks there.
If the core code doesn't elide the calls, we might need some gating in CFS
too.

> > @@ -5800,10 +5812,13 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
> >  	 * We can terminate the balance pass as soon as we know there is
> >  	 * a runnable task of @class priority or higher.
> >  	 */
> > -	for_class_range(class, prev->sched_class, &idle_sched_class) {
> > +	for_balance_class_range(class, prev->sched_class, &idle_sched_class) {
> >  		if (class->balance(rq, prev, rf))
> >  			break;
> >  	}
> > +#else
> > +	/* SCX needs the balance call even in UP, call it explicitly */
> 
> This, *WHY* !?!

This comes from the fact that there are no strict rq boundaries and the BPF
scheduler can share scheduling queues across any subset of CPUs however they
see fit. So, task <-> rq association is flexible until the very last moment
the task gets picked for execution. For the dispatch path to support this,
it needs to be able to migrate tasks across rq's which requires unlocking
the current rq which can only be done in ->balance(). So, sched_ext uses
->balance() to run the dispatch path and thus needs it called even on UP.

Given that UP doesn't need to transfer tasks across, it might be possible to
move the whole dispatch operation into ->pick_next_task() but the current
state would be different, so it's more complicated and will likely be more
brittle.

> > @@ -5876,7 +5894,7 @@ static inline struct task_struct *pick_task(struct rq *rq)
> >  	const struct sched_class *class;
> >  	struct task_struct *p;
> >  
> > -	for_each_class(class) {
> > +	for_each_active_class(class) {
> >  		p = class->pick_task(rq);
> >  		if (p)
> >  			return p;
> 
> 
> But this.. afaict that means that:
> 
>  - the whole EXT thing is incompatible with SCHED_CORE

Can you expand on why this would be? I didn't test against SCHED_CORE, so am
sure things might be broken but can't think of a reason why it'd be
fundamentally incompatible.

>  - the whole EXT thing can be trivially starved by the presence of a
>    single CFS/BATCH/IDLE task.

It's a simliar situation w/ RT vs. CFS, which is resolved via RT having
starvation avoidance. Here, the way it's handled is a bit different, SCX has
a watchdog mechanism implemented in "[PATCH 18/31] sched_ext: Implement
runnable task stall watchdog", so if SCX tasks hang for whatever reason
including being starved by CFS, it will get aborted and all tasks will be
handed back to CFS. IOW, it's treated like any other BPF scheduler errors
that can lead to stalls and recovered the same way.

Thanks.

-- 
tejun