Hello, On Mon, Dec 12, 2022 at 01:53:55PM +0100, Peter Zijlstra wrote: > Perhaps the saner way to write this is: > > #ifdef CONFIG_SMP > BUG_ON(!sched_class_above(&stop_sched_class, &dl_sched_class)); > #endif > BUG_ON(!sched_class_above(&dl_sched_class, &rt_sched_class)); > BUG_ON(!sched_class_above(&rt_sched_class, &fair_sched_class)); > BUG_ON(!sched_class_above(&fair_sched_class, &idle_sched_class)); > #ifdef CONFIG_... > BUG_ON(!sched_class_above(&fair_sched_class, &ext_sched_class)); > BUG_ON(!sched_class_above(&ext_sched_class, &idle_sched_class)); > #endif Yeah, this looks way better. Will update. > > +static inline const struct sched_class *next_active_class(const struct sched_class *class) > > +{ > > + class++; > > + if (!scx_enabled() && class == &ext_sched_class) > > + class++; > > + return class; > > +} > > + > > +#define for_active_class_range(class, _from, _to) \ > > + for (class = (_from); class != (_to); class = next_active_class(class)) > > + > > +#define for_each_active_class(class) \ > > + for_active_class_range(class, __sched_class_highest, __sched_class_lowest) > > + > > +/* > > + * SCX requires a balance() call before every pick_next_task() call including > > + * when waking up from idle. > > + */ > > +#define for_balance_class_range(class, prev_class, end_class) \ > > + for_active_class_range(class, (prev_class) > &ext_sched_class ? \ > > + &ext_sched_class : (prev_class), (end_class)) > > + > > This seems quite insane; why not simply make the ext methods effective > no-ops? Both balance and pick explicitly support that already, no? Yeah, we can definitely do that. It's just nice to guarantee from the core code that we aren't calling into a sched class which isn't enabled at the moment. If you take a look at "[PATCH 20/31] sched_ext: Allow BPF schedulers to switch all eligible tasks into sched_ext", it's used the other way around too to elide calling into CFS when it knows that there are no tasks there. If the core code doesn't elide the calls, we might need some gating in CFS too. > > @@ -5800,10 +5812,13 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev, > > * We can terminate the balance pass as soon as we know there is > > * a runnable task of @class priority or higher. > > */ > > - for_class_range(class, prev->sched_class, &idle_sched_class) { > > + for_balance_class_range(class, prev->sched_class, &idle_sched_class) { > > if (class->balance(rq, prev, rf)) > > break; > > } > > +#else > > + /* SCX needs the balance call even in UP, call it explicitly */ > > This, *WHY* !?! This comes from the fact that there are no strict rq boundaries and the BPF scheduler can share scheduling queues across any subset of CPUs however they see fit. So, task <-> rq association is flexible until the very last moment the task gets picked for execution. For the dispatch path to support this, it needs to be able to migrate tasks across rq's which requires unlocking the current rq which can only be done in ->balance(). So, sched_ext uses ->balance() to run the dispatch path and thus needs it called even on UP. Given that UP doesn't need to transfer tasks across, it might be possible to move the whole dispatch operation into ->pick_next_task() but the current state would be different, so it's more complicated and will likely be more brittle. > > @@ -5876,7 +5894,7 @@ static inline struct task_struct *pick_task(struct rq *rq) > > const struct sched_class *class; > > struct task_struct *p; > > > > - for_each_class(class) { > > + for_each_active_class(class) { > > p = class->pick_task(rq); > > if (p) > > return p; > > > But this.. afaict that means that: > > - the whole EXT thing is incompatible with SCHED_CORE Can you expand on why this would be? I didn't test against SCHED_CORE, so am sure things might be broken but can't think of a reason why it'd be fundamentally incompatible. > - the whole EXT thing can be trivially starved by the presence of a > single CFS/BATCH/IDLE task. It's a simliar situation w/ RT vs. CFS, which is resolved via RT having starvation avoidance. Here, the way it's handled is a bit different, SCX has a watchdog mechanism implemented in "[PATCH 18/31] sched_ext: Implement runnable task stall watchdog", so if SCX tasks hang for whatever reason including being starved by CFS, it will get aborted and all tasks will be handed back to CFS. IOW, it's treated like any other BPF scheduler errors that can lead to stalls and recovered the same way. Thanks. -- tejun