On Mon, May 13, 2024 at 1:04 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Sun, May 05, 2024 at 01:31:26PM -1000, Tejun Heo wrote:
>
> > > You Google/Facebook are touting collaboration, collaborate on fixing it.
> > > Instead of re-posting this over and over. After all, your main
> > > motivation for starting this was the cpu-cgroup overhead.
> >
> > The hierarchical scheduling overhead isn't the main motivation for us. We
> > can't use the CPU controller for all workloads and while it'd be nice to
> > improve that,
>
> Hurmph, I had the impression from the earlier threads that this ~5%
> cgroup overhead was most definitely a problem and a motivator for all
> this.
>
> The overhead was prohibitive, it was claimed, and you needed a solution.
> Did not previous versions use this very argument in order to push for
> all this?
>
> By improving the cgroup mess -- I very much agree that the cgroup thing
> is not very nice. This whole argument goes away and we all get a better
> cgroup implementation.

I talked with pjt to get some historical context on these patches; it
sounds like they were advocated as performance improvements but had
fairness issues that Paul pointed out. We're happy to help take a look
at this again, but it's independent of the motivation for sched_ext.
Sounds like we're on the same page about this now though :)

So, cgroups are not a primary motivator for sched_ext. However, one
aspect of cgroups is made quite a bit nicer by pluggable scheduling:
cgroups are a second-class citizen in CFS, because they are still a
compile-time option, so everything must be built to support a
thread-only model. That makes it really hard to write group schedulers;
fundamentally, task placement, load balancing, etc. operate on lists of
tasks, not lists of cgroups. As you can see, improving the hierarchical
performance of cgroups is nice to have but unrelated to this goal.

> Writing a custom scheduler isn't that hard, simply ripping out
> fair_sched_class and replacing it with something simple really isn't
> *that* hard.

Getting that custom scheduler back into upstream is pretty hard though.
I like Chris' analogy to filesystems because it gives a really good
sense of what a bigger ecosystem might look like with schedulers. It
simply is not feasible to implement certain types of behavior in CFS,
because the behavior is too specialized to particular classes of
workloads, and the model/heuristics could not be made to work in the
general-purpose scheduling environment that CFS strives to provide.

Taking these ideas and putting them into a different scheduling class is
also a bit of a non-starter. The scheduler is optimized to have most
tasks running in a single scheduling class (CFS), and adding classes
brings both additional static overhead and the complexity of ensuring
non-starvation of lower-priority sched classes (due to the strict
priority ordering of sched classes). As an example, a few years ago Xi
posted a simple new scheduling class optimized for high-frequency
context switching (https://lkml.org/lkml/2019/9/6/177), which was
nack'd offline. There's an argument to be made that some of the
functionality there could have been rolled into RT, but I think it
serves as a good example of the friction of adding even a simple new
sched class.
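For contrast, a pluggable policy under sched_ext is just a small BPF
object implementing a handful of struct_ops callbacks. A minimal sketch,
modeled on the scx_simple example that ships with the series; helper and
macro names are taken from the revision I looked at and may differ in
other versions of the patch set:

/*
 * Minimal sched_ext policy sketch (in the style of scx_simple).
 * Assumes the scx common BPF headers from the sched_ext tree; kfunc
 * names (e.g. scx_bpf_dispatch) may differ across patch set revisions.
 */
#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

s32 BPF_STRUCT_OPS(sketch_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu;

	/* Reuse the default idle-CPU selection logic. */
	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
	if (is_idle)
		/* CPU is idle: dispatch straight to its local DSQ. */
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

	return cpu;
}

void BPF_STRUCT_OPS(sketch_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Global FIFO: everything else goes to the shared global DSQ. */
	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

SEC(".struct_ops.link")
struct sched_ext_ops sketch_ops = {
	.select_cpu	= (void *)sketch_select_cpu,
	.enqueue	= (void *)sketch_enqueue,
	.name		= "sketch",
};

That's the whole policy; loading and unloading it is just attaching and
detaching the struct_ops link, with no rebuild or reboot of the kernel.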
Pluggable scheduling helps fill the gap here for policies that have
elements making them a poor fit for CFS, given that we don't want a
plethora of new sched classes in the tree. That's not to say that a
policy which can't be fully integrated into CFS has no value for
upstream contribution; there are pieces that might make sense to adapt
to CFS. For example, we've been experimenting with a policy that
schedules based on cgroups rather than individual tasks, and uses that
to provide better CCX locality for the cgroup, with better control over
how the group spills onto remote CCXs. The group-based scheduling aspect
cannot be easily integrated into CFS, as described above, but the CCX
scheduling portion could find its way into CFS by means of a more
nuanced evaluation of migration_cost, or new placement heuristics.

> But you can easily ignore cgroups, uclamp and a ton of other stuff and
> still boot and play around.

Please don't underestimate the value of being able to swap policies at
runtime. Sure, playing around in a VM isn't bad, but getting performance
data from real hardware can take quite a while between boots, not
including the time to actually restart the workload and have it warmed
up. We're talking several orders of magnitude more latency in the
iterative policy development and analysis loop here. I imagine it would
have been nice, for example, to swap successive versions of EEVDF in and
out while hackbench was actively running, and observe how each swap
changes running averages of latency, etc.

Even in a world where we committed to having a single scheduler, I think
it would be nice if we had in-tree CFS BPF programs that implemented
several of the hooks, just for the purpose of improving development
velocity. What BPF has done for networking could be made to improve
heuristic-heavy areas like select_task_rq, for example.

My point here is that allowing new sched policies is a big benefit, but
placing those policies in BPF is another (separate) big benefit that
sched_ext provides natively.

Best,
Josh