Hello,

On Fri, Jul 21, 2023 at 11:37 AM Tejun Heo <tj@xxxxxxxxxx> wrote:
[snip]
> We are comfortable with the current API. Everything we tried fit pretty
> well. It will continue to evolve but sched_ext now seems mature enough for
> initial inclusion. I suppose lack of response doesn't indicate tacit
> agreement from everyone, so what are you guys all thinking?

I want to reiterate Google's support for this proposal. We've been
experimenting with pluggable scheduling via our ghOSt framework
(https://github.com/google/ghost-kernel) for quite a while now, and a few
things have become evident.

(1) There is a non-trivial amount of headroom that can be captured by
policies that specialize more closely to the types of workloads deployed
on a machine. I can give two direct examples.

In Search, the backend application has intimate knowledge of its thread
workloads and RPC deadlines, which it communicates directly to our BPF
scheduler via BPF maps (a rough sketch of this hint-passing is below).
We've used this information to construct a policy that reduces context
switches, decreases p99 latency, and increases QPS by 5% in testing. This
flexibility in expressing priority goes far beyond what niceness or
cpu.shares can achieve.

For VM workloads, we've been testing a policy that has virtually
eliminated our >10ms latency tails via a combination of deadline and fair
scheduling, using an approach inspired by Tableau
(https://arpangujarati.github.io/pdfs/eurosys2018_paper.pdf). I find this
case particularly appealing from a pluggable scheduling perspective
because it highlights an area where specializing to the type of workload
(VMs, which prefer longer, gang-scheduled, uninterrupted, and predictably
low-latency access to CPU) provides clear benefits, yet would not be
appropriate for a general-purpose scheduler like CFS.

(2) Sched knobs are incredibly useful, and tuning them has real effects.
The scheduler exports various debugfs knobs to control its behavior, such
as minimum granularity, overall sched latency, and migration cost. Their
defaults are largely baked into the kernel with semi-arbitrary values.
But, experimentally, it makes a lot of sense to (for example) lower the
migration cost on certain combinations of hardware and workload, accepting
a higher migration rate in exchange for less non-work-conserving behavior.
We've taken this idea further with an ML-based system that automatically
finds the best combination of sched knobs for a given workload and goal,
such as maximizing QPS. This has yielded gains of 2-5%, which is a lot of
performance to leave on the table simply by sticking with the preset
defaults. Pluggable scheduling would further increase the surface area for
this kind of experimentation, and yield additional insight into which
other kernel heuristics could be improved. It was from the ML work that we
gleaned that migrating tasks in smaller batches, but more frequently, was
a better tradeoff than the default configuration.
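To make the Search example in (1) a bit more concrete, here is a minimal
sketch of the hint-passing described above: the application publishes
per-thread deadlines into a BPF map, and the BPF scheduler consults them
on its enqueue path. The names and layout are hypothetical and the
surrounding sched_ext callbacks are elided; this illustrates the shape of
the interface rather than our actual policy.

  /* Illustrative only: per-thread scheduling hints shared between the
   * application and a BPF scheduler. Names and layout are hypothetical.
   */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct task_hint {
          __u64 deadline_ns;      /* absolute RPC deadline set by the app */
          __u32 band;             /* application-defined priority band */
  };

  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 65536);
          __type(key, __u32);                     /* tid */
          __type(value, struct task_hint);
  } task_hints SEC(".maps");

  /* Called from the scheduler's enqueue path (the exact hook depends on
   * the sched_ext revision); fall back to a default when no hint exists.
   */
  static __always_inline __u64 task_deadline(__u32 tid, __u64 dflt)
  {
          struct task_hint *hint = bpf_map_lookup_elem(&task_hints, &tid);

          return hint ? hint->deadline_ns : dflt;
  }

The userspace side simply updates task_hints (bpf_map_update_elem() on the
map fd) whenever a thread picks up a new RPC, so the scheduler can read
the hint directly in BPF with no round trip to userspace at decision time.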
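Similarly, for the knob tuning in (2): the values our tuning system sweeps
are ordinary sched debugfs files (on recent kernels under
/sys/kernel/debug/sched/; older kernels expose them via sysctl, and exact
names vary by version). Applying one candidate setting is as trivial as an
echo into the file; the hypothetical snippet below does the same from C,
e.g. as part of a tuning agent. It needs root and a mounted debugfs, and
the specific value is only an example of "cheaper migrations" on one
machine/workload combination, not a recommendation.

  /* Hypothetical example: apply one candidate value for a sched debugfs
   * knob as part of a tuning sweep. Knob paths vary by kernel version.
   */
  #include <stdio.h>

  static int write_knob(const char *path, long long val)
  {
          FILE *f = fopen(path, "w");

          if (!f)
                  return -1;
          fprintf(f, "%lld\n", val);
          return fclose(f);
  }

  int main(void)
  {
          /* Lower the migration cost: more migrations, but less
           * non-work-conserving behavior on this particular setup.
           */
          return write_knob("/sys/kernel/debug/sched/migration_cost_ns",
                            250000);
  }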
(3) There are a number of really interesting scheduling ideas that would
be difficult or infeasible to bake into the kernel. One clear example is
core scheduling, which was quite a complex addition to the kernel (due,
for example, to managing fairness across tasks spanning the logical CPUs
of a core), but which has a relatively straightforward implementation in
sched_ext and ghOSt. In ghOSt, for instance, a single cpu can issue a
transaction to run tasks on both itself and its sibling, achieving the
security property core scheduling needs; fairness follows easily because
runqueues in userspace can take any shape, such as per-core. Another
interesting idea is to offload scheduling entirely from VM cores in order
to keep ticks stopped with NOHZ regardless of the task count, since
preemptive scheduling can be driven by a remote core.

Moving forward, we're planning to redesign our ghOSt userspace
infrastructure to work on top of the sched_ext kernel infrastructure. We
think there's a lot of benefit to the sched_ext design, especially the
very tight BPF integration. We're committed to the idea of pluggable
scheduling, and are in close collaboration with Meta to advance this work
while we simultaneously deploy it internally.

Best,
Josh