Hello,

On Fri, Jul 21, 2023 at 11:37 AM Tejun Heo <tj@xxxxxxxxxx> wrote:
[snip]
> We are comfortable with the current API. Everything we tried fit pretty
> well. It will continue to evolve but sched_ext now seems mature enough for
> initial inclusion. I suppose lack of response doesn't indicate tacit
> agreement from everyone, so what are you guys all thinking?

I want to reiterate Google's support for this proposal. We've been
experimenting with pluggable scheduling via our ghOSt framework
(https://github.com/google/ghost-kernel) for quite a while now, and a few
things have become evident.

(1) There is a non-trivial amount of headroom that can be captured by
policies that specialize more closely to the types of workloads deployed
on a machine. I can give two direct examples.

In Search, the backend application has intimate knowledge of its thread
workloads and RPC deadlines, which it communicates directly to our BPF
scheduler via BPF maps (a rough sketch of this hint-passing is below).
We've used this information to construct a policy that reduces context
switches, decreases p99 latency, and increases QPS by 5% in testing. This
flexibility in expressing priority goes far beyond what niceness or
cpu.shares can achieve.

For VM workloads, we've been testing a policy that has virtually
eliminated our >10ms latency tails via a combination of deadline and fair
scheduling, using an approach inspired by Tableau
(https://arpangujarati.github.io/pdfs/eurosys2018_paper.pdf). I find this
case particularly appealing from a pluggable scheduling perspective
because it highlights an area where specializing to the type of workload
(VMs, which prefer longer, gang-scheduled, uninterrupted, and predictably
low-latency access to CPU) provides clear benefits, yet would not be
appropriate for a general-purpose scheduler like CFS.

(2) Sched knobs are incredibly useful, and tuning them has real effects.
The scheduler exports various debugfs knobs to control its behavior, such
as minimum granularity, overall sched latency, and migration cost. Their
defaults are largely baked into the kernel with semi-arbitrary values.
But, experimentally, it makes a lot of sense to (for example) lower the
migration cost on certain combinations of hardware and workload, accepting
a higher migration rate in exchange for less non-work-conserving behavior.
We've taken this idea further with an ML-based system that automatically
finds the best combination of sched knobs for a given workload and goal,
such as maximizing QPS. This has yielded gains of 2-5%, which is a lot of
performance to leave on the table simply by sticking with the preset
defaults. Pluggable scheduling would further increase the surface area for
this kind of experimentation, and yield additional insight into which
other kernel heuristics could be improved. It was from the ML work that we
gleaned that migrating tasks in smaller batches, but more frequently, was
a better tradeoff than the default configuration.
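To make the Search example in (1) a bit more concrete, here is a minimal
sketch of the hint-passing described above: the application publishes
per-thread deadlines into a BPF map, and the BPF scheduler consults them
on its enqueue path. The names and layout are hypothetical and the
surrounding sched_ext callbacks are elided; this illustrates the shape of
the interface rather than our actual policy.

  /* Illustrative only: per-thread scheduling hints shared between the
   * application and a BPF scheduler. Names and layout are hypothetical.
   */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct task_hint {
          __u64 deadline_ns;      /* absolute RPC deadline set by the app */
          __u32 band;             /* application-defined priority band */
  };

  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 65536);
          __type(key, __u32);                     /* tid */
          __type(value, struct task_hint);
  } task_hints SEC(".maps");

  /* Called from the scheduler's enqueue path (the exact hook depends on
   * the sched_ext revision); fall back to a default when no hint exists.
   */
  static __always_inline __u64 task_deadline(__u32 tid, __u64 dflt)
  {
          struct task_hint *hint = bpf_map_lookup_elem(&task_hints, &tid);

          return hint ? hint->deadline_ns : dflt;
  }

The userspace side simply updates task_hints (bpf_map_update_elem() on the
map fd) whenever a thread picks up a new RPC, so the scheduler can read
the hint directly in BPF with no round trip to userspace at decision time.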
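Similarly, for the knob tuning in (2): the values our tuning system sweeps
are ordinary sched debugfs files (on recent kernels under
/sys/kernel/debug/sched/; older kernels expose them via sysctl, and exact
names vary by version). Applying one candidate setting is as trivial as an
echo into the file; the hypothetical snippet below does the same from C,
e.g. as part of a tuning agent. It needs root and a mounted debugfs, and
the specific value is only an example of "cheaper migrations" on one
machine/workload combination, not a recommendation.

  /* Hypothetical example: apply one candidate value for a sched debugfs
   * knob as part of a tuning sweep. Knob paths vary by kernel version.
   */
  #include <stdio.h>

  static int write_knob(const char *path, long long val)
  {
          FILE *f = fopen(path, "w");

          if (!f)
                  return -1;
          fprintf(f, "%lld\n", val);
          return fclose(f);
  }

  int main(void)
  {
          /* Lower the migration cost: more migrations, but less
           * non-work-conserving behavior on this particular setup.
           */
          return write_knob("/sys/kernel/debug/sched/migration_cost_ns",
                            250000);
  }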
(3) There are a number of really interesting scheduling ideas that would
be difficult or infeasible to bake into the kernel. One clear example is
core scheduling, which was quite a complex addition to the kernel (due,
for example, to managing fairness across tasks spanning the logical CPUs
of a core), but which has a relatively straightforward implementation in
sched_ext and ghOSt. In ghOSt, for instance, a single cpu can issue a
transaction to run tasks on both itself and its sibling, achieving the
security property core scheduling needs; fairness follows easily because
runqueues in userspace can take any shape, such as per-core. Another
interesting idea is to offload scheduling entirely from VM cores in order
to keep ticks stopped with NOHZ regardless of the task count, since
preemptive scheduling can be driven by a remote core.

Moving forward, we're planning to redesign our ghOSt userspace
infrastructure to work on top of the sched_ext kernel infrastructure. We
think there's a lot of benefit to the sched_ext design, especially the
very tight BPF integration. We're committed to the idea of pluggable
scheduling, and are in close collaboration with Meta to advance this work
while we simultaneously deploy it internally.

Best,
Josh