On Mon, Jan 29, 2024 at 05:42:54PM -0500, Joel Fernandes wrote:
> Tejun's address bounced so I am adding the correct one. Thanks.

Ah, thanks, my mistake.

> On 1/29/2024 5:41 PM, Joel Fernandes wrote:
> > On 1/26/2024 4:59 PM, David Vernet wrote:
> >> Hello,
> >>
> >> A few more use cases have emerged for sched_ext that are not yet
> >> supported that I wanted to discuss in the BPF track. Specifically:
> >>
> >> - EAS: Energy Aware Scheduling
> >>
> >> While firmware ultimately controls the frequency of a core, the kernel
> >> does provide frequency scaling knobs such as EPP. It could be useful for
> >> BPF schedulers to have control over these knobs to e.g. hint that
> >> certain cores should keep a lower frequency and operate as E cores.
> >> This could have applications in battery-aware devices, or in other
> >> contexts where applications have e.g. latency-sensitive
> >> compute-intensive workloads.
> >
> > This is a great topic. I think integrating/merging such mechanism with the NEST
> > scheduler could be useful too? You mentioned there is sched_ext implementation
> > of NEST already? One reason that's interesting to me is the task-packing and

Correct -- it's called scx_nest [0].

[0]: https://github.com/sched-ext/scx/blob/main/scheds/c/scx_nest.bpf.c

> > less-spreading may have power benefits, this is exactly what EAS on ARM does,
> > but it also uses an energy model to know when packing is a bad idea. Since we
> > don't have fine grained control of frequency on Intel, I wonder what else can we
> > do to know when the scheduler should pack and when to spread. Maybe something
> > simple which does not need an energy model but packs based on some other
> > signal/heuristic would be great in the short term.

Makes sense. What kinds of signals were you thinking? We can have user
space query for whatever we'd need, and then communicate that to the
kernel via shared maps. Or, probably even more ideal: if we could get
the information we need from tracepoints or kprobes, then we could
possibly avoid having to deal with that and just keep everything in the
kernel.

Note that we don't necessarily have to track just public APIs if we do
all of this in the kernel. If we can access a struct in a tracepoint or
a kprobe, we can read from it and use it in the scheduler however we
want. Of course, none of this comes with any kind of ABI stability
guarantees, but that's one of the features of sched_ext: because the
actual scheduler itself is a _kernel_ program that runs in kernel space,
we can experiment with and implement things without tying anyone's
hands to fully supporting it in the kernel forever.

The user space portion communicates with the BPF scheduler over maps
that are UAPI (part of the BPF UAPI), but the actual scheduler itself is
just a kernel program, and is therefore free to interact with the rest
of the system without making anything UAPI or adding ABI stability
requirements. The contents of what's passed over those maps are not
UAPI, in the same manner that the contents sent over the communication
channels set up by KVM per your other thread [1] would not be UAPI.

[1]: https://lore.kernel.org/all/653c2448-614e-48d6-af31-c5920d688f3e@xxxxxxxxxxxxxxxxx/
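To make the shared-map idea a bit more concrete, here's a rough sketch
of what the BPF side of such a packing signal could look like. To be
clear, this isn't lifted from any existing scheduler -- the map, struct,
field, and ops names below are invented for illustration -- but the
header, the BPF_STRUCT_OPS macro, and bpf_cpumask_test_cpu() are the
same things the schedulers in the scx repo already use:

#include <scx/common.bpf.h>

/*
 * Illustrative only: user space periodically computes a packing hint
 * (from whatever signal/heuristic we settle on) and writes it into this
 * map with bpf_map__update_elem(). The scheduler consults it when
 * selecting a CPU for a waking task.
 */
struct pack_hint {
	u32 pack;		/* 1: pack onto the target core, 0: spread */
	u32 target_cpu;		/* preferred CPU when packing */
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, struct pack_hint);
} pack_hints SEC(".maps");

s32 BPF_STRUCT_OPS(example_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	u32 key = 0;
	struct pack_hint *hint;

	hint = bpf_map_lookup_elem(&pack_hints, &key);
	if (hint && hint->pack &&
	    bpf_cpumask_test_cpu(hint->target_cpu, p->cpus_ptr))
		return hint->target_cpu;

	/*
	 * Otherwise just keep the task where it was; a real scheduler
	 * would want an idle-CPU search here instead.
	 */
	return prev_cpu;
}

The user space half would then just be a loop that computes whatever
heuristic we land on and writes the result into pack_hints via
bpf_map__update_elem() on the skeleton's map.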
> > Maybe a signal can be the "Quality of service (QoS)" approach where tasks with
> > lower QoS are packed more aggressively and higher QoS are spread more (?).
> >
> >>
> >> - Componentized schedulers
> >>
> >> Scheduler implementations today largely have to reinvent the wheel. For
> >> example, if you want to implement a load balancer in rust, you need to
> >> add the necessary fields to the BPF program for tracking load / duty
> >> cycle, and then parse and consume them from the rust side. That's pretty
> >> suboptimal though, as the actual load balancing algorithm itself is
> >> essentially the exact same. The challenge here is that the feature
> >> requires both BPF and user space components to work together. It's not
> >> enough to ship a rust crate -- you need to also ship a BPF object file
> >
> > Maybe I am confused but why does rust userspace code need to link to BPF
> > objects? The BPF object is loaded into the kernel right?

So there are a few pieces at play here:

1. You're correct that the BPF program is loaded into kernel space, but
   the actual BPF bytecode itself is linked statically into the
   application, and the application is what actually makes the syscalls
   (via libbpf) to load the BPF program into the kernel. Here's a
   high-level overview of the workflow for loading a scheduler (there's
   a rough sketch of the libbpf side of this further down):

   - Open the scheduler: This involves libbpf parsing the BPF object
     file passed by the application, and discovering its maps, progs,
     etc. which should be created. At this phase, user space can still
     update any maps in the program, including e.g. read-only maps such
     as .rodata. This allows user space to do things like set the max #
     of CPUs on the system, set debug flags if they were requested by
     the user, etc.

   - Load the scheduler: libbpf creates the BPF maps, does relocations
     for CO-RE [2], and verifies and loads the scheduler into the
     kernel. At this point the program is loaded into the kernel, but
     the scheduler is not actively running yet. User space can no
     longer write to read-only maps in the BPF program, but it can
     still read and write _writable_ maps, and it can in fact do so
     indefinitely throughout the runtime of the scheduler. As described
     below, this is why such features need both a user space portion
     and a BPF object file portion.

   - Attach the scheduler: This actually calls into ext.c to update the
     currently running scheduler to use the BPF sched_ext scheduler.

   [2]: https://nakryiko.com/posts/bpf-core-reference-guide/

2. As alluded to above, the user space program that loaded the
   scheduler can interact with it in real time by reading and writing
   its writable maps. This allows user space to e.g. read some procfs
   values to determine utilization for each core in the system, do some
   load balancing math with floating point numbers based on that data
   and on task weight / duty cycle, and then notify the BPF scheduler
   that it should migrate tasks by writing to shared maps.

   This is exactly what we do in scx_rusty [3]. We track duty cycles
   and load in kernel space (soon we'll only track duty cycles and do
   all load scaling in user space), and then periodically we do a load
   balancing pass in the user-space portion of the scheduler where we
   read those values, use floats, and then signal to the kernel if and
   where it should migrate tasks by writing to maps. This is all done
   async from the perspective of the kernel, so the kernel will check
   the maps to see if there's an update on e.g. the enqueue path.

   [3]: https://github.com/sched-ext/scx/tree/main/scheds/rust/scx_rusty/src

So to summarize -- the rust portion isn't running in the kernel, but it
is influencing the kernel scheduler's decisions by communicating with
it via these shared maps (and the kernel can similarly communicate with
user space in the opposite direction).
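For reference, here's roughly what that open / load / attach sequence
looks like from the user space loader. This is only a sketch -- the
scheduler name (scx_example), the nr_cpus .rodata variable, and the
example_ops struct_ops map are all made up -- but the libbpf calls and
the skeleton pattern are the ones the scx schedulers actually use:

#include <unistd.h>
#include <bpf/libbpf.h>
/* Hypothetical skeleton header, generated with bpftool gen skeleton. */
#include "scx_example.skel.h"

int main(void)
{
	struct scx_example *skel;
	struct bpf_link *link;

	/* Open: parse the embedded BPF object and discover its maps/progs. */
	skel = scx_example__open();
	if (!skel)
		return 1;

	/*
	 * .rodata is still writable at this point, so constants can be
	 * filled in before the program is verified (this assumes the BPF
	 * side declares a "const volatile u32 nr_cpus").
	 */
	skel->rodata->nr_cpus = libbpf_num_possible_cpus();

	/* Load: create maps, perform CO-RE relocations, verify and load. */
	if (scx_example__load(skel))
		return 1;

	/*
	 * Attach: register the struct_ops map, at which point ext.c
	 * switches the system over to this scheduler.
	 */
	link = bpf_map__attach_struct_ops(skel->maps.example_ops);
	if (!link)
		return 1;

	/*
	 * The scheduler is now running. User space can keep reading and
	 * writing the writable maps for as long as it stays attached.
	 */
	while (1)
		sleep(1);
}

The real schedulers wrap this pattern in some helper macros, and the
rust schedulers go through libbpf's rust bindings instead, but the
lifecycle is the same.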
That's why we need both the user space portion and the kernel portion
to be available in order to implement these features. Neither makes
sense without the other.

Note that not every scheduler we've implemented has a robust user space
portion, but every scheduler does have _some_ user space counterpart
which is responsible for loading it. scx_nest.c [4], for example,
doesn't really do anything in user space other than periodically print
out some data that's exported to it from the kernel scheduler via a
shared map. If we wanted to add user-space load balancing to scx_nest,
the same requirements would apply as for the schedulers with a rust
user-space component: we'd need both a user space portion and a
kernel-space portion.

[4]: https://github.com/sched-ext/scx/blob/main/scheds/c/scx_nest.c#L195
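Just to illustrate how thin that user space side can be, scx_nest's
loop boils down to something like the following (the stats struct and
map here are invented for illustration rather than copied from
scx_nest.c, but the shape is the same):

#include <stdio.h>
#include <unistd.h>
#include <bpf/libbpf.h>

/* Hypothetical mirror of a stats struct maintained by the BPF scheduler. */
struct sched_stats {
	unsigned long long nr_primary_dispatches;
	unsigned long long nr_fallback_dispatches;
};

/* Periodically read the stats map exported by the BPF side and print it. */
static void stats_loop(const struct bpf_map *stats_map)
{
	unsigned int key = 0;
	struct sched_stats stats;

	while (1) {
		if (!bpf_map__lookup_elem(stats_map, &key, sizeof(key),
					  &stats, sizeof(stats), 0))
			printf("primary=%llu fallback=%llu\n",
			       stats.nr_primary_dispatches,
			       stats.nr_fallback_dispatches);
		sleep(1);
	}
}

Anything beyond that -- load balancing, policy tuning, etc. -- is
optional, but it has to live in a user space component like this if it
needs floats, syscalls, or anything else the verifier won't allow.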
> >> that your program can link against. And what should the API look like on
> >> both ends? Should rust / BPF have to call into functions to get load
> >> balancing? Or should it be automatically packaged and implemented?
> >>
> >> There are a lot of ways that we can approach this, and it probably
> >> warrants discussing in some more detail
> >
> > But I get the gist of the issue, would be interesting to discuss.

Sounds great, thanks for reading this over.

- David