On Mon, May 13, 2024 at 1:04 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Sun, May 05, 2024 at 01:31:26PM -1000, Tejun Heo wrote:
>
> > > You Google/Facebook are touting collaboration, collaborate on fixing it.
> > > Instead of re-posting this over and over. After all, your main
> > > motivation for starting this was the cpu-cgroup overhead.
> >
> > The hierarchical scheduling overhead isn't the main motivation for us. We
> > can't use the CPU controller for all workloads and while it'd be nice to
> > improve that,
>
> Hurmph, I had the impression from the earlier threads that this ~5%
> cgroup overhead was most definitely a problem and a motivator for all
> this.
>
> The overhead was prohibitive, it was claimed, and you needed a solution.
> Did not previous versions use this very argument in order to push for
> all this?
>
> By improving the cgroup mess -- I very much agree that the cgroup thing
> is not very nice. This whole argument goes away and we all get a better
> cgroup implementation.

I talked with pjt to get some historical context on these patches; it
sounds like they were advocated as performance improvements but had
fairness issues that Paul pointed out. We're happy to help take a look
at this again, but it's independent of the motivation for sched_ext.
Sounds like we're on the same page about this now though :)

So, cgroups are not a primary motivator for sched_ext. However, one
aspect of cgroups is made quite a bit nicer by pluggable scheduling:
cgroups are a second-class citizen in CFS, because they are still a
compile-time option, so everything must be built to support a
thread-only model. That makes it really hard to write group schedulers;
fundamentally, task placement, load balancing, etc. operate on lists of
tasks, not lists of cgroups. As you can see, improving the hierarchical
performance of cgroups is nice to have but unrelated to this goal.

> Writing a custom scheduler isn't that hard, simply ripping out
> fair_sched_class and replacing it with something simple really isn't
> *that* hard.

Getting that custom scheduler back into upstream is pretty hard though.
I like Chris' analogy to filesystems because it gives a really good
sense of what a bigger ecosystem might look like with schedulers. It
simply is not feasible to implement certain types of behavior in CFS,
because the behavior is too specialized to particular classes of
workloads, and the model/heuristics could not be made to work in the
general-purpose scheduling environment that CFS strives to provide.

Taking these ideas and putting them into a different scheduling class is
also a bit of a non-starter. The scheduler is optimized to have most
tasks running in a single scheduling class (CFS), and adding classes
brings both additional static overhead and the complexity of ensuring
non-starvation of lower-priority sched classes (due to the strict
priority ordering of sched classes). As an example, a few years ago Xi
posted a simple new scheduling class optimized for high-frequency
context switching (https://lkml.org/lkml/2019/9/6/177), which was
nack'd offline. There's an argument to be made that some of the
functionality there could have been rolled into RT, but I think it
serves as a good example of the friction of adding even a simple new
sched class.
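For contrast, a pluggable policy under sched_ext is just a small BPF
object implementing a handful of struct_ops callbacks. A minimal sketch,
modeled on the scx_simple example that ships with the series; helper and
macro names are taken from the revision I looked at and may differ in
other versions of the patch set:

/*
 * Minimal sched_ext policy sketch (in the style of scx_simple).
 * Assumes the scx common BPF headers from the sched_ext tree; kfunc
 * names (e.g. scx_bpf_dispatch) may differ across patch set revisions.
 */
#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

s32 BPF_STRUCT_OPS(sketch_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu;

	/* Reuse the default idle-CPU selection logic. */
	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
	if (is_idle)
		/* CPU is idle: dispatch straight to its local DSQ. */
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

	return cpu;
}

void BPF_STRUCT_OPS(sketch_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Global FIFO: everything else goes to the shared global DSQ. */
	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

SEC(".struct_ops.link")
struct sched_ext_ops sketch_ops = {
	.select_cpu	= (void *)sketch_select_cpu,
	.enqueue	= (void *)sketch_enqueue,
	.name		= "sketch",
};

That's the whole policy; loading and unloading it is just attaching and
detaching the struct_ops link, with no rebuild or reboot of the kernel.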
Pluggable scheduling helps fill the gap here for policies that have
elements making them a poor fit for CFS, given that we don't want a
plethora of new sched classes in the tree. That's not to say that a
policy which can't be fully integrated into CFS has no value for
upstream contribution; there are pieces that might make sense to adapt
to CFS. For example, we've been experimenting with a policy that
schedules based on cgroups rather than individual tasks, and uses that
to provide better CCX locality for the cgroup, with better control over
how the group spills onto remote CCXs. The group-based scheduling aspect
cannot be easily integrated into CFS, as described above, but the CCX
scheduling portion could find its way into CFS by means of a more
nuanced evaluation of migration_cost, or new placement heuristics.

> But you can easily ignore cgroups, uclamp and a ton of other stuff and
> still boot and play around.

Please don't underestimate the value of being able to swap policies at
runtime. Sure, playing around in a VM isn't bad, but getting performance
data from real hardware can take quite a while between boots, not
including the time to actually restart the workload and have it warmed
up. We're talking several orders of magnitude more latency in the
iterative policy development and analysis loop here. I imagine it would
have been nice, for example, to swap successive versions of EEVDF in and
out while hackbench was actively running, and observe how each swap
changes running averages of latency, etc.

Even in a world where we committed to having a single scheduler, I think
it would be nice if we had in-tree CFS BPF programs that implemented
several of the hooks, just for the purpose of improving development
velocity. What BPF has done for networking could be made to improve
heuristic-heavy areas like select_task_rq, for example.

My point here is that allowing new sched policies is a big benefit, but
placing those policies in BPF is another (separate) big benefit that
sched_ext provides natively.

Best,
Josh