On Tue, May 14, 2024 at 01:07:15AM +0100, Qais Yousef wrote:

[...]

> > >
> > > How does this BPF muck translate into better quality patches for me?
> >
> > Here's how we will be using it (we will likely be porting sched_ext to
> > ChromeOS regardless of its acceptance).
> >
> > Doing testing of scheduler changes in the field is extremely time
> > consuming and complex. We tested EEVDF vs CFS by backporting EEVDF to
> > 5.15 (as that is the kernel version we are using on the chromebooks we
> > were testing on), and then we need to add a user space "switch" to
> > change the scheduler. Note, this also risks causing a bug in adding
> > these changes. Then we push the kernel out, and then start our
> > experiment that enables our feature to a small percentage, and slowly
> > increases the number of users until we have a enough for a statistical
> > result.
> >
> > What sched_ext would give us is a easy way to try different scheduling
> > algorithms and get feedback much quicker. Once we determine a solution
> > that improves things, we would then spend the time to implement it in
> > the scheduler, and yes, send it upstream.
> >
> > To me, sched_ext should never be the final solution, but it can be
> > extremely useful in testing various changes quickly in the field. Which
> > to me would encourage more contributions.

Hello Qais,

[...]

> I really don't buy the rapid development aspect too. The scheduler was heavily

There are already several examples from users who have shown that the rapid
development and experimentation is extremely useful. Imagine if you're
iterating on the scheduler to improve p99 frame rates on the Steam Deck, as
Changwoo described. It's much more efficient to be able to just tweak and load
a BPF scheduler (that is safe and can't crash the machine) to try some random
idea out than it is to:

1. Tweak and recompile the kernel
2. Reinstall the kernel on the Steam Deck
3. Reboot the Steam Deck
4. Reload a game and let caches rewarm
5. Measure FPS

You're talking about a 5 second compile job + 1 second to reload a safe BPF
scheduler vs. having to do all of the above steps _and_ potentially making a
mistake that brings the machine down. These benefits are also extremely useful
for testing workloads on production servers, etc. Let’s also not forget that
unlike many other kernel features, you probably can’t get reliable scheduling
results from running in a VM. The experimentation overhead is very real.

[...]

> influenced by the early contributors which come from server market that had
> (few) very specific workloads they needed to optimize for and throughput had
> a heavier weight vs latency. Fast forward to now, things are different. Even on
> server market latency/responsiveness has become more important. Power and
> thermal are important on a larger class of systems now too. I'd dare say even
> on server market. How do you know when it's okay for an app/task to consume too
> much power and when it is not? Hint hint, you can't unless someone in userspace
> tells you. Similarly for latency vs throughput. What is the correct way to
> write an application to provide this info? Then we can ask what is missing in
> the scheduler to enable this.

Hmm, you seem to be arguing that the way forward here is to have our one
general purpose scheduler be entirely driven by user space hinting. Assuming
I’m not misunderstanding you, I strongly disagree with this sentiment. User
space hinting can be powerful, but I think we need to have a general purpose
scheduler that's completely agnostic to whatever is running in user space.
We’ve also been able to get strong results from sched_ext schedulers that
don’t use any user space hinting.

Also, even if this ended up being the way forward, I don’t see it being
practical to implement. Wouldn’t it require us to update all of user space
globally just to update how it interfaces with the scheduler?

[...]

> Note the original min/wakeup_granularity_ns, latency_ns etc were tuned by
> default for throughput by the way (server market bias). You can manipulate
> those and get better latencies. Those knobs aren't available anymore in EEVDF.

[...]

> point IMO, not the scheduler algorithm. If the latter need to change, it needs
> to be as the result of this friction - which what EEVDF came about from to my
> understanding. To enable implementing a latency interface easier. But Vincent
> had a working implementation with CFS too which I think would have worked fine
> by the way.

This friction is nothing new. It's why we already find ourselves in the
unfortunate position of having a large corpus of out of tree scheduler
patches. If there is a lot of performance being left on the table, vendors are
going to find a way to get that performance. Corporations don't need our
consent to ship kernels with custom schedulers on their devices. They've
already been doing it for years, and it's ultimately the users who suffer.

I genuinely believe that the fair.c scheduler will benefit from being able to
apply ideas conceived in a sched_ext scheduler which end up working well for
general use cases. For example, in scx_rusty, we’re able to get very good
interactivity [0] by determining a task’s deadline as a function of its
average runtime (along with some other great ideas that Changwoo first added
to scx_lavd) rather than from its eligibility + slice, as EEVDF does. Over the
course of a day or two, I tried way more ideas that didn’t work than would
have been possible in that time frame with a recompile-reboot cycle, and ended
up finding one that seems to work very well. It would be awesome if these
ideas were added to EEVDF so that everyone can benefit.

[0]: https://drive.google.com/file/d/1fyHt9BYGha6apl7HAkibwpy52UTi8-AQ/view?usp=drive_link

Thanks,
David