Linus! On Wed, Jun 19 2024 at 22:07, Linus Torvalds wrote: > On Wed, 19 Jun 2024 at 19:35, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote: >> >> When I sat there in Richmond with the sched_ext people I gave them very >> deep technical feedback especially on the way how they integrate it: >> >> Sprinkle hooks and callbacks all over the place until it works by some >> definition of works. > > Are we even talking about the same thing? Yes we do. > But "sprinkle hooks and callbacks all over the place"? There are too many places which add scx_***() invocations. That's what I asked to be cleaned up and to be generalized so it becomes uniform over the scheduler classes. Sure one could argue that some of these calls are at places where some other scheduling class dropped already one, but that's wrong because accumulating bad code just creates more technical debt. If cleaned up then the already existing hacks vanish into proper class callbacks, which improves the existing code and allows to drop in sched ext more naturally at the end. One example I very explicitely mentioned back then is the dance around fork(). It took me at least an hour last year to grok the convoluted logic and it did not get any faster when I stared at it today again. fork() sched_fork() scx_pre_fork() percpu_down_rwsem(&scx_fork_rwsem); if (dl_prio(p)) { ret = -EINVAL; goto cancel; // required to release the semaphore } sched_cgroup_fork() return scx_fork(); sched_post_fork() scx_post_fork() percpu_up_rwsem(&scx_fork_rwsem); Plus the extra scx_cancel_fork() which releases the scx_fork_rwsem in case that any call after sched_fork() fails. My head still spins from deciphering this once more. What has scx_fork() to do with sched_cgroup_fork()? It's completely non-obvious and the lack of comments does not help either. The changelog is handwaving at best. scx_pre_fork() takes the semaphore unconditionally independent of the scheduler class of the forking task and even in the case that no BPF scheduler is loaded or active. Why? Neither the changelog nor the lack of comments give any hint, which is also not a new complaint from me. A proper cleanup would just eliminate the unconditional down(), the dl_prio() cancel logic plus the whole if/elseif dance including the #ifdef SCHED_EXT. It's not rocket science to come up with the obvious: ret = p->sched_class->pre_fork(); if (ret) return ret; That's just proper engineering which is what some famous programmer tells people to do for a very long time: "I’m a huge proponent of designing your code around the data, rather than the other way around, ... I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships." And that's not only proper engineering it's also the other approach the same famous programmer tells people to do: "I want them to lull me into a safe and cozy world where the stuff they are pushing is actually useful to mainline people _first_." IOW, give me something which is useful to me _first_ so that you can add your particular flavor of crazy on top without bothering me. You obviously can complain now about the crazy people who actually listen to what that famous programmer is saying. :) But the above is not only true for that famous programmer personally, that's equally true for any maintainer who has to deal with the result of a submission for a long time. I'm still not seeing the general mainline people benefit of all this, so I have to trust you that there is one which is beyond my comprehension skills. That said, I'm more than wary about the hidden locking scheme of that percpu semaphore in the fork path, but that's a different design question to be debated on the way. There are some other technical details which need to be sorted including the interaction with other pending code, but that's something which can be solved as we move forward. Ideally this can be shaped in a way so that the scheduler becomes closer to being modular, which would be the real useful thing for research and not just the advertisment version of it. But wait a moment, that can't happen as pluggable schedulers have been rejected in the past: "I absolutely *detest* pluggable schedulers." Guess which famous programmer said that. > And scx_next_task_picked() isn't pretty - as far as I understand, it's > because there's only a "class X picked" callback ("pick_next_task()"), > and no way to tell other classes they weren't picked. > Could things like that next_active_class() perhaps be done more > prettily? I'm sure. Well spotted. > But I get the very strong feeling that people wanted to limit the > amount of changes they made to the core scheduler code. Which is exactly the point. If the existing code does not let your new feature fall into place, then refactor it so it does. Working around the short comings at some other place is patently wrong and that's not something new either. Requesting such refactoring is not an undue burden because the people who maintain the code will have to deal with the result. Unwrapping this stuff after the fact is the worst thing to do and I definitely have an expert opinion on this. None of this is rocket science and could have been done long ago even without me holding hands. >> I clearly offered you to try to resolve this amicably within a >> reasonable time frame. >> >> How exaclty is that equivalent to "continue to do nothing" ? > > So if we actually *can* resolve this amicably in three months, then > that sounds worth it. > > But my reaction is "what changed"? Nothing has become more amicable in > the last nine months. What makes the next three months special? The difference is: 1) Peter is back and the capacity problem is less bad than it was 2) As I explained before I unfortunately did not have cycles in the past _seven_ months, but I can make the cycles available now and drive this forward. That's possible when both sides are willing to cooperate. I'm sure there is enough incentive to do so. 3) I'm going to focus on the integration and interaction aspect and grudgingly leave the cgroup trainwreck out of it. If someone from the crowd who caused it actually had the courtesy to mop this up, that would be a great signal for everyone and it can be done in parallel. Not that I'm holding my breath, but I'm still a hopeless optimist. If you don't trust me on that, then we have a very different problem. Thanks, Thomas