Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Fri, 21 Jun 2024 09:34:22 -0700

On Fri, 21 Jun 2024 at 02:35, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> >
> > But secondly, the "keep things out" is itself counter-productive.
>
> Says the one who kept asking me repeatedly whether I can't keep the
> remaining stuff of RT out of tree forever. The last time you asked that
> was not that long ago.

Thomas, you are starting to just make made-up arguments now.

Did I require that the RT patches be done right over two decades? Hell
yes. But those RT patches didn't change some single subsystem, they
made fundamental changes to the most core things out there.

The RT patches made something as fundamental and core as "disable
interrupts for spinlocks" be a special magical thing that normal
people weren't allowed to do, and that turned into a sleeping lock.
They affected _everything_. They very literally affected subsystems
and rules that had been there since linux-0.01 (not the spinlocks -
they came later - but things like tty / printk were in fact some of
the very first things written).

And don't get me wrong - I'm not complaining about the RT patches. I
think they improved things enormously in the end. They've been great.

I'm just saying that they are _not_ the norm to compare against.

The sched_ext patches? Not even remotely in the same class. The
sched_ext patches are more like fuse - another filesystem, just a
slightly odd one that has that big user space component.

And yes, fuse caused worries too because of the whole "filesystem
development in user space" outside the confines of the normal kernel
development model. It mostly just was (and is) a shim-layer that asks
user space for policy, while still using the kernel infrastructure for
most real work. Did those worries actually materialize? No. No they
did not.

Now, fuse to some degree was easier, because while it was merged about
two decades ago, even by that time we already had over a decade of
pluggable filesystems. So the VFS layer had a lot of pluggability
already, and it wasn't very hardcoded.

The scheduler also has the scheduler class pluggability, but it *is*
fairly hardcoded. So in addition to the kinds of issues FUSE had (with
all the "impedance matching" between the kernel interface and the user
level interfaces), the sched_ext patches have some of that hardcoded
pluggability that it needs to fix up.

So maybe a better comparison would be autofs - but autofs was added
_so_ long ago that I only vaguely remember the odd pain points for
waiting for mount points. Over the years, all of that special sauce
that no other filesystem needs has become such an integral part of the
vfs layer that people don't even think about it any more.

You can most definitely still see the effects of autofs - as opposed
to "regular" filesystems - on the vfs layer if you go look (things
like "finish_automount()" and friends), but it's been integrated for
so long that it's a non-issue. But back in the day, it all needed what
at the time was some special glue.

Anyway, what I'm saying is that you trying to equate this with the RT
patches is absolutely laughable and intellectually dishonest.

Look, ignoring the actual sched_ext code itself (and the examples,
documentation and test-cases which I certainly hope nobody would
object to), the actual core footprint ot fht sched_ext patches is
this:

 MAINTAINERS                       |  13 +++
 Makefile                          |   8 +-
 drivers/tty/sysrq.c               |   1 +
 include/asm-generic/vmlinux.lds.h |   1 +
 include/linux/cgroup.h            |   4 +-
 include/linux/sched.h             |   5 +
 include/linux/sched/task.h        |   3 +-
 include/uapi/linux/sched.h        |   1 +
 init/init_task.c                  |  12 +++
 kernel/Kconfig.preempt            |  24 +++++
 kernel/fork.c                     |  17 +++-
 kernel/sched/build_policy.c       |  10 ++
 kernel/sched/core.c               | 187 ++++++++++++++++++++++++++++++--------
 kernel/sched/debug.c              |   3 +
 kernel/sched/ext.h                | 114 +++++++++++++++++++++++
 kernel/sched/fair.c               |  21 ++---
 kernel/sched/idle.c               |   2 +
 kernel/sched/sched.h              |  75 ++++++++++++++-
 kernel/sched/syscalls.c           |  26 ++++++
 lib/dump_stack.c                  |   1 +
 20 files changed, 464 insertions(+), 64 deletions(-)

and honestly, I went through it all. None of it looks really
objectionable. Some of it is trivial cleanups and makes the code look
better (the "refactor" parts).

And yes, some of it is "We have a new scheduler class that wants more".

Realistically, of the above, I think the *only* parts that are even at
all halfway interesting are these:

 kernel/sched/core.c               | 187 ++++++++++++++++++++++++++++++--------
 kernel/sched/ext.h                | 114 +++++++++++++++++++++++

and that ext.h is on that "interesting" list only because of
for_each_active_class(), and I think it should probably just be in
core.c.

See why I do not think that this is AT ALL comparable to something
like the RT patches.

And yes, I do think code should be cleaned up, but the two cleanups
that struck me personally were literally just

 (a) scx_next_task_picked, where I slept on it, came up with a
three-line alternative suggestion, and Tejun sent the "real patch"
almost immediately

 (b) the for_each_active_class() thing that I think would actually be
better off just being done explicitly in sched/core.c, but probably
only makes sense after integration

and it really strikes me as "neither of these issues were worth nine
months of delay".

Although hopefully the nine months weren't entirely unproductive, and
maybe the patches have improved. Knock wood.

Because *most* of the two files above are actually normal stuff.
Things that other scheduler classes already do. Sometimes just with an
#ifdef around it (although many of the ifdef's are basically hidden as
"inline function that is empty if disabled" - which is how we tend to
do things in the kernel in general.

So it's not even "200+ lines of objectionable code". No. It's all fairly small.

Arguably much of the strangeness comes from "it can be enabled /
disabled both statically and dynamically".

The "dynamically disable" code may look particularly odd because some
of that is the "scx_switched_all()" thing that basically disables some
of the CFS special cases. Is that pretty? No, but in most cases I'd
say that the cause of said issue was that the CFS rules were
hardcoded.

So it's pretty small, it's pretty self-contained, and it only affects
one single subsystem.

See where I'm coming from?

> Aside of that you are completely ignoring my point.
>
> Collaborative integration is the right thing to do no matter what.

Yes. And am willing to say "ok, if three more months make this more
amicable, I'll delay merging for another release".

Amicable resolutions are obviously preferred. And I am certainly
willing to going back and telling people "ok, it's going to be 6.12,
not 6.11". Some people are going to be disappointed in me. That's my
job, and I don't think this needs to be *rushed*.

But really, in the very next sentence of "I don't think this needs to
be *rushed*", I do want to re-iterate that I feel like this pointless
delay this has seen seems unnecessary.

I hope I have explained above why I think this whole thing has not
been worth the brouhaha and pain that it has caused. It's really not
that big, and the issues I see have either had small clean trivial
solutions, or have very much looked to me like "ok, it's not
integrated, but I see why an external patch would try to do it that
way".

And yes, I may be missing some big ugly point. But I really _have_
been going through those roughly 200 lines of diff that actually seem
to be relevant to this whole argument. Multiple times.

So if I don't see it, please point it out very very explicitly, and
using small words to make me understand.

                Linus