Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

David Vernet <void@xxxxxxxxxxxxx> · Mon, 24 Jun 2024 17:17:21 -0500

Thomas,

On Sun, Jun 23, 2024 at 10:14:55AM +0200, Thomas Gleixner wrote:
> Chris!
> 
> On Fri, Jun 21 2024 at 17:14, Chris Mason wrote:
> > On 6/21/24 6:46 AM, Thomas Gleixner wrote:
> > I'll be honest, the only clear and consistent communication we've gotten
> > about sched_ext was "no, please go away".  You certainly did engage with
> > face to face discussions, but at the end of the day/week/month the
> > overall message didn't change.
> 
> The only time _I_ really told you "go away" was at OSPM 2023 when you
> approached everyone in the worst possible way. I surely did not even say
> "please" back then.

Respectfully, you have seriously misrepresented the facts of how everything has
played out over the last 18 months of this feature being proposed upstream.  I
think it's important to set the record straight here, so let's get concrete and
look at the timeline for sched_ext's progress since its inception.

1. We actually first met at LPC in 2022, at which time I mentioned this project
to you and asked if you had any thoughts about it. I understood your response
to be, essentially, "I don't like BPF because it introduces UAPI constraints
on the entire kernel." My response was that struct_ops programs and kfuncs
aren't beholden to UAPI constraints as they're exclusively kernel APIs.

This was a random off-the-cuff discussion, and we had never met before, so I
don't fault you for not engaging on this specific point in future discussions.
I do, however, think it's important to highlight that we've been trying to
engage with you on this in one form or another for about 18 months at this
point.

2. The initial RFC [0] is sent out in late 2022. Peter to his credit did leave
quite a few constructive comments that result in material improvements to the
patch set, but the main takeaway was the thread in [1], which essentially
boiled down to, "Hard no to this whole patch set, stop trying to push BPF into
more places and ship random debug patches."

[0]: https://lore.kernel.org/all/20221130082313.3241517-1-tj@xxxxxxxxxx/
[1]: https://lore.kernel.org/all/Y5b++AttvjzyTTJV@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

3. v2 [2] of the patch set is sent to the list in late January 2023. Nobody
from the scheduler community responds.

[2]: https://lore.kernel.org/bpf/20230128001639.3510083-1-tj@xxxxxxxxxx/

4. v3 [3] is sent in mid March. Again, no engagement from the scheduler
community.

[3]: https://lore.kernel.org/all/20230317213333.2174969-1-tj@xxxxxxxxxx/

5. Peter sends this [4] email in a totally separate thread (which you were cc'd
on, and we were not) 2 weeks before OSPM:

[4]: https://lore.kernel.org/lkml/20230406103841.GJ386572@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

> I'm hoping inter-process UMCG can be used to implement custom libpthread that
> would allow running most of userspace under a custom UMCG scheduler and
> obviate the need for this horrible piece of shit eBPF sched thing.

Ok, it's LKML and people say things, but let's be honest here.  You're accusing
_us_ of being malicious and not collaborating with anybody, while you guys are
not only ignoring our patch sets, but also slinging ad-hominems about the
project in completely unrelated threads without even cc'ing us.

6. Also before OSPM, Peter sends the initial EEVDF patch set [5]. I replied [6]
(again, before OSPM) explaining that it severely regresses one of our largest
and most critical workloads, and got no response from Peter.

Anecdotally, a lot of people have complained to me in private discussions at
various conferences that they're very unhappy about EEVDF, and have asked me to
chime in and say something to prevent it from being merged. I decided not to,
both because I didn't want to waste my time (Peter didn't respond when I
engaged before), and because I thought it would be in poor taste to publicly
try and derail Peter's project while at the same time expecting him to engage
with us on a project that I knew he wasn't happy about.

[5]: https://lore.kernel.org/lkml/20230328092622.062917921@xxxxxxxxxxxxx/
[6]: https://lore.kernel.org/lkml/20230410031350.GA49280@maniforge/

Anyways, with all that in mind, when you say this:

> The message people (not only me) perceived was:
> 
>   "The scheduler sucks, sched_ext solves the problems, saves us millions
>    and Google is happy to work with us [after dropping upstream scheduler
>    development a decade ago and leaving the opens for others to mop up]."
> 
> followed by:
> 
>   "You should take it, as it will bring in fresh people to work on the
>    scheduler due to the lower entry barrier [because kernel hacking sucks].
>    This will result in great new ideas which will be contributed back to
>    the scheduler proper."
> 
> That was a really brilliant marketing stunt and I told you so very bluntly.
> 
> It was presumably not your intention, but that's the problem of
> communication between people. Though I haven't seen an useful attempt to
> cure that.

It's really hard to assume good intent here. If we raise issues on the list,
we're ignored (yes, ignored, not oops it fell through my filter). If we bring
them up in in-person discussions, we're made out to be complainers who have no
intention of contributing anything. My perception from having attended these
conferences is that there's literally nothing we can say that will make you
guys happy (beyond meekly agreeing to just scrap everything).

On that note, let's talk about OSPM:

7. Chris and I presented [7] at OSPM in mid-April. There was definitely some
useful discussion that took place, but the basic message we received from both
you and Peter was again, "No, go away." My recollection is that we were
essentially given the same non-technical talking point of, "People will use
this and it will fragment the scheduler space". By the way, Peter also applied
this "people shouldn't use it" perspective to the debugfs knobs like
migration_cost_ns, which we've also used to optimize our workloads by several %.

I'm not sure exactly what we're supposed to do with the feedback of,
(paraphrasing) "Just like the debugfs knobs, people will use this thing even
though they shouldn't be." From my perspective, we were just being told to not
do something that gives us massive performance wins, with no other suggestions
or attempts at collaboration. And no, I wouldn't count your suggestion of "have
user space send more information to the kernel scheduler" as something we could
practically use or apply. More on that specific point below.

[7]: https://www.youtube.com/watch?v=w5bLDpVol_M

Now, I do agree with you that in general I could have delivered our message a
bit better. Yes, I did try to make the argument that sched_ext would benefit
fair.c (which I really do still believe), but Juri also gave me feedback
afterwards that my talk probably would have gone over better if I'd first
submitted patches to fair.c to show some good will. Fair enough. I was still
very new to the Linux kernel world at that point, so I'll take the blame for
not really understanding how things are done. Mea culpa. That said, I did
attempt to submit a patch set that applies a lesson we learned from sched_ext
to fair.c, but ultimately got nowhere with it (more on that below).

> After that clash, the room got into a lively technical discussion about the
> real underlying problem, i.e. that a big part of scheduling issues comes
> from the fact, that there is not enough information about the requirements
> and properties of an application available. Even you agreed with that, if I
> remember correctly.

Well, we agreed with you that it might be a good path forward, but you were
also proposing a hand-wavey idea with nothing concrete and no code behind it.
It felt like a deflection at the time, and it feels like one now. Also, let's
point out that you apparently didn't feel the need to say anything about how
applications should tell the scheduler about their needs for EEVDF. I'd love to
know why it wasn't relevant then (again, just two weeks before OSPM).

> sched_ext does not solve that problem. It just works around it by putting
> the requirements and properties of an application into the BPF scheduler
> and the user space portion of it. That works well in a controlled
> environment like yours, but it does not even remotely help to solve the
> underlying general problems. You acknowledged that and told: But we don't
> have it today, though sched_ext is ready and will help with that.
> 
> The concern that sched_ext will reduce the incentive to work on the
> scheduler proper is not completely unfounded and I've yet to see the
> slightest evidence which proves the contrary.

I think there is a ton of evidence which proves the contrary (XDP, FUSE, etc),
but given that Linus already covered this I don’t think we need to repeat
ourselves.

Anyways, let's continue going over the timeline.

9. An RFC [8] for the shared wakequeue (later called shared runqueue) patches
is sent in June 2023. This patch set was based on experiments conducted in
sched_ext, and I decided it was important to prioritize this based on the
feedback I was given at OSPM. Peter gave a lot of helpful feedback on this
patch set.

[8]: https://lore.kernel.org/lkml/20230613052004.2836135-1-void@xxxxxxxxxxxxx/

10. v2 [9] of the SHARED_RUNQ patch set is sent in July 2023. Peter again gives
a lot of useful feedback. The environment in general feels very productive and
collaborative, but the patch set isn't quite ready yet.

[9]: https://lore.kernel.org/lkml/20230710200342.358255-1-void@xxxxxxxxxxxxx/

9. On the same day as the SHARED_RUNQ patches, v4 [10] of the sched_ext patch
set is sent. After two weeks of silence, Peter decides to respond [11] to this
one with an official NAK, again with no technical or actionable feedback.

[10]: https://lore.kernel.org/lkml/20230711011412.100319-1-tj@xxxxxxxxxx/
[11]: https://lore.kernel.org/lkml/20230726091752.GA3802077@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

10. SHARED_RUNQ v3 [12] is sent in early August. No response from Peter, despite
requesting his input on one or two of the patches. This is an example of why
contributing to the core scheduler is such a pain. I spent at least 3-4 weeks
of time on this patch set, and it ended up going nowhere, partly (but not
entirely) due to Peter disappearing. Frankly, it seems like it got even more
scrutiny than EEVDF did. Eventually, EEVDF ended up causing the feature to not
work as well on hackbench [13], so I stopped bothering.

[12]: https://lore.kernel.org/lkml/20230809221218.163894-1-void@xxxxxxxxxxxxx/
[13]: https://lore.kernel.org/lkml/20231212003141.216236-1-void@xxxxxxxxxxxxx/

11. I presented on sched_ext at Kernel Recipes [14] in September 2023, which
you attended. In a side-channel conversation that you and I had, you reiterated
your point that you thought we were pushing the completely wrong message by
saying that we think this will help fair.c, and made the request that we do
more to make it clear that there won't be a maintenance burden on scheduler
maintainers and distros. In particular, you asked me to make it more obvious
when a sched_ext scheduler is loaded in the event of a system issue so that
scheduler maintainers and distros could ignore any bug reports that come in for
those scenarios. If we did that, you said, you would work with Peter on coming
up with an amicable solution that left everybody happy, and you would chime in
on the list (just like you said you would to Tejun at the Maintainer's Summit)
to make forward progress.

[14]: https://kernel-recipes.org/en/2023/schedule/sched_ext-pluggable-scheduling-in-the-linux-kernel/

After Kernel Recipes, I implemented your request in [15] (see [16] for the
patch in the latest patch set). So your claim that we never did anything to
meet you guys half way on anything is not true. Not only did we actually
implement one of your requests, but our one request to you (to chime in on the
list), you never did.

You've said in other threads that you didn't have cycles for 7 months. Ok, it
happens and ultimately we’re all volunteers when it comes to upstream work, but
frankly I find it very hard to believe that you had literally no time in a 7
month window to review the patch set. Hearing you say that, while also at the
same time trying to accuse us of being non-collaborative and malicious, feels a
bit hypocritical to say the least.

[15]: https://github.com/sched-ext/sched_ext/pull/66
[16]: https://lore.kernel.org/bpf/20240618212056.2833381-15-tj@xxxxxxxxxx/

12. v5 [17] of the series is sent in early November 2023. Once again, we get no
feedback from anyone in the scheduler community.

[17]: https://lore.kernel.org/bpf/20231111024835.2164816-1-tj@xxxxxxxxxx/

13. Maintainers Summit 2023 [18] happens. You, Tejun, and Alexei discussed the
current situation with sched_ext. You raise some issues you have with
integration, and you agree to bring the discussion to the list, which as we all
know at this point, never happened. You also request that we fix the cgroup
hierarchical scheduling mess, even though our only involvement was updating the
existing CPU controller to use cgroup v2 APIs. This was proposed as a trade for
you talking to Peter and letting sched_ext go in. While we didn’t feel great
about a quid-pro-quo for getting sched_ext merged, we agreed to discuss it with
Google and get back to you.

[18]: https://lwn.net/Articles/951847/

14. After maybe 6-ish weeks, we aligned with Google about dedicating resources
to fixing the cgroup hierarchical scheduling mess, purely as a token of
goodwill to you and Peter. At this point we started trying to send you private
pings to coordinate (obviously we weren’t about to sink a ton of time into it
without circling back with you first). We sent you several private pings
between this point and when v6 landed, with no responses.

> Don't tell me that this is impossible because sched_ext is not yet
> upstream. It's used in production successfully as you said, so there
> clearly must be something to learn from which could be shared at least in
> form of data. OSPM24 would have been a great place for that especially as
> the requirements and properties discussion was continued there with a plan.

15. You can't be serious.

Firstly, sched_ext was discussed at OSPM 2024. Andrea Righi presented [19] on
scx_rustland. I wasn't able to attend because I had other obligations, but it
was certainly discussed. Also, if you had replied to any of our private pings
and asked to meet at OSPM 2024, we could have absolutely made time for it.

[19]: https://www.youtube.com/watch?v=HQRHo8E_4Ks

But regardless, let's take a moment to reflect on what you're trying to claim
here and in your other emails about our supposed lack of collaboration.  You’re
saying we should have used sched_ext to help solve the underlying problems in
the scheduler, and that it was a mistake to not attend OSPM 2024?  Here's a
list of what we have done over the last 18 months:

- I've been attending Steven Rostedt's monthly scheduler meeting regularly, and
  have discussed sched_ext at length with many people there, such as Juri,
  Daniel Bristot de Oliveira, Joel Fernandes, Steven, and Youssef Esmat. We’ve
  also discussed EEVDF, and possible improvements that could be implemented
  following fruitful sched_ext experiments.
- I've attended and presented at a multitude of other conferences, including
  OSPM 2023, LSFMM (multiple years), KR, and LPC
- Tejun attended the Maintainers Summit in 2023
- We cc'd Peter on every single patch set
- We sent you many private emails

Yet, you’re trying to claim that we should have attended OSPM 2024 and shared
some data that could have been used to improve the scheduler, and because we
didn’t, _we’re_ the ones who don’t want to collaborate? Sorry, but any
perceived lack of data sharing on our part is 100% due to your guys’ lack of
effort or desire to interact with us in literally any medium, or at any
location. It really feels like you just picked OSPM 2024 because you realized
it was the one conference that neither Tejun nor I could attend. For the
record, you didn’t reach out to either of us to discuss meeting there. I would
have made it work if it was important to you guys. Well, come to think of it,
you hadn’t communicated with us in literally any capacity until Linus agreed to
take the series in this patch set, so I guess it goes without saying that you
didn’t ping us for OSPM 2024.

16. We send out v6 [20], and get public support for the project from two
distros, Valve, Google, ChromeOS, etc. Linus decides that it's time to merge
the project, and now all of a sudden you come out of the woodwork and start
slinging mud and accusing us of not collaborating. And here we are now.

[20]: https://lore.kernel.org/bpf/20240501151312.635565-1-tj@xxxxxxxxxx/

QED. _That’s_ what's actually happened over the last 18 months. We've made
repeated attempts to collaborate, even going so far as agreeing to your private
request that we fix the cgroup hierarchical mess, in a desperate bid to try and
somehow make you guys happy and enable us to work collaboratively. Yet only now
do you join the conversation, after countless private pings and private
agreements that you didn't honor, once Linus _forced your hand_, only to accuse
us of being unwilling to cooperate?

If I sound indignant, it’s because I am. You guys made the decision to approach
every single conversation with the singular purpose of trying to get the
project derailed. Fine, I understand that you don't like it, and that you
probably wouldn't have implemented pluggable scheduling with BPF if you had a
choice. But to now come in at the 11th hour and try to blame _us_ for not
collaborating with you, when it was you who ignored emails, slung mud, and
failed to honor spoken agreements, is pretty brazen.

All of that said, we of course remain committed to all the things we've said
about working together with the community upstream. I actually totally agree
with you that it would be a good idea to clean up the integration points. As
we've said before, we didn't do that originally because we were trying to have
as small of a footprint as possible in the code that you guys would have to
deal with (which by the way was also in line with the feedback you gave me at
KR). But no worries, now that the record is cleared, we’re happy to move
forward and work with you. It’s been our goal the entire time.

> At all other occasions, I sat down with people and discussed at a technical
> level, but also clearly asked to resolve the social rift which all of this
> created.

As mentioned above, this was discussed in person, but you never met us half
way. There's only so much we can do if you choose to ignore all of our private
email pings and ghost us for 7 months (actually closer to 10 months if you
count our discussion at KR 2023).

Chris responded to the rest of your email, so I'll cut my already excessively
long reply here. The one last thing that I do want to say is that I really hope we
can eventually put this ugliness behind us. I admire how you think about and
approach software engineering, and I would love for your input on how we can do
things better. I'm sorry that this reply had to be so serious and accusatory,
but you forced our hand by approaching this entire conversation this way, and
by being blatantly dishonest about our private discussions and private efforts
to reach out to you to collaborate. Hopefully we can have beers in Vienna, and
move on.

Thanks,
David