Thomas,

On Sun, Jun 23, 2024 at 10:14:55AM +0200, Thomas Gleixner wrote:
> Chris!
>
> On Fri, Jun 21 2024 at 17:14, Chris Mason wrote:
> > On 6/21/24 6:46 AM, Thomas Gleixner wrote:
> > I'll be honest, the only clear and consistent communication we've gotten
> > about sched_ext was "no, please go away". You certainly did engage with
> > face to face discussions, but at the end of the day/week/month the
> > overall message didn't change.
>
> The only time _I_ really told you "go away" was at OSPM 2023 when you
> approached everyone in the worst possible way. I surely did not even say
> "please" back then.

Respectfully, you have seriously misrepresented the facts of how everything has played out over the last 18 months of this feature being proposed upstream. I think it's important to set the record straight here, so let's get concrete and look at the timeline for sched_ext's progress since its inception.

1. We actually first met at LPC in 2022, at which time I mentioned this project to you and asked if you had any thoughts about it. I understood your response to be, essentially, "I don't like BPF because it introduces UAPI constraints on the entire kernel." My response was that struct_ops programs and kfuncs aren't beholden to UAPI constraints as they're exclusively kernel APIs. This was a random off-the-cuff discussion, and we had never met before, so I don't fault you for not engaging on this specific point in future discussions. I do, however, think it's important to highlight that we've been trying to engage with you on this in one form or another for about 18 months at this point.

2. The initial RFC [0] is sent out in late 2022. Peter to his credit did leave quite a few constructive comments that resulted in material improvements to the patch set, but the main takeaway was the thread in [1], which essentially boiled down to, "Hard no to this whole patch set, stop trying to push BPF into more places and ship random debug patches."
[0]: https://lore.kernel.org/all/20221130082313.3241517-1-tj@xxxxxxxxxx/
[1]: https://lore.kernel.org/all/Y5b++AttvjzyTTJV@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

3. v2 [2] of the patch set is sent to the list in late January 2023. Nobody from the scheduler community responds.

[2]: https://lore.kernel.org/bpf/20230128001639.3510083-1-tj@xxxxxxxxxx/

4. v3 [3] is sent in mid March. Again, no engagement from the scheduler community.

[3]: https://lore.kernel.org/all/20230317213333.2174969-1-tj@xxxxxxxxxx/

5. Peter sends this [4] email in a totally separate thread (which you were cc'd on, and we were not) 2 weeks before OSPM:

[4]: https://lore.kernel.org/lkml/20230406103841.GJ386572@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

> I'm hoping inter-process UMCG can be used to implement custom libpthread that
> would allow running most of userspace under a custom UMCG scheduler and
> obviate the need for this horrible piece of shit eBPF sched thing.

Ok, it's LKML and people say things, but let's be honest here. You're accusing _us_ of being malicious and not collaborating with anybody, while you guys are not only ignoring our patch sets, but also slinging ad-hominems about the project in completely unrelated threads without even cc'ing us.

6. Also before OSPM, Peter sends the initial EEVDF patch set [5]. I replied [6] (again, before OSPM) explaining that it severely regresses one of our largest and most critical workloads, and got no response from Peter. Anecdotally, a lot of people have complained to me in private discussions at various conferences that they're very unhappy about EEVDF, and have asked me to chime in and say something to prevent it from being merged. I decided not to, both because I didn't want to waste my time (Peter didn't respond when I engaged before), and because I thought it would be in poor taste to publicly try and derail Peter's project while at the same time expecting him to engage with us on a project that I knew he wasn't happy about.
[5]: https://lore.kernel.org/lkml/20230328092622.062917921@xxxxxxxxxxxxx/
[6]: https://lore.kernel.org/lkml/20230410031350.GA49280@maniforge/

Anyways, with all that in mind, when you say this:

> The message people (not only me) perceived was:
>
> "The scheduler sucks, sched_ext solves the problems, saves us millions
> and Google is happy to work with us [after dropping upstream scheduler
> development a decade ago and leaving the opens for others to mop up]."
>
> followed by:
>
> "You should take it, as it will bring in fresh people to work on the
> scheduler due to the lower entry barrier [because kernel hacking sucks].
> This will result in great new ideas which will be contributed back to
> the scheduler proper."
>
> That was a really brilliant marketing stunt and I told you so very bluntly.
>
> It was presumably not your intention, but that's the problem of
> communication between people. Though I haven't seen an useful attempt to
> cure that.

It's really hard to assume good intent here. If we raise issues on the list, we're ignored (yes, ignored, not oops it fell through my filter). If we bring them up in in-person discussions, we're made out to be complainers who have no intention of contributing anything. My perception from having attended these conferences is that there's literally nothing we can say that will make you guys happy (beyond meekly agreeing to just scrap everything).

On that note, let's talk about OSPM:

7. Chris and I presented [7] at OSPM in mid-April. There was definitely some useful discussion that took place, but the basic message we received from both you and Peter was again, "No, go away." My recollection is that we were essentially given the same non-technical talking point of, "People will use this and it will fragment the scheduler space". By the way, Peter also applied this "people shouldn't use it" perspective to the debugfs knobs like migration_cost_ns, which we've also used to optimize our workloads by several %.
I'm not sure exactly what we're supposed to do with the feedback of, (paraphrasing) "Just like the debugfs knobs, people will use this thing even though they shouldn't be." From my perspective, we were just being told to not do something that gives us massive performance wins, with no other suggestions or attempts at collaboration. And no, I wouldn't count your suggestion of "have user space send more information to the kernel scheduler" as something we could practically use or apply. More on that specific point below.

[7]: https://www.youtube.com/watch?v=w5bLDpVol_M

Now, I do agree with you that in general I could have delivered our message a bit better. Yes, I did try to make the argument that sched_ext would benefit fair.c (which I really do still believe), but Juri also gave me feedback afterwards that my talk probably would have gone over better if I'd first submitted patches to fair.c to show some good will. Fair enough. I was still very new to the Linux kernel world at that point, so I'll take the blame for not really understanding how things are done. Mea culpa. That said, I did attempt to submit a patch set that applies a lesson we learned from sched_ext to fair.c, but ultimately got nowhere with it (more on that below).

> After that clash, the room got into a lively technical discussion about the
> real underlying problem, i.e. that a big part of scheduling issues comes
> from the fact, that there is not enough information about the requirements
> and properties of an application available. Even you agreed with that, if I
> remember correctly.

Well, we agreed with you that it might be a good path forward, but you were also proposing a hand-wavey idea with nothing concrete and no code behind it. It felt like a deflection at the time, and it feels like one now. Also, let's point out that you apparently didn't feel the need to say anything about how applications should tell the scheduler about their needs for EEVDF.
I'd love to know why it wasn't relevant then (again, just two weeks before OSPM).

> sched_ext does not solve that problem. It just works around it by putting
> the requirements and properties of an application into the BPF scheduler
> and the user space portion of it. That works well in a controlled
> environment like yours, but it does not even remotely help to solve the
> underlying general problems. You acknowledged that and told: But we don't
> have it today, though sched_ext is ready and will help with that.
>
> The concern that sched_ext will reduce the incentive to work on the
> scheduler proper is not completely unfounded and I've yet to see the
> slightest evidence which proves the contrary.

I think there is a ton of evidence which proves the contrary (XDP, FUSE, etc), but given that Linus already covered this I don’t think we need to repeat ourselves.

Anyways, let's continue going over the timeline.

9. An RFC [8] for the shared wakequeue (later called shared runqueue) patches is sent in June 2023. This patch set was based on experiments conducted in sched_ext, and I decided it was important to prioritize this based on the feedback I was given at OSPM. Peter gave a lot of helpful feedback on this patch set.

[8]: https://lore.kernel.org/lkml/20230613052004.2836135-1-void@xxxxxxxxxxxxx/

10. v2 [9] of the SHARED_RUNQ patch set is sent in July 2023. Peter again gives a lot of useful feedback. The environment in general feels very productive and collaborative, but the patch set isn't quite ready yet.

[9]: https://lore.kernel.org/lkml/20230710200342.358255-1-void@xxxxxxxxxxxxx/

9. On the same day as the SHARED_RUNQ patches, v4 [10] of the sched_ext patch set is sent. After two weeks of silence, Peter decides to respond [11] to this one with an official NAK, again with no technical or actionable feedback.
[10]: https://lore.kernel.org/lkml/20230711011412.100319-1-tj@xxxxxxxxxx/
[11]: https://lore.kernel.org/lkml/20230726091752.GA3802077@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

10. SHARED_RUNQ v3 [12] is sent in early August. No response from Peter, despite requesting his input on one or two of the patches. This is an example of why contributing to the core scheduler is such a pain. I spent at least 3-4 weeks of time on this patch set, and it ended up going nowhere, partly (but not entirely) due to Peter disappearing. Frankly, it seems like it got even more scrutiny than EEVDF did. Eventually, EEVDF ended up causing the feature to not work as well on hackbench [13], so I stopped bothering.

[12]: https://lore.kernel.org/lkml/20230809221218.163894-1-void@xxxxxxxxxxxxx/
[13]: https://lore.kernel.org/lkml/20231212003141.216236-1-void@xxxxxxxxxxxxx/

11. I presented on sched_ext at Kernel Recipes [14] in September 2023, which you attended. In a side-channel conversation that you and I had, you reiterated your point that you thought we were pushing the completely wrong message by saying that we think this will help fair.c, and made the request that we do more to make it clear that there won't be a maintenance burden on scheduler maintainers and distros. In particular, you asked me to make it more obvious when a sched_ext scheduler is loaded in the event of a system issue so that scheduler maintainers and distros could ignore any bug reports that come in for those scenarios. If we did that, you said, you would work with Peter on coming up with an amicable solution that left everybody happy, and you would chime in on the list (just like you said you would to Tejun at the Maintainer's Summit) to make forward progress.

[14]: https://kernel-recipes.org/en/2023/schedule/sched_ext-pluggable-scheduling-in-the-linux-kernel/

After Kernel Recipes, I implemented your request in [15] (see [16] for the patch in the latest patch set).
So your claim that we never did anything to meet you guys half way on anything is not true. Not only did we actually implement one of your requests, but our one request to you (to chime in on the list), you never did. You've said in other threads that you didn't have cycles for 7 months. Ok, it happens and ultimately we’re all volunteers when it comes to upstream work, but frankly I find it very hard to believe that you had literally no time in a 7 month window to review the patch set. Hearing you say that, while also at the same time trying to accuse us of being non-collaborative and malicious, feels a bit hypocritical to say the least.

[15]: https://github.com/sched-ext/sched_ext/pull/66
[16]: https://lore.kernel.org/bpf/20240618212056.2833381-15-tj@xxxxxxxxxx/

12. v5 [17] of the series is sent in early November 2023. Once again, we get no feedback from anyone in the scheduler community.

[17]: https://lore.kernel.org/bpf/20231111024835.2164816-1-tj@xxxxxxxxxx/

13. Maintainers Summit 2023 [18] happens. You, Tejun, and Alexei discussed the current situation with sched_ext. You raise some issues you have with integration, and you agree to bring the discussion to the list, which as we all know at this point, never happened. You also request that we fix the cgroup hierarchical scheduling mess, even though our only involvement was updating the existing CPU controller to use cgroup v2 APIs. This was proposed as a trade for you talking to Peter and letting sched_ext go in. While we didn’t feel great about a quid-pro-quo for getting sched_ext merged, we agreed to discuss it with Google and get back to you.

[18]: https://lwn.net/Articles/951847/

14. After maybe 6-ish weeks, we aligned with Google about dedicating resources to fixing the cgroup hierarchical scheduling mess, purely as a token of goodwill to you and Peter.
At this point we started trying to send you private pings to coordinate (obviously we weren’t about to sink a ton of time into it without circling back with you first). We sent you several private pings between this point and when v6 landed, with no responses.

> Don't tell me that this is impossible because sched_ext is not yet
> upstream. It's used in production successfully as you said, so there
> clearly must be something to learn from which could be shared at least in
> form of data. OSPM24 would have been a great place for that especially as
> the requirements and properties discussion was continued there with a plan.

15. You can't be serious. Firstly, sched_ext was discussed at OSPM 2024. Andrea Righi presented [19] on scx_rustland. I wasn't able to attend because I had other obligations, but it was certainly discussed. Also, if you had replied to any of our private pings and asked to meet at OSPM 2024, we could have absolutely made time for it.

[19]: https://www.youtube.com/watch?v=HQRHo8E_4Ks

But regardless, let's take a moment to reflect on what you're trying to claim here and in your other emails about our supposed lack of collaboration. You’re saying we should have used sched_ext to help solve the underlying problems in the scheduler, and that it was a mistake to not attend OSPM 2024?

Here's a list of what we have done over the last 18 months:

- I've been attending Steven Rostedt's monthly scheduler meeting regularly, and have discussed sched_ext at length with many people there, such as Juri, Daniel Bristot de Oliveira, Joel Fernandes, Steven, and Youssef Esmat. We’ve also discussed EEVDF, and possible improvements that could be implemented following fruitful sched_ext experiments.
- I've attended and presented at a multitude of other conferences, including OSPM 2023, LSFMM (multiple years), KR, and LPC
- Tejun attended the Maintainers Summit in 2023
- We cc'd Peter on every single patch set
- We sent you many private emails

Yet, you’re trying to claim that we should have attended OSPM 2024 and shared some data that could have been used to improve the scheduler, and because we didn’t, _we’re_ the ones who don’t want to collaborate? Sorry, but any perceived lack of data sharing on our part is 100% due to your guys’ lack of effort or desire to interact with us in literally any medium, or at any location. It really feels like you just picked OSPM 2024 because you realized it was the one conference that neither Tejun nor I could attend. For the record, you didn’t reach out to either of us to discuss meeting there. I would have made it work if it was important to you guys. Well, come to think of it, you hadn’t communicated with us in literally any capacity until Linus agreed to take the series in this patch set, so I guess it goes without saying that you didn’t ping us for OSPM 2024.

16. We send out v6 [20], and get public support for the project from two distros, Valve, Google, ChromeOS, etc. Linus decides that it's time to merge the project, and now all of a sudden you come out of the woodwork and start slinging mud and accusing us of not collaborating. And here we are now.

[20]: https://lore.kernel.org/bpf/20240501151312.635565-1-tj@xxxxxxxxxx/

QED. _That’s_ what's actually happened over the last 18 months. We've made repeated attempts to collaborate, even going so far as agreeing to your private request that we fix the cgroup hierarchical mess, in a desperate bid to try and somehow make you guys happy and enable us to work collaboratively. Yet only now do you join the conversation, after countless private pings and private agreements that you didn't honor, once Linus _forced your hand_, only to accuse us of being unwilling to cooperate?
If I sound indignant, it’s because I am. You guys made the decision to approach every single conversation with the singular purpose of trying to get the project derailed. Fine, I understand that you don't like it, and that you probably wouldn't have implemented pluggable scheduling with BPF if you had a choice. But to now come in at the 11th hour and try to blame _us_ for not collaborating with you, when it was you who ignored emails, slung mud, and failed to honor spoken agreements, is pretty brazen.

All of that said, we of course remain committed to all the things we've said about working together with the community upstream. I actually totally agree with you that it would be a good idea to clean up the integration points. As we've said before, we didn't do that originally because we were trying to have as small of a footprint as possible in the code that you guys would have to deal with (which by the way was also in line with the feedback you gave me at KR). But no worries, now that the record is cleared, we’re happy to move forward and work with you. It’s been our goal the entire time.

> At all other occasions, I sat down with people and discussed at a technical
> level, but also clearly asked to resolve the social rift which all of this
> created.

As mentioned above, this was discussed in person, but you never met us half way. There's only so much we can do if you choose to ignore all of our private email pings and ghost us for 7 months (actually closer to 10 months if you count our discussion at KR 2023).

Chris responded to the rest of your email, so I'll cut my already excessively long reply here. The one last thing that I do want to say is that I really hope we can eventually put this ugliness behind us. I admire how you think about and approach software engineering, and I would love for your input on how we can do things better.
I'm sorry that this reply had to be so serious and accusatory, but you forced our hand by approaching this entire conversation this way, and by being blatantly dishonest about our private discussions and private efforts to reach out to you to collaborate. Hopefully we can have beers in Vienna, and move on.

Thanks,
David