Hello,

On Wed, Dec 14, 2022 at 09:55:38AM +0100, Peter Zijlstra wrote:
> On Tue, Dec 13, 2022 at 06:11:38PM -0800, Josh Don wrote:
> > Improving scheduling performance requires rapid iteration to explore
> > new policies and tune parameters, especially as hardware becomes more
> > heterogeneous, and applications become more complex. Waiting months
> > between evaluating scheduler policy changes is simply not scalable,
> > but this is the reality with large fleets that require time for
> > testing, qualification, and progressive rollout. The security angle
> > should be clear from how involved it was to integrate core scheduling,
> > for example.
>
> Surely you can evaluate stuff on a small subset of machines -- I'm
> fairly sure I've had google and facebook people tell me they do just
> that, roll out the test kernel on tens to hundreds of thousand of
> machines instead of the stupid number and see how it behaves there.
>
> Statistics has something here I think, you can get a reliable
> representation of stuff without having to sample *everyone*.

Google guys probably have a lot to say here too and there may be many
commonalities, but here's how things are on our end.

We (Meta) experiment and debug at multiple levels. For example, when
qualifying a new kernel or feature, a common pattern we follow is
two-phased. The first phase is testing it on several well-known and widely
used workloads in a controlled experiment environment with a smaller
number of machines, usually some dozens, though that can go one or two
orders of magnitude higher. Once that looks okay, the second phase is to
deploy gradually while monitoring system-level behaviors (crashes,
utilization, latency and pressure metrics and so on) and feedback from
service owners.

We run tens of thousands of different workloads in the fleet and we try
hard to do as much as possible in the first phase, but many of the
difficult and subtle problems are only detectable in the second phase.
When we detect such problems in the second phase, we triage the problem,
pull back the deployment if necessary, and then restart after fixing.

As the overused saying goes, quantity has a quality of its own. The
workloads become largely opaque because there are so many of them doing so
many different things that it's impractical for anyone on the system side
to examine each of them. In many cases, the best and sometimes only
visibility we get is statistical - comparing two chunks of the fleet which
are large enough for the statistical signals to overcome the noise. That
threshold can be pretty high. Multiple hundreds of thousands of machines
being used for a test set isn't all that uncommon.

One complicating factor for the second phase is that we're deploying on a
production fleet running live production workloads. Besides the obvious
fact that users become mightily unhappy when machines crash, there are
complicating matters like limits on how many and which machines can be
rebooted at any given time, due to interactions with capacity and
maintenance, which severely restrict how fast kernels can be iterated. A
full sweep through the fleet can easily take months.

Between a large number of opaque workloads and production constraints
which limit the type and speed of kernel iterations, our ability to
experiment with scheduling by modifying the kernel directly is severely
limited. We can do small things but trying out big ideas can become
logistically prohibitive.

Note that all these get even worse for public cloud operators.
If we really need to, we can at least find the service owner and talk with
them. For public cloud operators, the workloads are truly opaque.

There's yet another aspect which is caused by fleet dynamism. When we're
hunting down a scheduling misbehavior and want to test out specific ideas,
it can actually be pretty difficult to get back the same workload
composition after a reboot or crash. The fleet management layer will kick
in right away and the workloads get reallocated who-knows-where. This
problem is likely shared by smaller scale operations too. There are just a
lot of layers which are difficult to keep fixed across reboots and
crashes. Even for the same workload, the load balancer or dispatcher might
behave very differently toward the machine after a reboot.

> I was given to believe this was a fairly rapid process.

Going back to the first phase, where we're experimenting in a more
controlled environment: yes, that is a faster process, but only in
comparison to the second phase. Some controlled experiments, the faster
ones, usually take several hours to obtain a meaningful result. It just
takes a while for production workloads to start, jit-compile all the hot
code paths, warm up caches and so on. Others, unfortunately, take a lot
longer to ramp up to the point where the results can be compared against
production numbers. Some of the benchmarks stretch over multiple days.

With SCX, we can just keep hotswapping and tuning the scheduler behavior,
getting results in tens of minutes instead of multiple hours, and without
worrying about crashing the test machines, which often has side effects on
the benchmark setup - the benchmarks are often performed with shadowed
production traffic using the same production software, which gets unhappy
when a lot of machines crash. These problems can easily take hours to
resolve.

> Just because you guys have more machines than is reasonable, doesn't
> mean we have to put BPF everywhere.

There are some problems which are specific to large operators like us or
Google for sure, but many of these problems are shared by other use cases
which need to test with real-world applications. Even on mobile devices,
it's way easier and faster to keep a running test environment and iterate
through scheduling behavior changes without worrying about crashing the
machine than to power-cycle and re-establish the test setup for each
iteration. The productivity gain extends to individual kernel developers
and researchers. Just rebooting server-class hardware often takes upwards
of ten minutes, so most of us try to iterate as much on VMs as possible,
which unfortunately doesn't work out too well for subtle performance
issues. SCX can easily cut down iteration time by an order of magnitude or
more.

> Additionally, we don't merge and ship everybodies random debug patch
> either -- you're free to do whatever you need to iterate on your own and
> then send the patches that result from this experiment upstream. This is
> how development works, no?

We of course don't merge random debug patches which have limited
usefulness to a small number of use cases. However, we absolutely do ship
code to support debugging and development when the benefit outweighs the
cost - lockdep, perf, tracing and all the memory debug options, just to
list several examples. The argument is that, given the current hardware
and software landscape, a BPF-extensible scheduling framework has enough
benefits to justify the cost.

Thanks.

--
tejun