On Fri, Dec 15, 2023, Vineeth Remanan Pillai wrote:
> > > I get your point. A generic way would have been more preferable, but I
> > > feel the scenario we are tackling is a bit more time critical and kvm
> > > is better equipped to handle this. kvm has control over the VM/vcpu
> > > execution and hence it can take action in the most effective way.
> >
> > No, KVM most definitely does not. Between sched, KVM, and userspace, I would
> > rank KVM a very distant third. Userspace controls when to do KVM_RUN, to which
> > cgroup(s) a vCPU task is assigned, the affinity of the task, etc. sched decides
> > when and where to run a vCPU task based on input from userspace.
> >
> > Only in some edge cases that are largely unique to overcommitted CPUs does KVM
> > have any input on scheduling whatsoever. And even then, KVM's view is largely
> > limited to a single VM, e.g. teaching KVM to yield to a vCPU running in a different
> > VM would be interesting, to say the least.
> >
> Over committed case is exactly what we are trying to tackle.

Yes, I know. I was objecting to the assertion that "kvm has control over the
VM/vcpu execution and hence it can take action in the most effective way". In
overcommit use cases, KVM has some *influence*, and in non-overcommit use cases,
KVM is essentially not in the picture at all.

> Sorry for not making this clear in the cover letter. ChromeOS runs on low-end
> devices (eg: 2C/2T cpus) and does not have enough compute capacity to
> offload scheduling decisions. In-band scheduling decisions gave the
> best results.
>
> > > One example is the place where we handle boost/unboost. By the time
> > > you come out of kvm to userspace it would be too late.
> >
> > Making scheduling decisions in userspace doesn't require KVM to exit to userspace.
> > It doesn't even need to require a VM-Exit to KVM. E.g. if the scheduler (whether
> > it's in kernel or userspace) is running on a different logical CPU(s), then there's
> > no need to trigger a VM-Exit because the scheduler can incorporate information
> > about a vCPU in real time, and interrupt the vCPU if and only if something else
> > needs to run on that associated CPU. From the sched_ext cover letter:
> >
> > : Google has also experimented with some promising, novel scheduling policies.
> > : One example is “central” scheduling, wherein a single CPU makes all
> > : scheduling decisions for the entire system. This allows most cores on the
> > : system to be fully dedicated to running workloads, and can have significant
> > : performance improvements for certain use cases. For example, central
> > : scheduling with VCPUs can avoid expensive vmexits and cache flushes, by
> > : instead delegating the responsibility of preemption checks from the tick to
> > : a single CPU. See scx_central.bpf.c for a simple example of a central
> > : scheduling policy built in sched_ext.
> >
> This makes sense when the host has enough compute resources for
> offloading scheduling decisions.

Yeah, again, I know. The point I am trying to get across is that this RFC only
benefits/handles one use case, and doesn't have line of sight to being extensible
to other use cases.

> > > As you mentioned, custom contract between guest and host userspace is
> > > really flexible, but I believe tackling scheduling(especially latency)
> > > issues is a bit more difficult with generic approaches. Here kvm does
> > > have some information known only to kvm(which could be shared - eg:
> > > interrupt injection) but more importantly kvm has some unique
> > > capabilities when it comes to scheduling. kvm and scheduler are
> > > cooperating currently for various cases like, steal time accounting,
> > > vcpu preemption state, spinlock handling etc. We could possibly try to
> > > extend it a little further in a non-intrusive way.
> >
> > I'm not too worried about the code being intrusive, I'm worried about the
> > maintainability, longevity, and applicability of this approach.
> >
> > IMO, this has a significantly lower ceiling than what is possible with something
> > like sched_ext, e.g. it requires a host tick to make scheduling decisions, and
> > because it'd require a kernel-defined ABI, would essentially be limited to knobs
> > that are broadly useful. I.e. every bit of information that you want to add to
> > the guest/host ABI will need to get approval from at least the affected subsystems
> > in the guest, from KVM, and possibly from the host scheduler too. That's going
> > to make for a very high bar.
> >
> Just thinking out loud, The ABI could be very simple to start with. A
> shared page with dedicated guest and host areas. Guest fills details
> about its priority requirements, host fills details about the actions
> it took(boost/unboost, priority/sched class etc). Passing this
> information could be in-band or out-of-band. out-of-band could be used
> by dedicated userland schedulers. If both guest and host agrees on
> in-band during guest startup, kvm could hand over the data to
> scheduler using a scheduler callback. I feel this small addition to
> kvm could be maintainable and by leaving the protocol for interpreting
> shared memory to guest and host, this would be very generic and cater
> to multiple use cases. Something like above could be used both by
> low-end devices and high-end server like systems and guest and host
> could have custom protocols to interpret the data and make decisions.
>
> In this RFC, we have a miniature form of the above, where we have a
> shared memory area and the scheduler callback is basically
> sched_setscheduler. But it could be made very generic as part of ABI
> design. For out-of-band schedulers, this call back could be setup by
> sched_ext, a userland scheduler and any similar out-of-band scheduler.
>
> I agree, getting a consensus and approval is non-trivial. IMHO, this
> use case is compelling for such an ABI because out-of-band schedulers
> might not give the desired results for low-end devices.
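
For illustration only, the "very simple to start with" layout described above
could look something like the sketch below. The struct and field names are
hypothetical and not taken from the RFC; the only structure assumed is the
split into a guest-written area and a host-written area, with the
interpretation of the fields left to the guest/host protocol.

#include <linux/types.h>

/*
 * Hypothetical sketch of the shared page described above, not the RFC's
 * actual layout.  The guest publishes what it wants in its own area, the
 * host reports what it actually did in the other, and the meaning of the
 * fields is left to whatever protocol guest and host agree on.
 */
struct pv_sched_guest_area {
        __u32 seq;              /* bumped by the guest on every update */
        __u32 boost_req;        /* requested boost level, guest-defined */
        __u32 reason;           /* e.g. IRQ pending, lock held, RT task */
        __u32 flags;
};

struct pv_sched_host_area {
        __u32 seq;              /* bumped by the host when it acts */
        __u32 action;           /* boosted / unboosted / ignored */
        __u32 policy;           /* sched policy granted (SCHED_FIFO, ...) */
        __u32 priority;         /* priority granted */
};

struct pv_sched_shared_page {
        struct pv_sched_guest_area guest;       /* guest writes, host reads */
        struct pv_sched_host_area  host;        /* host writes, guest reads */
};

A real ABI would of course also need to pin down versioning, memory ordering
and how the page gets registered; the sketch leaves all of that out.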

> > > Having a formal paravirt scheduling ABI is something we would want to
> > > pursue (as I mentioned in the cover letter) and this could help not
> > > only with latencies, but optimal task placement for efficiency, power
> > > utilization etc. kvm's role could be to set the stage and share
> > > information with minimum delay and less resource overhead.
> >
> > Making KVM middle-man is most definitely not going to provide minimum delay or
> > overhead. Minimum delay would be the guest directly communicating with the host
> > scheduler. I get that convincing the sched folks to add a bunch of paravirt
> > stuff is a tall order (for very good reason), but that's exactly why I Cc'd the
> > sched_ext folks.
> >
> As mentioned above, guest directly talking to host scheduler without
> involving kvm would mean an out-of-band scheduler and the
> effectiveness depends on how fast the scheduler gets to run.

No, the "host scheduler" could very well be a dedicated in-kernel paravirt
scheduler. It could be a sched_ext BPF program that for all intents and
purposes is in-band.

You are basically proposing that KVM bounce-buffer data between guest and host.
I'm saying there's no _technical_ reason to use a bounce-buffer, just do zero
copy.

> In lowend compute devices, that would pose a challenge. In such scenarios, kvm
> seems to be a better option to provide minimum delay and cpu overhead.
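
As a rough illustration of the zero-copy point, combined with the "scheduler
callback is basically sched_setscheduler" description earlier in the thread,
the host-side action could reduce to something like the sketch below. This is
hypothetical and not the RFC's code: pv_sched_apply_request() and the
pv_sched_shared_page layout come from the earlier sketch, and locking, memory
ordering and validation of guest-supplied values are omitted.

#include <linux/sched.h>
#include <uapi/linux/sched/types.h>

/*
 * Hypothetical sketch, not the RFC's code: whoever owns the policy (KVM,
 * core sched, a sched_ext scheduler) reads the guest-written area in
 * place, with no bounce buffer, and adjusts the vCPU task accordingly.
 * Locking, memory ordering and sanity checks on guest-supplied values
 * are omitted.
 */
static void pv_sched_apply_request(struct task_struct *vcpu_task,
                                   struct pv_sched_shared_page *sp)
{
        struct sched_param param = { .sched_priority = 0 };

        if (sp->guest.boost_req) {
                /* Boost: run the vCPU as a low-priority RT task. */
                param.sched_priority = 1;
                sched_setscheduler_nocheck(vcpu_task, SCHED_FIFO, &param);
                sp->host.action = 1;            /* boosted */
        } else {
                /* Unboost: put the vCPU back in the fair class. */
                sched_setscheduler_nocheck(vcpu_task, SCHED_NORMAL, &param);
                sp->host.action = 0;            /* unboosted */
        }
        sp->host.policy = vcpu_task->policy;
        sp->host.seq++;
}

Whether something like this lives in KVM, in core sched code, or behind a
sched_ext scheduler is exactly the question being debated above.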