Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

Vineeth Remanan Pillai <vineeth@xxxxxxxxxxxxxxx> · Thu, 14 Dec 2023 16:36:39 -0500

On Thu, Dec 14, 2023 at 3:13 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Thu, Dec 14, 2023, Vineeth Remanan Pillai wrote:
> > On Thu, Dec 14, 2023 at 11:38 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > Now when I think about it, the implementation seems to
> > suggest that we are putting policies in kvm. Ideally, the goal is:
> > - guest scheduler communicates the priority requirements of the workload
> > - kvm applies the priority to the vcpu task.
>
> Why?  Tasks are tasks, why does KVM need to get involved?  E.g. if the problem
> is that userspace doesn't have the right knobs to adjust the priority of a task
> quickly and efficiently, then wouldn't it be better to solve that problem in a
> generic way?
>
I get your point. A generic way would have been more preferable, but I
feel the scenario we are tackling is a bit more time critical and kvm
is better equipped to handle this. kvm has control over the VM/vcpu
execution and hence it can take action in the most effective way.

One example is the place where we handle boost/unboost. By the time
you come out of kvm to userspace it would be too late. Currently we
apply the boost soon after VMEXIT before enabling preemption so that
the next scheduler entry will consider the boosted priority. As soon
as you enable preemption, the vcpu could be preempted and boosting
would not help when it is boosted. This timing correctness is very
difficult to achieve if we try to do it in userland or do it
out-of-band.

[...snip...]
> > > Lastly, if the concern/argument is that userspace doesn't have the right knobs
> > > to (quickly) boost vCPU tasks, then the proposed sched_ext functionality seems
> > > tailor made for the problems you are trying to solve.
> > >
> > > https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org
> > >
> > You are right, sched_ext is a good choice to have policies
> > implemented. In our case, we would need a communication mechanism as
> > well and hence we thought kvm would work best to be a medium between
> > the guest and the host.
>
> Making KVM be the medium may be convenient and the quickest way to get a PoC
> out the door, but effectively making KVM a middle-man is going to be a huge net
> negative in the long term.  Userspace can communicate with the guest just as
> easily as KVM, and if you make KVM the middle-man, then you effectively *must*
> define a relatively rigid guest/host ABI.
>
> If instead the contract is between host userspace and the guest, the ABI can be
> much more fluid, e.g. if you (or any setup) can control at least some amount of
> code that runs in the guest, then the contract between the guest and host doesn't
> even need to be formally defined, it could simply be a matter of bundling host
> and guest code appropriately.
>
> If you want to land support for a given contract in upstream repositories, e.g.
> to broadly enable paravirt scheduling support across a variety of usersepace VMMs
> and/or guests, then yeah, you'll need a formal ABI.  But that's still not a good
> reason to have KVM define the ABI.  Doing it in KVM might be a wee bit easier because
> it's largely just a matter of writing code, and LKML provides a centralized channel
> for getting buyin from all parties.  But defining an ABI that's independent of the
> kernel is absolutely doable, e.g. see the many virtio specs.
>
> I'm not saying KVM can't help, e.g. if there is information that is known only
> to KVM, but the vast majority of the contract doesn't need to be defined by KVM.
>
As you mentioned, custom contract between guest and host userspace is
really flexible, but I believe tackling scheduling(especially latency)
issues is a bit more difficult with generic approaches. Here kvm does
have some information known only to kvm(which could be shared - eg:
interrupt injection) but more importantly kvm has some unique
capabilities when it comes to scheduling. kvm and scheduler are
cooperating currently for various cases like, steal time accounting,
vcpu preemption state, spinlock handling etc. We could possibly try to
extend it a little further in a non-intrusive way.

Having a formal paravirt scheduling ABI is something we would want to
pursue (as I mentioned in the cover letter) and this could help not
only with latencies, but optimal task placement for efficiency, power
utilization etc. kvm's role could be to set the stage and share
information with minimum delay and less resource overhead. We could
use schedulers (vanilla, sched_ext, ...) to actually make decisions
based on the information it receives.

Thanks for all your valuable inputs and I understand that a formal ABI
is needed for the above interface. We shall look more into the
feasibility and efforts for this.

Thanks,
Vineeth