Re: [RFC PATCH 0/6] Improve VM DVFS and task placement behavior

David Dai <davidai@xxxxxxxxxx> · Thu, 6 Apr 2023 14:39:07 -0700

On Thu, Apr 6, 2023 at 5:52 AM Quentin Perret <qperret@xxxxxxxxxx> wrote:
>
> On Wednesday 05 Apr 2023 at 14:07:18 (-0700), Saravana Kannan wrote:
> > On Wed, Apr 5, 2023 at 12:48 AM 'Quentin Perret' via kernel-team
> > > And I concur with all the above as well. Putting this in the kernel is
> > > not an obvious fit at all as that requires a number of assumptions about
> > > the VMM.
> > >
> > > As Oliver pointed out, the guest topology, and how it maps to the host
> > > topology (vcpu pinning etc) is very much a VMM policy decision and will
> > > be particularly important to handle guest frequency requests correctly.
> > >
> > > In addition to that, the VMM's software architecture may have an impact.
> > > Crosvm for example does device emulation in separate processes for
> > > security reasons, so it is likely that adjusting the scheduling
> > > parameters ('util_guest', uclamp, or else) only for the vCPU thread that
> > > issues frequency requests will be sub-optimal for performance, we may
> > > want to adjust those parameters for all the tasks that are on the
> > > critical path.
> > >
> > > And at an even higher level, assuming in the kernel a certain mapping of
> > > vCPU threads to host threads feels kinda wrong, this too is a host
> > > userspace policy decision I believe. Not that anybody in their right
> > > mind would want to do this, but I _think_ it would technically be
> > > feasible to serialize the execution of multiple vCPUs on the same host
> > > thread, at which point the util_guest thingy becomes entirely bogus. (I
> > > obviously don't want to conflate this use-case, it's just an example
> > > that shows the proposed abstraction in the series is not a perfect fit
> > > for the KVM userspace delegation model.)
> >
> > See my reply to Oliver and Marc. To me it looks like we are converging
> > towards having shared memory between guest, host kernel and VMM and
> > that should address all our concerns.
>
> Hmm, that is not at all my understanding of what has been the most
> important part of the feedback so far: this whole thing belongs to
> userspace.
>
> > The guest will see a MMIO device, writing to it will trigger the host
> > kernel to do the basic "set util_guest/uclamp for the vCPU thread that
> > corresponds to the vCPU" and then the VMM can do more on top as/if
> > needed (because it has access to the shared memory too). Does that
> > make sense?
>
> Not really no. I've given examples of why this doesn't make sense for
> the kernel to do this, which still seems to be the case with what you're
> suggesting here.
>
> > Even in the extreme example, the stuff the kernel would do would still
> > be helpful, but not sufficient. You can aggregate the
> > util_guest/uclamp and do whatever from the VMM.
> > Technically in the extreme example, you don't need any of this. The
> > normal util tracking of the vCPU thread on the host side would be
> > sufficient.
> >
> > Actually any time we have only 1 vCPU host thread per VM, we shouldn't
> > be using anything in this patch series and not instantiate the guest
> > device at all.
>
> > > So +1 from me to move this as a virtual device of some kind. And if the
> > > extra cost of exiting all the way back to userspace is prohibitive (is
> > > it btw?),
> >
> > I think the "13% increase in battery consumption for games" makes it
> > pretty clear that going to userspace is prohibitive. And that's just
> > one example.
>

Hi Quentin,

I appreciate the feedback.

> I beg to differ. We need to understand where these 13% come from in more
> details. Is it really the actual cost of the userspace exit? Or is it
> just that from userspace the only knob you can play with is uclamp and
> that didn't reach the expected level of performance?

To clarify, the MMIO numbers in the cover letter were collected while
updating the vCPU task's util_guest rather than uclamp_min. In that
configuration, userspace (the VMM) handles the MMIO exit from the guest
and issues an ioctl to the host kernel to update util_guest for the
vCPU task.
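
For reference, the VMM side of that flow looks roughly like the sketch
below: decode the guest's MMIO write out of struct kvm_run, then update
the host. This is only an illustration. The register address
VFREQ_MMIO_BASE, the 4-byte register layout and the function name
decode_freq_request() are made up for this example and are not taken
from the series; only struct kvm_run and KVM_EXIT_MMIO come from the KVM
UAPI. The util_guest ioctl itself is left out.

```c
#include <assert.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical guest-physical address of the frequency-request device. */
#define VFREQ_MMIO_BASE     0x0a000000ULL
/* Hypothetical register offset: guest writes the requested freq in kHz. */
#define VFREQ_REG_SET_FREQ  0x0

/* The 32-bit register value must fit in KVM's MMIO data buffer. */
static_assert(sizeof(uint32_t) <= sizeof(((struct kvm_run *)0)->mmio.data),
              "register wider than the kvm_run MMIO data buffer");

/*
 * Inspect the exit recorded in 'run' after KVM_RUN returns.
 * Returns 1 and stores the requested frequency in *freq_khz if the exit
 * is a 4-byte write to the (hypothetical) frequency register, else 0.
 */
int decode_freq_request(const struct kvm_run *run, uint32_t *freq_khz)
{
    if (run->exit_reason != KVM_EXIT_MMIO)
        return 0;
    if (!run->mmio.is_write || run->mmio.len != sizeof(*freq_khz))
        return 0;
    if (run->mmio.phys_addr != VFREQ_MMIO_BASE + VFREQ_REG_SET_FREQ)
        return 0;

    /* mmio.data holds the bytes the guest stored, in guest byte order. */
    memcpy(freq_khz, run->mmio.data, sizeof(*freq_khz));
    return 1;
}
```

When decode_freq_request() returns 1, the VMM would translate the
request into a host-side update for the vCPU task before re-entering
the guest.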

>
> If that is the userspace exit, then we can work to optimize that -- it's
> a fairly common problem in the virt world, nothing special here.
>

OK, we're open to suggestions on how to optimize this.

> And if the issue is the lack of expressiveness in uclamp, then that too
> is something we should work on, but clearly giving vCPU threads more
> 'power' than normal host threads is a bit of a red flag IMO. vCPU
> threads must be constrained in the same way that userspace threads are,
> because they _are_ userspace threads.
>
> > > then we can try to work on that. Maybe something a la vhost
> > > can be done to optimize, I'll have a think.
> > >
> > > > The one thing I'd like to understand that the comment seems to imply
> > > > that there is a significant difference in overhead between a hypercall
> > > > and an MMIO. In my experience, both are pretty similar in cost for a
> > > > handling location (both in userspace or both in the kernel). MMIO
> > > > handling is a tiny bit more expensive due to a guaranteed TLB miss
> > > > followed by a walk of the in-kernel device ranges, but that's all. It
> > > > should hardly register.
> > > >
> > > > And if you really want some super-low latency, low overhead
> > > > signalling, maybe an exception is the wrong tool for the job. Shared
> > > > memory communication could be more appropriate.
> > >
> > > I presume some kind of signalling mechanism will be necessary to
> > > synchronously update host scheduling parameters in response to guest
> > > frequency requests, but if the volume of data requires it then a shared
> > > buffer + doorbell type of approach should do.
> >
> > Part of the communication doesn't need synchronous handling by the
> > host. So, what I said above.
>
> I've also replied to another message about the scale invariance issue,
> and I'm not convinced the frequency based interface proposed here really
> makes sense. An AMU-like interface is very likely to be superior.
>

Some sort of AMU-based interface was discussed offline with Saravana,
but I'm not sure how to best implement that. If you have any pointers
to get started, that would be helpful.

> > > Thinking about it, using SCMI over virtio would implement exactly that.
> > > Linux-as-a-guest already supports it IIRC, so possibly the problem
> > > being addressed in this series could be 'simply' solved using an SCMI
> > > backend in the VMM...
> >
> > This will be worse than all the options we've tried so far because it
> > has the userspace overhead AND uclamp overhead.
>
> But it doesn't violate the whole KVM userspace delegation model, so we
> should start from there and then optimize further if need be.

Do you have any references we could use to start experimenting with
SCMI (e.g. SCMI backend support in CrosVM)?

For RFC v3, I'll post a CPUfreq driver implementation that uses only
MMIO, with no host kernel modifications (i.e. using only uclamp as a
knob to tune the host), along with performance numbers, and then work
on optimizing from there.
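
To make the uclamp-only host side concrete, here is a rough sketch of
what the VMM could do with sched_setattr(2) for a vCPU thread. It is
not the planned implementation. The names freq_to_util_min() and
set_util_min(), the local struct vm_sched_attr, and the linear
frequency-to-utilization mapping are assumptions made for this example;
the flag values mirror the kernel's SCHED_FLAG_KEEP_ALL and
SCHED_FLAG_UTIL_CLAMP_MIN.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Local copy of the kernel's sched_attr layout, including uclamp fields. */
struct vm_sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;
    uint64_t sched_deadline;
    uint64_t sched_period;
    uint32_t sched_util_min;
    uint32_t sched_util_max;
};
static_assert(sizeof(struct vm_sched_attr) == 56, "unexpected layout");

#define VM_SCHED_FLAG_KEEP_ALL        0x18 /* KEEP_POLICY | KEEP_PARAMS */
#define VM_SCHED_FLAG_UTIL_CLAMP_MIN  0x20
#define VM_CAPACITY_SCALE             1024u

/* Assumed linear mapping of a guest frequency request onto [0, 1024]. */
uint32_t freq_to_util_min(uint32_t freq_khz, uint32_t max_freq_khz)
{
    uint64_t util;

    if (max_freq_khz == 0)
        return 0;
    util = (uint64_t)freq_khz * VM_CAPACITY_SCALE / max_freq_khz;
    return util > VM_CAPACITY_SCALE ? VM_CAPACITY_SCALE : (uint32_t)util;
}

/* Set uclamp min for thread 'tid', leaving policy and params untouched. */
int set_util_min(pid_t tid, uint32_t util_min)
{
    struct vm_sched_attr attr;

    if (util_min > VM_CAPACITY_SCALE) {
        errno = EINVAL;
        return -1;
    }
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.sched_flags = VM_SCHED_FLAG_KEEP_ALL | VM_SCHED_FLAG_UTIL_CLAMP_MIN;
    attr.sched_util_min = util_min;
    return (int)syscall(SYS_sched_setattr, tid, &attr, 0);
}
```

The VMM would call set_util_min(vcpu_tid, freq_to_util_min(req, max))
from its MMIO handler. The call fails on hosts built without
CONFIG_UCLAMP_TASK, so the caller needs to check the return value.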

Thanks,
David

>
> Thanks,
> Quentin