Re: [RFC] KVM: x86: Allow userspace exit on HLT and MWAIT, else yield on MWAIT

Alexander Graf <graf@xxxxxxxxx> · Mon, 18 Sep 2023 13:59:50 +0200

On 18.09.23 13:10, David Woodhouse wrote:
On Mon, 2023-09-18 at 11:41 +0200, Alexander Graf wrote:
IIUC you want to do work in a user space vCPU thread when the guest vCPU
is idle. As you pointed out above, KVM cannot actually do much about
MWAIT: It basically busy loops and hogs the CPU.
Well.. I suspect what I *really* want is a decent way to emulate MWAIT
properly and let it actually sleep. Or failing that, to declare that we
can actually change the guest-visible experience when those guests are
migrated to KVM, and take away MWAIT completely.

The typical flow I would expect for "work in a vCPU thread" is:

0) vCPU runs. HLT/MWAIT is directly exposed to guest.
1) vCPU exits. Creates deferred work. Enables HLT/MWAIT trapping.
That can happen, but it may also be a separate I/O thread which
receives an eventfd notification and finds that there is now work to be
done. If that work can be fairly much instantaneous, it can be done
immediately. Else it gets deferred to what we Linux hackers might think
of as a workqueue.

If all the vCPUs are in HLT when the work queue becomes non-empty, we'd
need to prod them *all* to change their exit-on-{HLT,MWAIT} status when
work becomes available, just in case one of them becomes idle and can
process the work "for free" using idle cycles.

2) vCPU runs again
3) vCPU calls HLT/MWAIT. We exit to user space to finish work from 1
4) vCPU runs again without HLT/MWAIT trapping

That means on top (or instead?) of the bits you have below that indicate
"Should I exit to user space?", what you really need are bits that do
what enable_cap(KVM_CAP_X86_DISABLE_EXITS) does in light-weight: Disable
HLT/MWAIT trapping temporarily.
If I do it that way, yes. A lightweight way to enable/disable the exits
even to kernel would be a nice to have. But it's a trade-off. For HLT
you'd get lower latency re-entering the vCPU at a cost of much higher
latency processing work if the vCPU was *already* in HLT.
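
For illustration only, the 0) to 4) flow above can be modelled as a tiny
state machine. This is a toy sketch, not KVM code: the VCpu class, its
trap_idle flag and the defer()/guest_idle() methods are hypothetical
names, and the "lightweight toggle" they stand for does not exist as a
kernel interface in this thread.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VCpu:
    """Toy model of one vCPU thread; hypothetical, not a KVM API."""
    trap_idle: bool = False  # False: HLT/MWAIT run natively in the guest
    deferred: List[Callable[[], None]] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def defer(self, work: Callable[[], None]) -> None:
        # Step 1: an exit creates deferred work and enables trapping.
        self.deferred.append(work)
        self.trap_idle = True
        self.log.append("trap on")

    def guest_idle(self) -> None:
        # The guest executes HLT/MWAIT.
        if not self.trap_idle:
            # Step 0: idle instruction is handled without a userspace exit.
            self.log.append("idle in guest")
            return
        # Step 3: exit to user space and finish the work from step 1.
        self.log.append("exit to user space")
        while self.deferred:
            self.deferred.pop(0)()
        # Step 4: run again without HLT/MWAIT trapping.
        self.trap_idle = False
        self.log.append("trap off")
```

Note that this model only toggles on the vCPU that created the work. The
case raised above, where the work arrives from an I/O thread while every
vCPU is already in HLT, would need all vCPUs to be prodded and is not
covered here.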

We probably would want to stop burning power in the MWAIT loop though,
and let the pCPU sit in the guest in MWAIT if there really is nothing
else to do.

We're experimenting with various permutations.

Also, please keep in mind that you still would need a fallback mechanism
to run your "deferred work" even when the guest does not call HLT/MWAIT,
like a regular timer in your main thread.
Yeah. In that case I think the ideal answer is that we let the kernel
scheduler sort it out. I was thinking of a model where we have I/O (or
workqueue) threads in *addition* to the userspace exits on idle. The
separate threads own the work (and a number of them are woken according
to the queue depth), and idle vCPUs *opportunistically* process work
items on top of that.
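
A minimal sketch of that split, assuming a single shared queue: a
dedicated thread owns the work and blocks on it, while a vCPU thread
that has just taken a userspace exit on idle grabs a bounded number of
items without ever blocking. The names io_worker and idle_vcpu_steal are
made up for this example, and waking a number of threads according to
queue depth is not modelled.

```python
import queue
import threading


def io_worker(q: queue.Queue, stop: threading.Event) -> None:
    """Dedicated I/O thread: owns the queue and sleeps until work arrives."""
    while not stop.is_set():
        try:
            item = q.get(timeout=0.05)
        except queue.Empty:
            continue
        item()
        q.task_done()


def idle_vcpu_steal(q: queue.Queue, max_items: int = 1) -> int:
    """Called by a vCPU thread after an exit on HLT/MWAIT.

    Never blocks: if the queue is empty the vCPU goes straight back
    into the guest. Returns the number of items processed.
    """
    done = 0
    while done < max_items:
        try:
            item = q.get_nowait()
        except queue.Empty:
            break
        item()
        q.task_done()
        done += 1
    return done
```

Because the worker thread exists regardless, the work still makes
progress when no vCPU ever goes idle; the idle vCPUs are purely an
optimisation on top.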

That approach alone would work fine with the existing HLT scheduling;
it's just MWAIT which is a pain because yield() doesn't really do much
(but as noted, it's better than *nothing*).

On top of all this, I'm not sure it's more efficient to do the trap to
the vCPU thread compared to just creating a separate real thread. Your
main problem is the emulatability of MWAIT because that leaves "no time"
to do deferred work. But then again, if your deferred work is so complex
that it needs more than a few ms (which you can always steal from the
vCPU thread, especially with yield()), you'll need to start implementing
time slicing of that work in user space next - and basically rebuild
your own scheduler there. Ugh.
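
To make the "few ms" point concrete, here is a rough sketch of running
deferred work on a vCPU thread under a time budget. run_with_budget and
its parameters are hypothetical names for this example; the clock
argument is only there so the behaviour can be checked without real
delays.

```python
import time
from collections import deque
from typing import Callable, Deque


def run_with_budget(
    pending: Deque[Callable[[], None]],
    budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Run queued items until the queue is empty or the budget is spent.

    The deadline is only checked *between* items. An item that runs long
    cannot be preempted here, which is exactly where user space would
    have to start slicing the work itself.
    """
    deadline = clock() + budget_s
    done = 0
    while pending and clock() < deadline:
        pending.popleft()()
        done += 1
    return done
```

Anything left in the queue when the budget runs out has to be picked up
later, either on the next idle exit or by a separate thread.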

IMHO the real core value of this idea would be in a vcpu_run bit that on
VCPU_RUN can toggle between HLT/MWAIT intercept on and off. The actual
trap to user space, you're most likely better off with a separate thread.
No, that's very much not the point. The problem is that yield() doesn't
work well enough — and isn't designed or guaranteed to do anything in
particular for most cases. It's better than *nothing* but we want the
opportunity to do the actual work right there in the *loop* of the
guest bouncing through MWAIT.

The problem with MWAIT is that you don't really know when it's done.

You could find out by making MONITOR'ed pages(!) read-only so you can 
wake up any target vCPU that's in MWAIT, but that's considerably 
expensive if you want to do it well.
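
As a toy model of the bookkeeping that approach implies (not of any
existing KVM mechanism): track which guest pages have a vCPU waiting in
MWAIT, treat those pages as write-protected, and wake the waiters on the
first write fault. MonitorTracker, its methods and the 4 KiB page size
are assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Set

PAGE_SHIFT = 12  # assumes 4 KiB pages


class MonitorTracker:
    """Hypothetical tracker of MONITOR'ed pages and their waiting vCPUs."""

    def __init__(self) -> None:
        # page frame number -> ids of vCPUs waiting in MWAIT on that page
        self.waiters: Dict[int, Set[int]] = defaultdict(set)

    def mwait(self, vcpu_id: int, gpa: int) -> None:
        """vCPU armed MONITOR on gpa and entered MWAIT: protect the page."""
        self.waiters[gpa >> PAGE_SHIFT].add(vcpu_id)

    def is_write_protected(self, gpa: int) -> bool:
        return (gpa >> PAGE_SHIFT) in self.waiters

    def on_write_fault(self, gpa: int) -> List[int]:
        """A write hit a protected page: unprotect it, return vCPUs to wake."""
        return sorted(self.waiters.pop(gpa >> PAGE_SHIFT, set()))
```

The model works at page granularity while MONITOR watches a much smaller
range, so any write to the same page wakes the waiter. That should be
harmless for correctness as long as the guest rechecks its condition
after MWAIT, but every such write costs a fault, which is part of why
this is expensive.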

You could also burn one VM/system-wide CPU that does nothing but wait
for changes in any MONITOR'ed cache line. Doable with less power 
consumption if you use TSX I guess. But probably not what you want either.

Another alternative would be to make guests PV aware, so they understand 
you don't actually do MWAIT and give you a hypercall every time they 
modify whatever anyone would want to monitor (such as 
thread_info->flags). But that requires new guest kernels. I don't think 
you want to wait for that :).

So in a nutshell, emulating MWAIT properly is just super difficult. If 
you have even the remotest chance to get away with doing HLT instead, 
I'd take that. In that model, an I/O thread that schedules over idle 
threads becomes natural.

Alex
