Re: [PATCH v4] kvm: better MWAIT emulation for guests

"Gabriel L. Somlo" <gsomlo@xxxxxxxxx> · Wed, 15 Mar 2017 19:22:56 -0400

On Wed, Mar 15, 2017 at 09:46:18PM +0100, Radim Krčmář wrote:
> 2017-03-15 16:21-0400, Gabriel L. Somlo:
> > On Wed, Mar 15, 2017 at 09:13:49PM +0100, Radim Krčmář wrote:
> >> 2017-03-15 21:28+0200, Michael S. Tsirkin:
> >> > Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
> >> > unless explicitly provided with kernel command line argument
> >> > "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
> >> > without checking CPUID.
> >> > 
> >> > We currently emulate that as a NOP but on VMX we can do better: let
> >> > guest stop the CPU until timer, IPI or memory change.  CPU will be busy
> >> > but that isn't any worse than a NOP emulation.
> >> > 
> >> > Note that mwait within guests is not the same as on real hardware
> >> > because halt causes an exit while mwait doesn't.  For this reason it
> >> > might not be a good idea to use the regular MWAIT flag in CPUID to
> >> > signal this capability.  Add a flag in the hypervisor leaf instead.
> >> > 
> >> > Additionally, we add a capability for QEMU - e.g. if it knows there's an
> >> > isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
> >> > to improve guest behaviour.
> >> > 
> >> > Reported-by: "Gabriel L. Somlo" <gsomlo@xxxxxxxxx>
> >> > Signed-off-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
> >> > ---
> >> > 
> >> > Note: SVM bits are untested at this point. Seems pretty
> >> > obvious though.
> >> > 
> >> > changes from v3:
> >> > - don't enable capability if cli+mwait blocks interrupts
> >> > - doc typo fixes (drop drop ppc doc)
> >> > 
> >> > changes from v2:
> >> > - add a capability to allow host userspace to detect new kernels
> >> > - more documentation to clarify the semantics of the feature flag
> >> >   and why it's useful
> >> > - svm support as suggested by Radim
> >> > 
> >> > changes from v1:
> >> > - typo fix resulting in rest of leaf flags being overwritten
> >> >   Reported by: Wanpeng Li <kernellwp@xxxxxxxxx>
> >> > - updated commit log with data about guests helped by this feature
> >> > - better document differences between mwait and halt for guests
> >> > 
> >> > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> >> > @@ -212,4 +213,28 @@ static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
> >> >  	    __rem;						\
> >> >  	 })
> >> >  
> >> > +static bool kvm_mwait_in_guest(void)
> >> > +{
> >> > +	unsigned int eax, ebx, ecx;
> >> > +
> >> > +	if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
> >> > +		return -ENODEV;
> >> > +
> >> > +	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> >> > +		return -ENODEV;
> >> > +
> >> > +	/*
> >> > +	 * Intel CPUs without CPUID5_ECX_INTERRUPT_BREAK are problematic as
> >> > +	 * they would allow guest to stop the CPU completely by disabling
> >> > +	 * interrupts then invoking MWAIT.
> >> > +	 */
> >> > +	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
> >> > +		return -ENODEV;
> >> > +
> >> > +	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);
> >> > +
> >> > +	if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
> >> > +		return -ENODEV;
> >> 
> >> The guest is still able to set ecx=0 with MWAIT, which should be the
> >> same as not having the CPUID flag, so I'm wondering how this check
> >> prevents anything harmful ... is it really a cpu "feature"?
> >> 
> >> If we somehow report ecx bit 1 in CPUID[5], then the guest might try to
> >> set ecx bit 0 for MWAIT, which will cause #GP(0) and could explain the
> >> hang that Gabriel is hitting.
> >> 
> >> Gabriel,
> >> 
> >>  - do you see VM exits on the "hung" VCPU?
> > 
> > how would I go about looking ?
> 
> Probably the simplest would be to try install trace-cmd and do
> 
>   trace-cmd record -e kvm:kvm_entry
> 
> for a while, then killing it with ^C and calling
> 
>   trace-cmd report
> 
> it will have lines like:
> 
>   CPU-3729  [021]  5046.222480: kvm_entry:            vcpu 7
> 
> the first column is task id, the last one is VCPU id, so you'd verify that all
> VCPUs eventually enter.
> You could either look at the taks id of VCPUs that are running 100% of the time
> or just pipe through `grep -o 'vcpu.*' | sort -u` and hope that all of them are
> running, so you'd see nice list of them all. :)

so, with Michael's patch v5 (same behavior as earlier btw), I get:

# trace-cmd report | head -10000 | grep -o 'vcpu.*' | sort -u
vcpu 0
vcpu 1
vcpu 2
vcpu 3

Which means they're all there.

Also, 'top' shows 400% CPU for qemu-system-x86_64, which means I have
4 threads idle-looping at 100% each, which is consistent with a guest
using an MWAIT-based idle loop.

> If you don't see some VCPUs there, they might have been halted, which would
> show on
> 
>   virsh qemu-monitor-command rhel7 --hmp info cpus
> 
> If there are running, but never entering VCPUs, then running
> 
>   trace-cmd record -e kvm:\*
> 
> to capture all kvm events could tell us why it is stuck.
> 
> >>  - what is your CPU model?
> > 
> > $ cat /proc/cpuinfo
> > model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
> > 
> > (this is 2x dual-core Xeon on a Mac Pro 1,1 -- all I had to spare for
> > testing, to avoid having to reboot my primary desktop :)
> 
> Oh, ancient when it comes to VMX. :)

I also have a somewhat more recent (only 4 year old) Macbook Air, I'll
try to redo the whole set of tests on that one as well.

> 
> >>  - what do you get after running this C program on host and guest?
> > 
> > eax=0x000040 ebx=0x000040 ecx=0x000003 edx=0x000020
> 
> Your CPU has the CPUID5_ECX_INTERRUPT_BREAK, thanks.
> The guest should see QEMU's default, which is
> 
>   eax=0x000000 ebx=0x000000 ecx=0x000003 edx=0x000000

That's right, I finally managed to transfer a linux guest image over
to the box, and indeed that's what it produces.

More tomorrow once I have a chance to test on the Macbook Air (need to
compile kvm git master on that one first, then transfer the os x qcow
image, etc...

Thanks,
--Gabriel
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html