Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

"Gabriel L. Somlo" <gsomlo@xxxxxxxxx> · Thu, 16 Mar 2017 09:24:27 -0400

On Thu, Mar 16, 2017 at 01:41:28AM +0200, Michael S. Tsirkin wrote:
> On Wed, Mar 15, 2017 at 07:35:34PM -0400, Gabriel L. Somlo wrote:
> > On Wed, Mar 15, 2017 at 11:22:18PM +0200, Michael S. Tsirkin wrote:
> > > Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
> > > unless explicitly provided with kernel command line argument
> > > "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
> > > without checking CPUID.
> > > 
> > > We currently emulate that as a NOP but on VMX we can do better: let
> > > guest stop the CPU until timer, IPI or memory change.  CPU will be busy
> > > but that isn't any worse than a NOP emulation.
> > > 
> > > Note that mwait within guests is not the same as on real hardware
> > > because halt causes an exit while mwait doesn't.  For this reason it
> > > might not be a good idea to use the regular MWAIT flag in CPUID to
> > > signal this capability.  Add a flag in the hypervisor leaf instead.
> > > 
> > > Additionally, we add a capability for QEMU - e.g. if it knows there's an
> > > isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
> > > to improve guest behaviour.
> > 
> > Same behavior (on the mac pro 1,1 running F22 with custom-compiled
> > kernel from kvm git master, plus this patch on top).
> > 
> > The OS X 10.7 kernel hangs (or at least progresses extremely slowly)
> > on boot, does not bring up guest graphical interface within the first
> > 10 minutes that I waited for it. That, in contrast with the default
> > nop-based emulation where the guest comes up within 30 seconds.
> 
> 
> Thanks a lot, meanwhile I'll try to write a unit-test and experiment
> with various behaviours.
> 
> > I will run another round of tests on a newer Mac (4-year-old macbook
> > air) and report back tomorrow.
> > 
> > Going off on a tangent, why would encouraging otherwise well-behaved
> > guests (like linux ones, for example) to use MWAIT be desirable to
> > begin with ? Is it a matter of minimizing the overhead associated with
> > exiting and re-entering L1 ? Because if so, AFAIR staying inside L1 and
> > running guest-mode MWAIT in a tight loop will actually waste the host
> > CPU without the opportunity to yield to some other L0 thread. Sorry if
> > I fell into the middle of an ongoing conversation on this and missed
> > most of the relevant context, in which case please feel free to ignore
> > me... :)
> > 
> > Thanks,
> > --G
> 
> It's just some experiments I'm running, I'm not ready to describe it
> yet. I thought this part might be useful to at least some guests, so
> trying to upstream it right now.

OK, so on a macbook air running F25 and the latest kvm git master plus
your v5 patch (4.11.0-rc2+), things appear to work.

host-side cpuid output:
eax=0x000040 ebx=0x000040 ecx=0x000003 edx=0x021120

guest-side cpuid output:
eax=00000000 ebx=00000000 ecx=0x000003 edx=00000000

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 42
model name	: Intel(R) Core(TM) i7-2677M CPU @ 1.80GHz
stepping	: 7
microcode	: 0x29
cpu MHz		: 1157.849
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 2
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts
bugs		:
bogomips	: 3604.68
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

After studying your patch a bit more carefully (sorry, it's crazy
around here right now :) ) I realized you're simply trying to
(selectively) decide when to exit L1 and emulate as NOP vs. when to
just allow L1 to execute MONITOR & MWAIT natively.

Is that right ? Because if so, the issues I saw on my MacPro1,1 are
weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
natively was one of the options Alex Graf and Rene Rebe used back in
the very early days of OS X on QEMU, at the time I got involved with
that project. Here's part of an out of tree patch against 3.4 which did
just that, and worked as far as I remember on *any* MWAIT capable
intel chip I had access to back in 2010:

##############################################################################
# 99-mwait.patch.kvm-kmod (Rene Rebe <rene@xxxxxxxxxxxx>) 2010-04-27
##############################################################################
diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c

--- linux-3.4/arch/x86/kvm/cpuid.c	2012-05-20 18:29:13.000000000 -0400
+++ linux-3.4-mac/arch/x86/kvm/cpuid.c	2012-10-09 11:42:59.921215750 -0400
@@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
 		f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
 		F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
 		0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
 	/* cpuid 1.ecx */
 	const u32 kvm_supported_word4_x86_features =
-		F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
+		F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
 		0 /* DS-CPL, VMX, SMX, EST */ |
 		0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
 		F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
 		0 /* Reserved, DCA */ | F(XMM4_1) |
 		F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
--- linux-3.4/arch/x86/kvm/svm.c	2012-05-20 18:29:13.000000000 -0400
+++ linux-3.4-mac/arch/x86/kvm/svm.c	2012-10-09 11:44:41.598997481 -0400
@@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
 	set_intercept(svm, INTERCEPT_VMSAVE);
 	set_intercept(svm, INTERCEPT_STGI);
 	set_intercept(svm, INTERCEPT_CLGI);
 	set_intercept(svm, INTERCEPT_SKINIT);
 	set_intercept(svm, INTERCEPT_WBINVD);
-	set_intercept(svm, INTERCEPT_MONITOR);
-	set_intercept(svm, INTERCEPT_MWAIT);
 	set_intercept(svm, INTERCEPT_XSETBV);
 
 	control->iopm_base_pa = iopm_base;
 	control->msrpm_base_pa = __pa(svm->msrpm);
 	control->int_ctl = V_INTR_MASKING_MASK;
diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
--- linux-3.4/arch/x86/kvm/vmx.c	2012-05-20 18:29:13.000000000 -0400
+++ linux-3.4-mac/arch/x86/kvm/vmx.c	2012-10-09 11:42:59.925215977 -0400
@@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
 		nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
 	nested_vmx_procbased_ctls_low = 0;
 	nested_vmx_procbased_ctls_high &=
 		CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
 		CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
-		CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
+		CPU_BASED_CR3_LOAD_EXITING |
 		CPU_BASED_CR3_STORE_EXITING |
 #ifdef CONFIG_X86_64
 		CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
 #endif
 		CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
@@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
 	      CPU_BASED_CR3_LOAD_EXITING |
 	      CPU_BASED_CR3_STORE_EXITING |
 	      CPU_BASED_USE_IO_BITMAPS |
 	      CPU_BASED_MOV_DR_EXITING |
 	      CPU_BASED_USE_TSC_OFFSETING |
-	      CPU_BASED_MWAIT_EXITING |
-	      CPU_BASED_MONITOR_EXITING |
 	      CPU_BASED_INVLPG_EXITING |
 	      CPU_BASED_RDPMC_EXITING;
 
 	opt = CPU_BASED_TPR_SHADOW |
 	      CPU_BASED_USE_MSR_BITMAPS |

If all you're trying to do is (selectively) revert to this behavior,
that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
confused at this point :)

Back in 2010, running MWAIT in L>=1  behaved 100% exactly like a NOP,
didn't power down the physical CPU, just immediately moved on to the
next instruction. As such, there was no power saving and no
opportunity to yield to another L0 thread either, unlike with NOP
emulation at L0.

Did that change on newer Intel chips (i.e., is guest-mode MWAIT now
doing something smarter than just acting as a guest-mode NOP) ?

Thanks,
--Gabriel