Re: Nested VMX - L1 hangs on running L2

On Fri, Jul 29, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2":
>...
> In that case, you need to distinguish between reads of the TSC MSR by
> the guest and reads by the host (as done internally to track drift and
>...
> Unfortunately, the layering currently doesn't seem to allow for this,
> and it looks like both vendor specific variants currently get this
> wrong.
>...
> 1) Add a new vendor-specific API call, vmx_x86_ops->get_L1_TSC(), and
> transform current uses of the code which does TSC compensation to use
> this new API.  *Bonus* - no need to do double indirection through the
> generic MSR code.

Thank you so much for this analysis!

I propose the following two patches. The first fixes two small bugs in the
nested TSC emulation in the VMX case, and the second solves the
originally-reported bug as you outlined above:

The code in x86.c which used to call *_get_msr() to get the TSC no longer
does so - because *_get_msr() should only be called with the intention of
reading the MSR of the current guest (L1 or L2) and returning it to that
guest.

Instead, I added a new x86_op read_l1_tsc(vcpu), whose intention is to
always return L1's notion of the current TSC - even if the current guest
is L2. All calls in x86.c now use this new read_l1_tsc() function instead
of reading the MSR - does this look correct to you?

> read via RDMSR which is mis-virtualized (this bug exists today in the
> SVM implementation if I am reading it correctly - cc'd Joerg to notify

I believe that you're right. I created (in the patch below) svm.c's
read_l1_tsc() with the same code you had in svm_get_msr(), but I think
that in svm_get_msr() the code should be different in the nested case -
I think you should use svm->vmcb, not get_host_vmcb()? I didn't add this svm.c
fix to the patch, because I'm not sure at all that my understanding here is
correct.

Zachary, Marcelo, do these patches look right to you?
Bandan, and others who can reproduce this bug - do these patches solve the
bug?

> So you are right, this is still wrong for the case in which L1 does
> not trap TSC MSR reads.  Note however, the RDTSC instruction is still
> virtualized properly, it is only the relatively rare actual TSC MSR
> read via RDMSR which is mis-virtualized (this bug exists today in the
> SVM implementation if I am reading it correctly - cc'd Joerg to notify
> him of that).  That, combined with the relative importance of
> supporting a guest which does not trap on these MSR reads suggest this
> is a low priority design issue however (RDTSC still works even if the
> MSR is trapped, correct?)

Yes, RDTSC_EXITING is separate from the MSR bitmap. I agree that these
are indeed obscure corner cases, but I think that it's still really confusing
that a function called kvm_get_msr deliberately works incorrectly (always
returning the L1 TSC) just so that it can fit some other uses. The SVM code
worked in the common case, but it was still confusing why svm_get_msr()
deliberately goes to great lengths to use L1's TSC_OFFSET even when L2 is
running - when this is not how this MSR is really supposed to behave. Not to
mention that one day, somebody *will* want to use these obscure settings and
be surprised that they don't work ;-)

Subject: [PATCH 1/2] nVMX: Fix nested TSC emulation

This patch fixes two corner cases in nested (L2) handling of TSC-related
exits:

1. Somewhat surprisingly, according to the Intel spec, if L1 allows WRMSR to
the TSC MSR without an exit, then this should set L1's TSC value itself - not
offset by vmcs12.TSC_OFFSET (as was wrongly done in the previous code).

2. Allow L1 to disable the TSC_OFFSETING control, in which case
vmcs12.TSC_OFFSET is correctly ignored.

Signed-off-by: Nadav Har'El <nyh@xxxxxxxxxx>
---
 arch/x86/kvm/vmx.c |   31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-07-31 16:17:30.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-07-31 16:17:30.000000000 +0300
@@ -1762,15 +1762,23 @@ static void vmx_set_tsc_khz(struct kvm_v
  */
 static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
-	vmcs_write64(TSC_OFFSET, offset);
-	if (is_guest_mode(vcpu))
+	if (is_guest_mode(vcpu)) {
 		/*
-		 * We're here if L1 chose not to trap the TSC MSR. Since
-		 * prepare_vmcs12() does not copy tsc_offset, we need to also
-		 * set the vmcs12 field here.
+		 * We're here if L1 chose not to trap WRMSR to TSC. According
+		 * to the spec, this should set L1's TSC; The offset that L1
+		 * set for L2 remains unchanged, and still needs to be added
+		 * to the newly set TSC to get L2's TSC.
 		 */
-		get_vmcs12(vcpu)->tsc_offset = offset -
-			to_vmx(vcpu)->nested.vmcs01_tsc_offset;
+		struct vmcs12 *vmcs12;
+		to_vmx(vcpu)->nested.vmcs01_tsc_offset = offset;
+		/* recalculate vmcs02.TSC_OFFSET: */
+		vmcs12 = get_vmcs12(vcpu);
+		vmcs_write64(TSC_OFFSET, offset +
+			(nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
+			 vmcs12->tsc_offset : 0));
+	} else {
+		vmcs_write64(TSC_OFFSET, offset);
+	}
 }
 
 static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
@@ -6514,8 +6522,11 @@ static void prepare_vmcs02(struct kvm_vc
 
 	set_cr4_guest_host_mask(vmx);
 
-	vmcs_write64(TSC_OFFSET,
-		vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
+	if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING)
+		vmcs_write64(TSC_OFFSET,
+			vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
+	else
+		vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
 
 	if (enable_vpid) {
 		/*
@@ -6922,7 +6933,7 @@ static void nested_vmx_vmexit(struct kvm
 
 	load_vmcs12_host_state(vcpu, vmcs12);
 
-	/* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */
+	/* Update TSC_OFFSET if TSC was changed while L2 ran */
 	vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
 
 	/* This is needed for same reason as it was needed in prepare_vmcs02 */

Subject: [PATCH 2/2] nVMX: L1 TSC handling

KVM assumed in several places that reading the TSC MSR returns the value for
L1. This is incorrect, because when L2 is running, the correct TSC read exit
emulation is to return L2's value.

We therefore add a new x86_ops function, read_l1_tsc, to use in places that
specifically need to read the L1 TSC, NOT the TSC of the current level of
guest.

Signed-off-by: Nadav Har'El <nyh@xxxxxxxxxx>
---
 arch/x86/include/asm/kvm_host.h |    2 ++
 arch/x86/kvm/svm.c              |    9 +++++++++
 arch/x86/kvm/vmx.c              |   17 +++++++++++++++++
 arch/x86/kvm/x86.c              |    8 ++++----
 4 files changed, 32 insertions(+), 4 deletions(-)

--- .before/arch/x86/include/asm/kvm_host.h	2011-07-31 16:17:30.000000000 +0300
+++ .after/arch/x86/include/asm/kvm_host.h	2011-07-31 16:17:30.000000000 +0300
@@ -636,6 +636,8 @@ struct kvm_x86_ops {
 			       struct x86_instruction_info *info,
 			       enum x86_intercept_stage stage);
 
+	u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu);
+
 	const struct trace_print_flags *exit_reasons_str;
 };
 
--- .before/arch/x86/kvm/vmx.c	2011-07-31 16:17:30.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-07-31 16:17:30.000000000 +0300
@@ -1748,6 +1748,21 @@ static u64 guest_read_tsc(void)
 }
 
 /*
+ * Like guest_read_tsc, but always returns L1's notion of the timestamp
+ * counter, even if a nested guest (L2) is currently running.
+ */
+static u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu)
+{
+	u64 host_tsc, tsc_offset;
+
+	rdtscll(host_tsc);
+	tsc_offset = is_guest_mode(vcpu) ?
+		to_vmx(vcpu)->nested.vmcs01_tsc_offset :
+		vmcs_read64(TSC_OFFSET);
+	return host_tsc + tsc_offset;
+}
+
+/*
  * Empty call-back. Needs to be implemented when VMX enables the SET_TSC_KHZ
  * ioctl. In this case the call-back should update internal vmx state to make
  * the changes effective.
@@ -7070,6 +7085,8 @@ static struct kvm_x86_ops vmx_x86_ops = 
 	.set_tdp_cr3 = vmx_set_cr3,
 
 	.check_intercept = vmx_check_intercept,
+
+	.read_l1_tsc = vmx_read_l1_tsc,
 };
 
 static int __init vmx_init(void)
--- .before/arch/x86/kvm/svm.c	2011-07-31 16:17:30.000000000 +0300
+++ .after/arch/x86/kvm/svm.c	2011-07-31 16:17:30.000000000 +0300
@@ -2894,6 +2894,13 @@ static int cr8_write_interception(struct
 	return 0;
 }
 
+static u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu)
+{
+	struct vmcb *vmcb = get_host_vmcb(to_svm(vcpu));
+
+	return vmcb->control.tsc_offset +
+			svm_scale_tsc(vcpu, native_read_tsc());
+}
+
+
 static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -4243,6 +4250,8 @@ static struct kvm_x86_ops svm_x86_ops = 
 	.set_tdp_cr3 = set_tdp_cr3,
 
 	.check_intercept = svm_check_intercept,
+
+	.read_l1_tsc = svm_read_l1_tsc,
 };
 
 static int __init svm_init(void)
--- .before/arch/x86/kvm/x86.c	2011-07-31 16:17:30.000000000 +0300
+++ .after/arch/x86/kvm/x86.c	2011-07-31 16:17:30.000000000 +0300
@@ -1098,7 +1098,7 @@ static int kvm_guest_time_update(struct 
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
-	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
+	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v);
 	kernel_ns = get_kernel_ns();
 	this_tsc_khz = vcpu_tsc_khz(v);
 	if (unlikely(this_tsc_khz == 0)) {
@@ -2215,7 +2215,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu 
 		s64 tsc_delta;
 		u64 tsc;
 
-		kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc);
+		tsc = kvm_x86_ops->read_l1_tsc(vcpu);
 		tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
 			     tsc - vcpu->arch.last_guest_tsc;
 
@@ -2239,7 +2239,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *
 {
 	kvm_x86_ops->vcpu_put(vcpu);
 	kvm_put_guest_fpu(vcpu);
-	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
+	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
 }
 
 static int is_efer_nx(void)
@@ -5722,7 +5722,7 @@ static int vcpu_enter_guest(struct kvm_v
 	if (hw_breakpoint_active())
 		hw_breakpoint_restore();
 
-	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
+	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
 
 	vcpu->mode = OUTSIDE_GUEST_MODE;
 	smp_wmb();

-- 
Nadav Har'El                        |      Sunday, Jul 31 2011, 29 Tammuz 5771
nyh@xxxxxxxxxxxxxxxxxxx             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Support bacteria - they're the only
http://nadav.harel.org.il           |culture some people have!