Re: Nested VMX - L1 hangs on running L2

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jul 20, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2":
> > > No, both patches are wrong.
> >
> > > The correct fix is to make kvm_get_msr() return the L1 guest TSC at all
> > times.
> > >  We are serving the L1 guest in this hypervisor, not the L2 guest, and so
> >
> > > should never read its TSC for any reason.

I've been thinking about how to correctly solve this problem, and about this
statement that you made, that get_read_tsc() should *always* return L1's TSC,
even while running L2.

We have two quite-different uses for the TSC-related functions:

 1. They are used in vmx.c to emulate *architectural* instructions, namely
    RDTSC, RDMSR of TSC, and WRMSR of TSC.

 2. They are used in x86.c for internal KVM uses, adjusting the TSC for
    various reasons (most of which, I have to admit, I don't understand).

You said above that kvm_get_msr(), calling guest_read_tsc(), should always
return L1's TSC, even when running L2. However, I believe that that for
use #1 - the *architecturally correct* emulation of RDTSC and RDMSR when
L2 is running, the current nested-agnostic code - which ends up returning L2's
TSC when L2 is running - is the correct thing to do.

Look at what happens when:

 When we get an exit to L0 when L2 run RDMSR TSC:
   This may exit to L1, if L1 asked in the MSR Bitmap. Done correctly.
   If hasn't exited, the spec says we need to return L1's TSC plus the
   TSC_OFFSET that L1 set for L2 (if L1 asked for TSC_OFFSETTING).
   In other words, we should return L0's TSC offsetted by the sum of vmcs01
   and vmcs12 tsc_offsets. This sum is exactly what we have in the current
   (vmcs02) TSC_OFFSET. So the existing guest_read_tsc, which returns
   rdtscll() + vmcs_read64(TSC_OFFSET), does the correct thing - also for
   nested.

 When we get an exit to L0 when L2 runs RDTSC:
   This may exit to L1 if L1 asked for TSC_EXITING. Done correctly.
   There is no "else" here - we would get an EXIT_REASON_RDTSC only if L1
   asked for it (because KVM itself doesn't use TSC_EXITING).

Regarding emulation of WRMSR TSC, I actually had an error in the code which
I correct in a patch below: According to the Intel spec, it turns out that
if WRMSR of the TSC does not cause an exit, then it should change the host
TSC itself, ignoring TSC_OFFSET. This is very strange, because it means that
if the host doesn't trap MSR read and write of the TSC, then if a guest
writes TSC and then immediately reads it, it won't read what is just wrote,
but rather what it wrote pluss the TSC_OFFSET (because the TSC_OFFSET is
added on read, but not subtracted on write). I don't know why the Intel spec
specifies this, but it does - and it's not what I emulated to I fixed that
now (in the patch below). But I doubt this is the problem that Bandan and
others saw?

So now, if I'm right, after this patch, the *architectural* uses of
guest_read_tsc() and vmx_write_tsc_offset() to emulate RDTSR and RD/WRMSR,
are done correctly.

But the problem is that these functions are also used in KVM for other things.
Ideally, whomever calls kvm_read_msr()/kvm_write_msr() should realize that it
gets the value for the current running guest, which might be L1 or L2, just
like if that guest would read or write the MSR. Ideally, KVM might read the
MSR, change it and write it back, and everything would be correct as both
read and write code are done correctly. The question is, is some of the code -
perhaps in kvm_guest_time_update(), kvm_arch_vcpu_load(), kvm_arch_vcpu_put()
or vcpu_enter_guest() (these are functions which touch the TSC), which only
work correctly when reading/writing L1 TSC and can't work correctly if L2
TSC is read and written? I'm afraid I don't know enough about what these
functions are doing, to understand what, if anything, they are doing wrong.

Bandan and others who can reproduce this bug - I attach below a patch which
should correct some architectural emulation issues I discovered while reading
the code again now - especially of TSC MSR write. Can you please try if
this solves anything? I'm not too optimistic, but since the code was wrong
in this regard, it's at least worth a try.

Thanks,
Nadav.

Subject: [PATCH] nVMX: Fix nested TSC handling

This patch fixes several areas where nested VMX did not correctly emulated
TSC handling.

Signed-off-by: Nadav Har'El <nyh@xxxxxxxxxx>
---
 arch/x86/kvm/vmx.c |   31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-07-28 14:02:52.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-07-28 14:02:52.000000000 +0300
@@ -1762,15 +1762,23 @@ static void vmx_set_tsc_khz(struct kvm_v
  */
 static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
-	vmcs_write64(TSC_OFFSET, offset);
-	if (is_guest_mode(vcpu))
+	if (is_guest_mode(vcpu)) {
 		/*
-		 * We're here if L1 chose not to trap the TSC MSR. Since
-		 * prepare_vmcs12() does not copy tsc_offset, we need to also
-		 * set the vmcs12 field here.
+		 * We're here if L1 chose not to trap WRMSR to TSC. According
+		 * to the spec, this should set L1's TSC; The offset that L1
+		 * set for L2 remains unchanged, and still needs to be added
+		 * to the newly set TSC to get L2's TSC.
 		 */
-		get_vmcs12(vcpu)->tsc_offset = offset -
-			to_vmx(vcpu)->nested.vmcs01_tsc_offset;
+		struct vmcs12 *vmcs12;
+		to_vmx(vcpu)->nested.vmcs01_tsc_offset = offset;
+		/* recalculate vmcs02.TSC_OFFSET: */
+		vmcs12 = get_vmcs12(vcpu);
+		vmcs_write64(TSC_OFFSET, offset +
+			(nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
+			 vmcs12->tsc_offset : 0));
+	} else {
+		vmcs_write64(TSC_OFFSET, offset);
+	}
 }
 
 static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
@@ -6514,8 +6522,11 @@ static void prepare_vmcs02(struct kvm_vc
 
 	set_cr4_guest_host_mask(vmx);
 
-	vmcs_write64(TSC_OFFSET,
-		vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
+	if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING)
+		vmcs_write64(TSC_OFFSET,
+			vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
+	else
+		vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
 
 	if (enable_vpid) {
 		/*
@@ -6922,7 +6933,7 @@ static void nested_vmx_vmexit(struct kvm
 
 	load_vmcs12_host_state(vcpu, vmcs12);
 
-	/* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */
+	/* Update TSC_OFFSET if TSC was changed while L2 ran */
 	vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
 
 	/* This is needed for same reason as it was needed in prepare_vmcs02 */
-- 
Nadav Har'El                        |    Thursday, Jul 28 2011, 26 Tammuz 5771
nyh@xxxxxxxxxxxxxxxxxxx             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Just remember that if the world didn't
http://nadav.harel.org.il           |suck, we would all fall off.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux