Hello all,

I've been poking at this bug for a while and following the discussion, so I
thought I'd bring all the information together in one place.

First, I've been able to reliably reproduce this bug. Here is (what I believe
to be) the relevant information:

* Host setup (8 CPUs): Ubuntu Maverick with the kvm HEAD kernel, on an
  Intel i7.

* L1 guest (1 virtual CPU): Ubuntu Maverick, standard distro kernel,
  started like so:

    qemu-system-x86_64 -enable-kvm -cpu qemu64,+vmx -hda mav-mini.img \
        -nographic -vnc :1 -net nic -net tap,script="${tapup}" \
        -serial file:/tmp/mav-mini-console

* L2 guest (1 virtual CPU): the 'linux-0.2.img' image from the QEMU web site
  [http://wiki.qemu.org/download/linux-0.2.img.bz2], started like so:

    qemu-system-x86_64 -enable-kvm -hda linux-0.2.img -nographic \
        -vnc 172.17.1.5:1 -net none

Steps to reproduce:

1. Start the L1 guest.

2. In the L1 guest, ensure that the clocksource is 'kvm-clock':

    echo "kvm-clock" | sudo tee \
        /sys/devices/system/clocksource/clocksource0/current_clocksource

3. In the L1 guest, start the L2 guest.

4. In the L2 guest, run the nbench benchmark.

5. On the host, use taskset to forcibly kick the qemu-system-x86_64 process
   to different host CPUs until L1 hangs.

Next, two work-arounds have been mentioned. For me, either one by itself is
sufficient to prevent hangs.

Workaround #1: Use the 'jiffies' clocksource in L1. I suspect any other
clocksource would also work, but I haven't tested them.

Workaround #2: Using taskset, pin the qemu-system-x86_64 process managing L1
to a single host CPU.

The previous messages to the list have also already pointed at the problem,
but not in full detail. For clarity, here's the sequence that appears to
trigger the hang:

1. The qemu-system-x86_64 process migrates to a new host CPU.

2. On the next call to kvm_arch_vcpu_load() in kvm/x86.c, we enter this
   branch:

    if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
        ....
        kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
        ....
    }

3. The KVM_REQ_CLOCK_UPDATE request gets handled in vcpu_enter_guest()
   [kvm/x86.c]:

    if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
        r = kvm_guest_time_update(vcpu);
        ....
    }

4. In kvm_guest_time_update() [kvm/x86.c], we call kvm_get_msr() to read
   MSR_IA32_TSC, assuming we'll get the TSC value of L1:

    static int kvm_guest_time_update(struct kvm_vcpu *v)
    {
        ...
        kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
        ...

   This is the heart of the problem. If we're entering L2 rather than L1,
   tsc_timestamp gets L2's TSC. We then do some computation based on wall
   clock/system time, and we write the (possibly adjusted) TSC value directly
   into a data structure used by the L1 kvm-clock (see also
   arch/x86/kernel/kvmclock.c and include/asm/kvm_host.h):

    /* With all the info we got, fill in the values */
    vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
    vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
    vcpu->last_kernel_ns = kernel_ns;
    vcpu->last_guest_tsc = tsc_timestamp;
    vcpu->hv_clock.flags = 0;

   Note that vcpu->hv_clock is actually mapped to a guest page
   (vcpu->time_page) and shared directly with the L1 kvm-clock
   paravirtualized clocksource.

5. If we read L2's TSC in the previous step, the L1 kvm-clock now has an
   arbitrary value. Presumably the scheduler becomes horribly confused, and
   nothing gets scheduled. This is illustrated by the output I get from the
   L1 soft lockup detector (stuck for 169 years, eh?):

    [5750304228.280373] BUG: soft lockup - CPU#0 stuck for 5355388067s! [qemu-system-x86:731]

As further evidence, if I patch guest_read_tsc() so that it always returns
the L1 TSC, I can no longer reproduce the hang.

As for the correct way to fix it, I don't yet understand the VMX spec or the
possible cases well enough to know what the correct behavior should be.

-Matt McGill

On Jul 28, 2011, at 7:11 AM, Nadav Har'El wrote:

> On Wed, Jul 20, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2":
>> No, both patches are wrong.
>> The correct fix is to make kvm_get_msr() return the L1 guest TSC at all
>> times. We are serving the L1 guest in this hypervisor, not the L2 guest,
>> and so should never read its TSC for any reason.
>
> I've been thinking about how to correctly solve this problem, and about this
> statement that you made, that guest_read_tsc() should *always* return L1's
> TSC, even while running L2.
>
> We have two quite different uses for the TSC-related functions:
>
> 1. They are used in vmx.c to emulate *architectural* instructions, namely
>    RDTSC, RDMSR of TSC, and WRMSR of TSC.
>
> 2. They are used in x86.c for internal KVM uses, adjusting the TSC for
>    various reasons (most of which, I have to admit, I don't understand).
>
> You said above that kvm_get_msr(), calling guest_read_tsc(), should always
> return L1's TSC, even when running L2. However, I believe that for use #1 -
> the *architecturally correct* emulation of RDTSC and RDMSR while L2 is
> running - the current nested-agnostic code, which ends up returning L2's
> TSC when L2 is running, is the correct thing to do.
>
> Look at what happens:
>
> When we get an exit to L0 while L2 runs RDMSR TSC:
>    This may exit to L1, if L1 asked in the MSR bitmap. Done correctly.
>    If it hasn't exited, the spec says we need to return L1's TSC plus the
>    TSC_OFFSET that L1 set for L2 (if L1 asked for TSC offsetting).
>    In other words, we should return L0's TSC offset by the sum of the
>    vmcs01 and vmcs12 tsc_offsets. This sum is exactly what we have in the
>    current (vmcs02) TSC_OFFSET. So the existing guest_read_tsc(), which
>    returns rdtscll() + vmcs_read64(TSC_OFFSET), does the correct thing -
>    also for nested.
>
> When we get an exit to L0 while L2 runs RDTSC:
>    This may exit to L1 if L1 asked for TSC_EXITING. Done correctly.
>    There is no "else" here - we would get an EXIT_REASON_RDTSC only if L1
>    asked for it (because KVM itself doesn't use TSC_EXITING).
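The offset arithmetic in the RDMSR case above can be sketched as a small user-space model. The names below (host_tsc, vmcs01_offset, vmcs12_offset) are hypothetical stand-ins for the real KVM/VMX fields; the point is only that vmcs02.TSC_OFFSET holds the sum of both offsets, so reading the host TSC plus the current offset yields L2's TSC:

```c
#include <stdint.h>

/* Hypothetical model of nested TSC offsetting; not actual KVM code. */

/* L1's TSC as L1 sees it: host TSC plus the offset L0 set in vmcs01. */
static uint64_t l1_tsc(uint64_t host_tsc, int64_t vmcs01_offset)
{
    return host_tsc + (uint64_t)vmcs01_offset;
}

/*
 * L2's TSC: L1's TSC plus the offset L1 set in vmcs12, i.e. the host TSC
 * plus the sum of both offsets -- the value kept in vmcs02.TSC_OFFSET.
 */
static uint64_t l2_tsc(uint64_t host_tsc, int64_t vmcs01_offset,
                       int64_t vmcs12_offset)
{
    return host_tsc + (uint64_t)(vmcs01_offset + vmcs12_offset);
}
```

With these definitions, l2_tsc() is always l1_tsc() shifted by the vmcs12 offset, matching the "sum of offsets" argument above.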
>
> Regarding emulation of WRMSR TSC, I actually had an error in the code,
> which I correct in the patch below: according to the Intel spec, it turns
> out that if a WRMSR of the TSC does not cause an exit, it should change the
> host TSC itself, ignoring TSC_OFFSET. This is very strange, because it
> means that if the host doesn't trap MSR reads and writes of the TSC, then
> if a guest writes the TSC and then immediately reads it, it won't read what
> it just wrote, but rather what it wrote plus the TSC_OFFSET (because
> TSC_OFFSET is added on read, but not subtracted on write). I don't know why
> the Intel spec specifies this, but it does - and it's not what I emulated,
> so I fixed that now (in the patch below). But I doubt this is the problem
> that Bandan and others saw.
>
> So now, if I'm right, after this patch the *architectural* uses of
> guest_read_tsc() and vmx_write_tsc_offset() to emulate RDTSC and RD/WRMSR
> are done correctly.
>
> But the problem is that these functions are also used in KVM for other
> things. Ideally, whoever calls kvm_read_msr()/kvm_write_msr() should
> realize that it gets the value for the currently running guest, which might
> be L1 or L2, just as if that guest had read or written the MSR. Ideally,
> KVM might read the MSR, change it and write it back, and everything would
> be correct, as both the read and write paths are done correctly. The
> question is whether some of the code - perhaps in kvm_guest_time_update(),
> kvm_arch_vcpu_load(), kvm_arch_vcpu_put() or vcpu_enter_guest() (these are
> functions which touch the TSC) - only works correctly when reading/writing
> the L1 TSC, and can't work correctly if the L2 TSC is read and written. I'm
> afraid I don't know enough about what these functions are doing to
> understand what, if anything, they are doing wrong.
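The WRMSR/RDMSR asymmetry described above (offset added on read, not subtracted on write) can be illustrated with a toy model. The struct and function names here are hypothetical, not KVM's:

```c
#include <stdint.h>

/*
 * Toy model of non-intercepted IA32_TSC access under VMX, per the reading
 * of the Intel spec above: WRMSR sets the underlying (host) TSC directly,
 * while RDMSR returns host TSC + TSC_OFFSET. Hypothetical names.
 */
struct tsc_model {
    uint64_t host_tsc;
    int64_t  tsc_offset;
};

static void wrmsr_tsc(struct tsc_model *m, uint64_t val)
{
    m->host_tsc = val;    /* the offset is NOT subtracted on write */
}

static uint64_t rdmsr_tsc(const struct tsc_model *m)
{
    return m->host_tsc + (uint64_t)m->tsc_offset;  /* offset added on read */
}
```

So with a nonzero offset, a guest that writes the TSC and immediately reads it back sees the written value plus TSC_OFFSET, which is exactly the oddity Nadav notes.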
>
> Bandan and others who can reproduce this bug - I attach below a patch which
> should correct some architectural emulation issues I discovered while
> reading the code again now - especially of TSC MSR writes. Can you please
> try whether this solves anything? I'm not too optimistic, but since the
> code was wrong in this regard, it's at least worth a try.
>
> Thanks,
> Nadav.
>
> Subject: [PATCH] nVMX: Fix nested TSC handling
>
> This patch fixes several areas where nested VMX did not correctly emulate
> TSC handling.
>
> Signed-off-by: Nadav Har'El <nyh@xxxxxxxxxx>
> ---
>  arch/x86/kvm/vmx.c |   31 +++++++++++++++++++++----------
>  1 file changed, 21 insertions(+), 10 deletions(-)
>
> --- .before/arch/x86/kvm/vmx.c	2011-07-28 14:02:52.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-07-28 14:02:52.000000000 +0300
> @@ -1762,15 +1762,23 @@ static void vmx_set_tsc_khz(struct kvm_v
>   */
>  static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  {
> -	vmcs_write64(TSC_OFFSET, offset);
> -	if (is_guest_mode(vcpu))
> +	if (is_guest_mode(vcpu)) {
>  		/*
> -		 * We're here if L1 chose not to trap the TSC MSR. Since
> -		 * prepare_vmcs12() does not copy tsc_offset, we need to also
> -		 * set the vmcs12 field here.
> +		 * We're here if L1 chose not to trap WRMSR to TSC. According
> +		 * to the spec, this should set L1's TSC; The offset that L1
> +		 * set for L2 remains unchanged, and still needs to be added
> +		 * to the newly set TSC to get L2's TSC.
>  		 */
> -		get_vmcs12(vcpu)->tsc_offset = offset -
> -			to_vmx(vcpu)->nested.vmcs01_tsc_offset;
> +		struct vmcs12 *vmcs12;
> +		to_vmx(vcpu)->nested.vmcs01_tsc_offset = offset;
> +		/* recalculate vmcs02.TSC_OFFSET: */
> +		vmcs12 = get_vmcs12(vcpu);
> +		vmcs_write64(TSC_OFFSET, offset +
> +			(nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
> +			 vmcs12->tsc_offset : 0));
> +	} else {
> +		vmcs_write64(TSC_OFFSET, offset);
> +	}
>  }
>
>  static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
> @@ -6514,8 +6522,11 @@ static void prepare_vmcs02(struct kvm_vc
>
>  	set_cr4_guest_host_mask(vmx);
>
> -	vmcs_write64(TSC_OFFSET,
> -		vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> +	if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING)
> +		vmcs_write64(TSC_OFFSET,
> +			vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> +	else
> +		vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
>
>  	if (enable_vpid) {
>  		/*
> @@ -6922,7 +6933,7 @@ static void nested_vmx_vmexit(struct kvm
>
>  	load_vmcs12_host_state(vcpu, vmcs12);
>
> -	/* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */
> +	/* Update TSC_OFFSET if TSC was changed while L2 ran */
>  	vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
>
>  	/* This is needed for same reason as it was needed in prepare_vmcs02 */
> -- 
> Nadav Har'El                        |    Thursday, Jul 28 2011, 26 Tammuz 5771
> nyh@xxxxxxxxxxxxxxxxxxx             |-----------------------------------------
> Phone +972-523-790466, ICQ 13349191 |Just remember that if the world didn't
> http://nadav.harel.org.il           |suck, we would all fall off.
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
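The patched vmx_write_tsc_offset() logic in the diff above can be modeled in isolation as follows. This is a sketch with hypothetical struct and field names (the real code uses struct vcpu_vmx, get_vmcs12() and vmcs_write64()), showing only the offset recalculation:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the patched vmx_write_tsc_offset(); not KVM code. */
struct nested_state {
    bool    guest_mode;             /* is L2 currently running? */
    bool    l1_uses_tsc_offsetting; /* L1 set CPU_BASED_USE_TSC_OFFSETING */
    int64_t vmcs12_tsc_offset;      /* offset L1 chose for L2 */
    int64_t vmcs01_tsc_offset;      /* offset L0 chose for L1 (saved) */
    int64_t hw_tsc_offset;          /* what lands in the active VMCS */
};

/*
 * Setting a new L1 offset while L2 runs: save it as the vmcs01 offset, and
 * program the hardware with it plus L1's offset for L2 (if L1 enabled TSC
 * offsetting). Outside guest mode, the offset goes straight to the VMCS.
 */
static void write_tsc_offset(struct nested_state *s, int64_t offset)
{
    if (s->guest_mode) {
        s->vmcs01_tsc_offset = offset;
        s->hw_tsc_offset = offset +
            (s->l1_uses_tsc_offsetting ? s->vmcs12_tsc_offset : 0);
    } else {
        s->hw_tsc_offset = offset;
    }
}
```

This captures the fix's key property: a write that reaches L1's TSC no longer folds the delta into vmcs12->tsc_offset (as the old code did); L1's offset for L2 stays untouched and is simply re-added on top of the new base.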