This patch looks good, with one comment noted inline below.

Are there no other call sites of kvm_get_msr(), or functions which alias
kvm_get_msr(MSR_IA32_TSC)?  Did I miss the first patch?

Zach

On Sun, Jul 31, 2011 at 6:48 AM, Nadav Har'El <nyh@xxxxxxxxxxxxxxxxxxx> wrote:
> On Fri, Jul 29, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2":
>>...
>> In that case, you need to distinguish between reads of the TSC MSR by
>> the guest and reads by the host (as done internally to track drift and
>>...
>> Unfortunately, the layering currently doesn't seem to allow for this,
>> and it looks like both vendor-specific variants currently get this
>> wrong.
>>...
>> 1) Add a new vendor-specific API call, vmx_x86_ops->get_L1_TSC(), and
>> transform current uses of the code which does TSC compensation to use
>> this new API. *Bonus* - no need to do double indirection through the
>> generic MSR code.
>
> Thank you so much for this analysis!
>
> I propose the following two patches. The first one just fixes two small
> bugs in the current emulation in the VMX case, and the second patch
> solves the originally-reported bug as you outlined above:
>
> The code in x86.c which used to call *_get_msr() to get the TSC no
> longer does so - because *_get_msr() should only be called with the
> intention of reading the MSR of the current guest (L1 or L2) and
> returning it to that guest.
>
> Instead, I added a new x86_op read_l1_tsc(vcpu), whose intention is to
> always return L1's notion of the current TSC - even if the current
> guest is L2. All calls in x86.c now use this new read_l1_tsc() function
> instead of reading the MSR - does this look correct to you?
>
>> read via RDMSR which is mis-virtualized (this bug exists today in the
>> SVM implementation if I am reading it correctly - cc'd Joerg to notify
>
> I believe that you're right. I created (in the patch below) svm.c's
> read_l1_tsc() with the same code you had in svm_get_msr(), but I think
> that in svm_get_msr() the code should be different in the nested case -
> I think you should use svm->vmcb, not get_host_vmcb(). I didn't add
> this svm.c fix to the patch, because I'm not at all sure that my
> understanding here is correct.
>
> Zachary, Marcelo, do these patches look right to you?
> Bandan, and others who can reproduce this bug - do these patches solve
> the bug?
>
>> So you are right, this is still wrong for the case in which L1 does
>> not trap TSC MSR reads. Note however, the RDTSC instruction is still
>> virtualized properly; it is only the relatively rare actual TSC MSR
>> read via RDMSR which is mis-virtualized (this bug exists today in the
>> SVM implementation if I am reading it correctly - cc'd Joerg to notify
>> him of that). That, combined with the relative importance of
>> supporting a guest which does not trap on these MSR reads, suggests
>> that this is a low-priority design issue, however (RDTSC still works
>> even if the MSR is trapped, correct?)
>
> Yes, RDTSC_EXITING is separate from the MSR bitmap. I agree that these
> are indeed obscure corner cases, but I think that it's still really
> confusing that a function called kvm_get_msr() deliberately works
> incorrectly (always returning the L1 TSC) just so that it can fit some
> other uses. The SVM code worked in the common case, but it was still
> confusing why svm_get_msr() deliberately goes to great lengths to use
> L1's TSC_OFFSET even when L2 is running - when this is not how this MSR
> is really supposed to behave.
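For concreteness, the svm_get_msr() change you describe - using svm->vmcb
so that a trapped MSR read reports the TSC of the *current* level of
guest - might look roughly like the sketch below. This is untested and is
not part of the two patches that follow; it only illustrates the
svm->vmcb vs. get_host_vmcb() distinction (svm is already in scope in
svm_get_msr()):

        case MSR_IA32_TSC: {
                /*
                 * Untested sketch: report the current guest level's TSC.
                 * While L2 runs, svm->vmcb is L2's VMCB and its tsc_offset
                 * already combines L1's and L2's offsets; get_host_vmcb()
                 * would always return L1's VMCB, and hence L1's offset.
                 */
                *data = svm->vmcb->control.tsc_offset +
                        svm_scale_tsc(vcpu, native_read_tsc());
                break;
        }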
> Not to mention that one day, somebody *will* want to use these obscure
> settings and be surprised that they don't work ;-)
>
> Subject: [PATCH 1/2] nVMX: Fix nested TSC emulation
>
> This patch fixes two corner cases in nested (L2) handling of TSC-related
> exits:
>
> 1. Somewhat surprisingly, according to the Intel spec, if L1 allows WRMSR
> to the TSC MSR without an exit, then this write should set L1's TSC value
> itself - not a value offset by vmcs12.TSC_OFFSET (as the previous code
> wrongly did).
>
> 2. Allow L1 to disable the TSC_OFFSETING control, and then correctly
> ignore vmcs12.TSC_OFFSET.
>
> Signed-off-by: Nadav Har'El <nyh@xxxxxxxxxx>
> ---
>  arch/x86/kvm/vmx.c |   31 +++++++++++++++++++++----------
>  1 file changed, 21 insertions(+), 10 deletions(-)
>
> --- .before/arch/x86/kvm/vmx.c  2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c   2011-07-31 16:17:30.000000000 +0300
> @@ -1762,15 +1762,23 @@ static void vmx_set_tsc_khz(struct kvm_v
>   */
>  static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  {
> -        vmcs_write64(TSC_OFFSET, offset);
> -        if (is_guest_mode(vcpu))
> +        if (is_guest_mode(vcpu)) {
>                  /*
> -                 * We're here if L1 chose not to trap the TSC MSR. Since
> -                 * prepare_vmcs12() does not copy tsc_offset, we need to also
> -                 * set the vmcs12 field here.
> +                 * We're here if L1 chose not to trap WRMSR to TSC. According
> +                 * to the spec, this should set L1's TSC; the offset that L1
> +                 * set for L2 remains unchanged, and still needs to be added
> +                 * to the newly set TSC to get L2's TSC.
>                   */
> -                get_vmcs12(vcpu)->tsc_offset = offset -
> -                        to_vmx(vcpu)->nested.vmcs01_tsc_offset;
> +                struct vmcs12 *vmcs12;
> +                to_vmx(vcpu)->nested.vmcs01_tsc_offset = offset;
> +                /* recalculate vmcs02.TSC_OFFSET: */
> +                vmcs12 = get_vmcs12(vcpu);
> +                vmcs_write64(TSC_OFFSET, offset +
> +                        (nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
> +                         vmcs12->tsc_offset : 0));
> +        } else {
> +                vmcs_write64(TSC_OFFSET, offset);
> +        }
>  }
>
>  static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
> @@ -6514,8 +6522,11 @@ static void prepare_vmcs02(struct kvm_vc
>
>          set_cr4_guest_host_mask(vmx);
>
> -        vmcs_write64(TSC_OFFSET,
> -                vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> +        if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING)
> +                vmcs_write64(TSC_OFFSET,
> +                        vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> +        else
> +                vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
>
>          if (enable_vpid) {
>                  /*
> @@ -6922,7 +6933,7 @@ static void nested_vmx_vmexit(struct kvm
>
>          load_vmcs12_host_state(vcpu, vmcs12);
>
> -        /* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */
> +        /* Update TSC_OFFSET if TSC was changed while L2 ran */
>          vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
>
>          /* This is needed for same reason as it was needed in prepare_vmcs02 */
>
> Subject: [PATCH 2/2] nVMX: L1 TSC handling
>
> KVM assumed in several places that reading the TSC MSR returns the value
> for L1. This is incorrect, because when L2 is running, the correct TSC
> read exit emulation is to return L2's value.
>
> We therefore add a new x86_ops function, read_l1_tsc, to use in places
> that specifically need to read the L1 TSC, NOT the TSC of the current
> level of guest.
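To make the two notions concrete: with TSC offsetting active at both
levels, L1 observes host_tsc + vmcs01.TSC_OFFSET, while L2 observes
host_tsc + vmcs01.TSC_OFFSET + vmcs12.TSC_OFFSET. A toy user-space
illustration of just this arithmetic (my own sketch, not code from the
patch; all values are made up):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                uint64_t host_tsc   = 1000000; /* raw RDTSC on the host (L0)   */
                uint64_t vmcs01_off = 5000;    /* TSC_OFFSET L0 applies for L1 */
                uint64_t vmcs12_off = 300;     /* TSC_OFFSET L1 sets for L2    */

                uint64_t l1_tsc = host_tsc + vmcs01_off;
                uint64_t l2_tsc = host_tsc + vmcs01_off + vmcs12_off;

                /*
                 * While L2 runs, a trapped read of the TSC MSR must return
                 * l2_tsc, whereas the new read_l1_tsc() hook must keep
                 * returning l1_tsc.
                 */
                printf("L1 sees %llu, L2 sees %llu\n",
                       (unsigned long long)l1_tsc, (unsigned long long)l2_tsc);
                return 0;
        }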
>
> Signed-off-by: Nadav Har'El <nyh@xxxxxxxxxx>
> ---
>  arch/x86/include/asm/kvm_host.h |    2 ++
>  arch/x86/kvm/svm.c              |    9 +++++++++
>  arch/x86/kvm/vmx.c              |   17 +++++++++++++++++
>  arch/x86/kvm/x86.c              |    8 ++++----
>  4 files changed, 32 insertions(+), 4 deletions(-)
>
> --- .before/arch/x86/include/asm/kvm_host.h     2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/include/asm/kvm_host.h      2011-07-31 16:17:30.000000000 +0300
> @@ -636,6 +636,8 @@ struct kvm_x86_ops {
>                                 struct x86_instruction_info *info,
>                                 enum x86_intercept_stage stage);
>
> +        u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu);
> +
>          const struct trace_print_flags *exit_reasons_str;
>  };
>
> --- .before/arch/x86/kvm/vmx.c  2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c   2011-07-31 16:17:30.000000000 +0300
> @@ -1748,6 +1748,21 @@ static u64 guest_read_tsc(void)
>  }
>
>  /*
> + * Like guest_read_tsc, but always returns L1's notion of the timestamp
> + * counter, even if a nested guest (L2) is currently running.
> + */
> +static u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu)
> +{
> +        u64 host_tsc, tsc_offset;
> +
> +        rdtscll(host_tsc);
> +        tsc_offset = is_guest_mode(vcpu) ?
> +                to_vmx(vcpu)->nested.vmcs01_tsc_offset :
> +                vmcs_read64(TSC_OFFSET);
> +        return host_tsc + tsc_offset;
> +}
> +
> +/*
>   * Empty call-back. Needs to be implemented when VMX enables the SET_TSC_KHZ
>   * ioctl. In this case the call-back should update internal vmx state to make
>   * the changes effective.
> @@ -7070,6 +7085,8 @@ static struct kvm_x86_ops vmx_x86_ops =
>          .set_tdp_cr3 = vmx_set_cr3,
>
>          .check_intercept = vmx_check_intercept,
> +
> +        .read_l1_tsc = vmx_read_l1_tsc,
>  };
>
>  static int __init vmx_init(void)
> --- .before/arch/x86/kvm/svm.c  2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/kvm/svm.c   2011-07-31 16:17:30.000000000 +0300
> @@ -2894,6 +2894,13 @@ static int cr8_write_interception(struct
>          return 0;
>  }
>
> +static u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu)
> +{
> +        struct vmcb *vmcb = get_host_vmcb(to_svm(vcpu));
> +        return vmcb->control.tsc_offset +
> +                svm_scale_tsc(vcpu, native_read_tsc());
> +}
> +
>  static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
>  {
>          struct vcpu_svm *svm = to_svm(vcpu);
> @@ -4243,6 +4250,8 @@ static struct kvm_x86_ops svm_x86_ops =
>          .set_tdp_cr3 = set_tdp_cr3,
>
>          .check_intercept = svm_check_intercept,
> +
> +        .read_l1_tsc = svm_read_l1_tsc,
>  };
>
>  static int __init svm_init(void)
> --- .before/arch/x86/kvm/x86.c  2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/kvm/x86.c   2011-07-31 16:17:30.000000000 +0300
> @@ -1098,7 +1098,7 @@ static int kvm_guest_time_update(struct
>
>          /* Keep irq disabled to prevent changes to the clock */
>          local_irq_save(flags);
> -        kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
> +        tsc_timestamp = kvm_x86_ops->read_l1_tsc(v);
>          kernel_ns = get_kernel_ns();
>          this_tsc_khz = vcpu_tsc_khz(v);
>          if (unlikely(this_tsc_khz == 0)) {
> @@ -2215,7 +2215,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu
>                  s64 tsc_delta;
>                  u64 tsc;
>
> -                kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc);
> +                tsc = kvm_x86_ops->read_l1_tsc(vcpu);
>                  tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
>                          tsc - vcpu->arch.last_guest_tsc;

Technically, this call site should no longer exist; it should be re-factored
in terms of the hardware TSC, not the guest TSC. I sent a patch series which
did that, but it has not yet been applied.
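A rough sketch of what that re-factoring might look like - untested, and
with last_host_tsc as a hypothetical field that kvm_arch_vcpu_put() would
set from native_read_tsc(); it does not exist in this patch:

        /*
         * Untested sketch: compute the migration delta from raw hardware
         * TSC values, so that no guest TSC read is needed here at all.
         */
        static s64 tsc_delta_on_load(struct kvm_vcpu *vcpu)
        {
                u64 host_tsc = native_read_tsc();       /* raw hardware TSC */

                if (!vcpu->arch.last_host_tsc)          /* hypothetical field */
                        return 0;
                return host_tsc - vcpu->arch.last_host_tsc;
        }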
> @@ -2239,7 +2239,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *
>  {
>          kvm_x86_ops->vcpu_put(vcpu);
>          kvm_put_guest_fpu(vcpu);
> -        kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> +        vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
>  }
>
>  static int is_efer_nx(void)
> @@ -5722,7 +5722,7 @@ static int vcpu_enter_guest(struct kvm_v
>          if (hw_breakpoint_active())
>                  hw_breakpoint_restore();
>
> -        kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> +        vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
>
>          vcpu->mode = OUTSIDE_GUEST_MODE;
>          smp_wmb();
>
> --
> Nadav Har'El                        |      Sunday, Jul 31 2011, 29 Tammuz 5771
> nyh@xxxxxxxxxxxxxxxxxxx             |-----------------------------------------
> Phone +972-523-790466, ICQ 13349191 |Support bacteria - they're the only
> http://nadav.harel.org.il           |culture some people have!