This patch looks good, with one comment noted inline below.

Are there no other call sites of kvm_get_msr(), or functions which alias
kvm_get_msr(MSR_IA32_TSC)?  Did I miss the first patch?

Zach

On Sun, Jul 31, 2011 at 6:48 AM, Nadav Har'El <nyh@xxxxxxxxxxxxxxxxxxx> wrote:
> On Fri, Jul 29, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2":
>>...
>> In that case, you need to distinguish between reads of the TSC MSR by
>> the guest and reads by the host (as done internally to track drift and
>>...
>> Unfortunately, the layering currently doesn't seem to allow for this,
>> and it looks like both vendor-specific variants currently get this
>> wrong.
>>...
>> 1) Add a new vendor-specific API call, vmx_x86_ops->get_L1_TSC(), and
>> transform current uses of the code which does TSC compensation to use
>> this new API. *Bonus* - no need to do double indirection through the
>> generic MSR code.
>
> Thank you so much for this analysis!
>
> I propose the following two patches. The first one just fixes two small
> bugs in the current emulation in the VMX case, and the second patch
> solves the originally-reported bug as you outlined above:
>
> The code in x86.c which used to call *_get_msr() to get the TSC no
> longer does so - because *_get_msr() should only be called with the
> intention of reading the MSR of the current guest (L1 or L2) and
> returning it to that guest.
>
> Instead, I added a new x86_op read_l1_tsc(vcpu), whose intention is to
> always return L1's notion of the current TSC - even if the current
> guest is L2. All calls in x86.c now use this new read_l1_tsc() function
> instead of reading the MSR - does this look correct to you?
>
>> read via RDMSR which is mis-virtualized (this bug exists today in the
>> SVM implementation if I am reading it correctly - cc'd Joerg to notify
>
> I believe that you're right. I created (in the patch below) svm.c's
> read_l1_tsc() with the same code you had in svm_get_msr(), but I think
> that in svm_get_msr() the code should be different in the nested case -
> I think you should use svm->vmcb, not get_host_vmcb(). I didn't add
> this svm.c fix to the patch, because I'm not at all sure that my
> understanding here is correct.
>
> Zachary, Marcelo, do these patches look right to you?
> Bandan, and others who can reproduce this bug - do these patches solve
> the bug?
>
>> So you are right, this is still wrong for the case in which L1 does
>> not trap TSC MSR reads. Note however, the RDTSC instruction is still
>> virtualized properly; it is only the relatively rare actual TSC MSR
>> read via RDMSR which is mis-virtualized (this bug exists today in the
>> SVM implementation if I am reading it correctly - cc'd Joerg to notify
>> him of that). That, combined with the relative importance of
>> supporting a guest which does not trap on these MSR reads, suggests
>> that this is a low-priority design issue, however (RDTSC still works
>> even if the MSR is trapped, correct?)
>
> Yes, RDTSC_EXITING is separate from the MSR bitmap. I agree that these
> are indeed obscure corner cases, but I think that it's still really
> confusing that a function called kvm_get_msr() deliberately works
> incorrectly (always returning the L1 TSC) just so that it can fit some
> other uses. The SVM code worked in the common case, but it was still
> confusing why svm_get_msr() deliberately goes to great lengths to use
> L1's TSC_OFFSET even when L2 is running - when this is not how this MSR
> is really supposed to behave.
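For concreteness, the svm_get_msr() change you describe - using svm->vmcb
so that a trapped MSR read reports the TSC of the *current* level of
guest - might look roughly like the sketch below. This is untested and is
not part of the two patches that follow; it only illustrates the
svm->vmcb vs. get_host_vmcb() distinction (svm is already in scope in
svm_get_msr()):

        case MSR_IA32_TSC: {
                /*
                 * Untested sketch: report the current guest level's TSC.
                 * While L2 runs, svm->vmcb is L2's VMCB and its tsc_offset
                 * already combines L1's and L2's offsets; get_host_vmcb()
                 * would always return L1's VMCB, and hence L1's offset.
                 */
                *data = svm->vmcb->control.tsc_offset +
                        svm_scale_tsc(vcpu, native_read_tsc());
                break;
        }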
> Not to mention that one day, somebody *will* want to use these obscure
> settings and be surprised that they don't work ;-)
>
> Subject: [PATCH 1/2] nVMX: Fix nested TSC emulation
>
> This patch fixes two corner cases in nested (L2) handling of TSC-related
> exits:
>
> 1. Somewhat surprisingly, according to the Intel spec, if L1 allows WRMSR
> to the TSC MSR without an exit, then this write should set L1's TSC value
> itself - not a value offset by vmcs12.TSC_OFFSET (as the previous code
> wrongly did).
>
> 2. Allow L1 to disable the TSC_OFFSETING control, and then correctly
> ignore vmcs12.TSC_OFFSET.
>
> Signed-off-by: Nadav Har'El <nyh@xxxxxxxxxx>
> ---
>  arch/x86/kvm/vmx.c |   31 +++++++++++++++++++++----------
>  1 file changed, 21 insertions(+), 10 deletions(-)
>
> --- .before/arch/x86/kvm/vmx.c  2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c   2011-07-31 16:17:30.000000000 +0300
> @@ -1762,15 +1762,23 @@ static void vmx_set_tsc_khz(struct kvm_v
>   */
>  static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  {
> -        vmcs_write64(TSC_OFFSET, offset);
> -        if (is_guest_mode(vcpu))
> +        if (is_guest_mode(vcpu)) {
>                  /*
> -                 * We're here if L1 chose not to trap the TSC MSR. Since
> -                 * prepare_vmcs12() does not copy tsc_offset, we need to also
> -                 * set the vmcs12 field here.
> +                 * We're here if L1 chose not to trap WRMSR to TSC. According
> +                 * to the spec, this should set L1's TSC; the offset that L1
> +                 * set for L2 remains unchanged, and still needs to be added
> +                 * to the newly set TSC to get L2's TSC.
>                   */
> -                get_vmcs12(vcpu)->tsc_offset = offset -
> -                        to_vmx(vcpu)->nested.vmcs01_tsc_offset;
> +                struct vmcs12 *vmcs12;
> +                to_vmx(vcpu)->nested.vmcs01_tsc_offset = offset;
> +                /* recalculate vmcs02.TSC_OFFSET: */
> +                vmcs12 = get_vmcs12(vcpu);
> +                vmcs_write64(TSC_OFFSET, offset +
> +                        (nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
> +                         vmcs12->tsc_offset : 0));
> +        } else {
> +                vmcs_write64(TSC_OFFSET, offset);
> +        }
>  }
>
>  static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
> @@ -6514,8 +6522,11 @@ static void prepare_vmcs02(struct kvm_vc
>
>          set_cr4_guest_host_mask(vmx);
>
> -        vmcs_write64(TSC_OFFSET,
> -                vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> +        if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING)
> +                vmcs_write64(TSC_OFFSET,
> +                        vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> +        else
> +                vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
>
>          if (enable_vpid) {
>                  /*
> @@ -6922,7 +6933,7 @@ static void nested_vmx_vmexit(struct kvm
>
>          load_vmcs12_host_state(vcpu, vmcs12);
>
> -        /* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */
> +        /* Update TSC_OFFSET if TSC was changed while L2 ran */
>          vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
>
>          /* This is needed for same reason as it was needed in prepare_vmcs02 */
>
> Subject: [PATCH 2/2] nVMX: L1 TSC handling
>
> KVM assumed in several places that reading the TSC MSR returns the value
> for L1. This is incorrect, because when L2 is running, the correct TSC
> read exit emulation is to return L2's value.
>
> We therefore add a new x86_ops function, read_l1_tsc, to use in places
> that specifically need to read the L1 TSC, NOT the TSC of the current
> level of guest.
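To make the two notions concrete: with TSC offsetting active at both
levels, L1 observes host_tsc + vmcs01.TSC_OFFSET, while L2 observes
host_tsc + vmcs01.TSC_OFFSET + vmcs12.TSC_OFFSET. A toy user-space
illustration of just this arithmetic (my own sketch, not code from the
patch; all values are made up):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                uint64_t host_tsc   = 1000000; /* raw RDTSC on the host (L0)   */
                uint64_t vmcs01_off = 5000;    /* TSC_OFFSET L0 applies for L1 */
                uint64_t vmcs12_off = 300;     /* TSC_OFFSET L1 sets for L2    */

                uint64_t l1_tsc = host_tsc + vmcs01_off;
                uint64_t l2_tsc = host_tsc + vmcs01_off + vmcs12_off;

                /*
                 * While L2 runs, a trapped read of the TSC MSR must return
                 * l2_tsc, whereas the new read_l1_tsc() hook must keep
                 * returning l1_tsc.
                 */
                printf("L1 sees %llu, L2 sees %llu\n",
                       (unsigned long long)l1_tsc, (unsigned long long)l2_tsc);
                return 0;
        }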
>
> Signed-off-by: Nadav Har'El <nyh@xxxxxxxxxx>
> ---
>  arch/x86/include/asm/kvm_host.h |    2 ++
>  arch/x86/kvm/svm.c              |    9 +++++++++
>  arch/x86/kvm/vmx.c              |   17 +++++++++++++++++
>  arch/x86/kvm/x86.c              |    8 ++++----
>  4 files changed, 32 insertions(+), 4 deletions(-)
>
> --- .before/arch/x86/include/asm/kvm_host.h     2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/include/asm/kvm_host.h      2011-07-31 16:17:30.000000000 +0300
> @@ -636,6 +636,8 @@ struct kvm_x86_ops {
>                                 struct x86_instruction_info *info,
>                                 enum x86_intercept_stage stage);
>
> +        u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu);
> +
>          const struct trace_print_flags *exit_reasons_str;
>  };
>
> --- .before/arch/x86/kvm/vmx.c  2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c   2011-07-31 16:17:30.000000000 +0300
> @@ -1748,6 +1748,21 @@ static u64 guest_read_tsc(void)
>  }
>
>  /*
> + * Like guest_read_tsc, but always returns L1's notion of the timestamp
> + * counter, even if a nested guest (L2) is currently running.
> + */
> +static u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu)
> +{
> +        u64 host_tsc, tsc_offset;
> +
> +        rdtscll(host_tsc);
> +        tsc_offset = is_guest_mode(vcpu) ?
> +                to_vmx(vcpu)->nested.vmcs01_tsc_offset :
> +                vmcs_read64(TSC_OFFSET);
> +        return host_tsc + tsc_offset;
> +}
> +
> +/*
>   * Empty call-back. Needs to be implemented when VMX enables the SET_TSC_KHZ
>   * ioctl. In this case the call-back should update internal vmx state to make
>   * the changes effective.
> @@ -7070,6 +7085,8 @@ static struct kvm_x86_ops vmx_x86_ops =
>          .set_tdp_cr3 = vmx_set_cr3,
>
>          .check_intercept = vmx_check_intercept,
> +
> +        .read_l1_tsc = vmx_read_l1_tsc,
>  };
>
>  static int __init vmx_init(void)
> --- .before/arch/x86/kvm/svm.c  2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/kvm/svm.c   2011-07-31 16:17:30.000000000 +0300
> @@ -2894,6 +2894,13 @@ static int cr8_write_interception(struct
>          return 0;
>  }
>
> +static u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu)
> +{
> +        struct vmcb *vmcb = get_host_vmcb(to_svm(vcpu));
> +        return vmcb->control.tsc_offset +
> +                svm_scale_tsc(vcpu, native_read_tsc());
> +}
> +
>  static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
>  {
>          struct vcpu_svm *svm = to_svm(vcpu);
> @@ -4243,6 +4250,8 @@ static struct kvm_x86_ops svm_x86_ops =
>          .set_tdp_cr3 = set_tdp_cr3,
>
>          .check_intercept = svm_check_intercept,
> +
> +        .read_l1_tsc = svm_read_l1_tsc,
>  };
>
>  static int __init svm_init(void)
> --- .before/arch/x86/kvm/x86.c  2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/kvm/x86.c   2011-07-31 16:17:30.000000000 +0300
> @@ -1098,7 +1098,7 @@ static int kvm_guest_time_update(struct
>
>          /* Keep irq disabled to prevent changes to the clock */
>          local_irq_save(flags);
> -        kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
> +        tsc_timestamp = kvm_x86_ops->read_l1_tsc(v);
>          kernel_ns = get_kernel_ns();
>          this_tsc_khz = vcpu_tsc_khz(v);
>          if (unlikely(this_tsc_khz == 0)) {
> @@ -2215,7 +2215,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu
>                  s64 tsc_delta;
>                  u64 tsc;
>
> -                kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc);
> +                tsc = kvm_x86_ops->read_l1_tsc(vcpu);
>                  tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
>                          tsc - vcpu->arch.last_guest_tsc;

Technically, this call site should no longer exist; it should be re-factored
in terms of the hardware TSC, not the guest TSC. I sent a patch series which
did that, but it has not yet been applied.
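A rough sketch of what that re-factoring might look like - untested, and
with last_host_tsc as a hypothetical field that kvm_arch_vcpu_put() would
set from native_read_tsc(); it does not exist in this patch:

        /*
         * Untested sketch: compute the migration delta from raw hardware
         * TSC values, so that no guest TSC read is needed here at all.
         */
        static s64 tsc_delta_on_load(struct kvm_vcpu *vcpu)
        {
                u64 host_tsc = native_read_tsc();       /* raw hardware TSC */

                if (!vcpu->arch.last_host_tsc)          /* hypothetical field */
                        return 0;
                return host_tsc - vcpu->arch.last_host_tsc;
        }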
> @@ -2239,7 +2239,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *
>  {
>          kvm_x86_ops->vcpu_put(vcpu);
>          kvm_put_guest_fpu(vcpu);
> -        kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> +        vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
>  }
>
>  static int is_efer_nx(void)
> @@ -5722,7 +5722,7 @@ static int vcpu_enter_guest(struct kvm_v
>          if (hw_breakpoint_active())
>                  hw_breakpoint_restore();
>
> -        kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> +        vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
>
>          vcpu->mode = OUTSIDE_GUEST_MODE;
>          smp_wmb();
>
> --
> Nadav Har'El                        |      Sunday, Jul 31 2011, 29 Tammuz 5771
> nyh@xxxxxxxxxxxxxxxxxxx             |-----------------------------------------
> Phone +972-523-790466, ICQ 13349191 |Support bacteria - they're the only
> http://nadav.harel.org.il           |culture some people have!