On Wed, Jul 20, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2": > > > No, both patches are wrong. > > > > > The correct fix is to make kvm_get_msr() return the L1 guest TSC at all > > times. > > > We are serving the L1 guest in this hypervisor, not the L2 guest, and so > > > > > should never read its TSC for any reason. I've been thinking about how to correctly solve this problem, and about this statement that you made, that get_read_tsc() should *always* return L1's TSC, even while running L2. We have two quite-different uses for the TSC-related functions: 1. They are used in vmx.c to emulate *architectural* instructions, namely RDTSC, RDMSR of TSC, and WRMSR of TSC. 2. They are used in x86.c for internal KVM uses, adjusting the TSC for various reasons (most of which, I have to admit, I don't understand). You said above that kvm_get_msr(), calling guest_read_tsc(), should always return L1's TSC, even when running L2. However, I believe that that for use #1 - the *architecturally correct* emulation of RDTSC and RDMSR when L2 is running, the current nested-agnostic code - which ends up returning L2's TSC when L2 is running - is the correct thing to do. Look at what happens when: When we get an exit to L0 when L2 run RDMSR TSC: This may exit to L1, if L1 asked in the MSR Bitmap. Done correctly. If hasn't exited, the spec says we need to return L1's TSC plus the TSC_OFFSET that L1 set for L2 (if L1 asked for TSC_OFFSETTING). In other words, we should return L0's TSC offsetted by the sum of vmcs01 and vmcs12 tsc_offsets. This sum is exactly what we have in the current (vmcs02) TSC_OFFSET. So the existing guest_read_tsc, which returns rdtscll() + vmcs_read64(TSC_OFFSET), does the correct thing - also for nested. When we get an exit to L0 when L2 runs RDTSC: This may exit to L1 if L1 asked for TSC_EXITING. Done correctly. There is no "else" here - we would get an EXIT_REASON_RDTSC only if L1 asked for it (because KVM itself doesn't use TSC_EXITING). Regarding emulation of WRMSR TSC, I actually had an error in the code which I correct in a patch below: According to the Intel spec, it turns out that if WRMSR of the TSC does not cause an exit, then it should change the host TSC itself, ignoring TSC_OFFSET. This is very strange, because it means that if the host doesn't trap MSR read and write of the TSC, then if a guest writes TSC and then immediately reads it, it won't read what is just wrote, but rather what it wrote pluss the TSC_OFFSET (because the TSC_OFFSET is added on read, but not subtracted on write). I don't know why the Intel spec specifies this, but it does - and it's not what I emulated to I fixed that now (in the patch below). But I doubt this is the problem that Bandan and others saw? So now, if I'm right, after this patch, the *architectural* uses of guest_read_tsc() and vmx_write_tsc_offset() to emulate RDTSR and RD/WRMSR, are done correctly. But the problem is that these functions are also used in KVM for other things. Ideally, whomever calls kvm_read_msr()/kvm_write_msr() should realize that it gets the value for the current running guest, which might be L1 or L2, just like if that guest would read or write the MSR. Ideally, KVM might read the MSR, change it and write it back, and everything would be correct as both read and write code are done correctly. The question is, is some of the code - perhaps in kvm_guest_time_update(), kvm_arch_vcpu_load(), kvm_arch_vcpu_put() or vcpu_enter_guest() (these are functions which touch the TSC), which only work correctly when reading/writing L1 TSC and can't work correctly if L2 TSC is read and written? I'm afraid I don't know enough about what these functions are doing, to understand what, if anything, they are doing wrong. Bandan and others who can reproduce this bug - I attach below a patch which should correct some architectural emulation issues I discovered while reading the code again now - especially of TSC MSR write. Can you please try if this solves anything? I'm not too optimistic, but since the code was wrong in this regard, it's at least worth a try. Thanks, Nadav. Subject: [PATCH] nVMX: Fix nested TSC handling This patch fixes several areas where nested VMX did not correctly emulated TSC handling. Signed-off-by: Nadav Har'El <nyh@xxxxxxxxxx> --- arch/x86/kvm/vmx.c | 31 +++++++++++++++++++++---------- 1 file changed, 21 insertions(+), 10 deletions(-) --- .before/arch/x86/kvm/vmx.c 2011-07-28 14:02:52.000000000 +0300 +++ .after/arch/x86/kvm/vmx.c 2011-07-28 14:02:52.000000000 +0300 @@ -1762,15 +1762,23 @@ static void vmx_set_tsc_khz(struct kvm_v */ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset) { - vmcs_write64(TSC_OFFSET, offset); - if (is_guest_mode(vcpu)) + if (is_guest_mode(vcpu)) { /* - * We're here if L1 chose not to trap the TSC MSR. Since - * prepare_vmcs12() does not copy tsc_offset, we need to also - * set the vmcs12 field here. + * We're here if L1 chose not to trap WRMSR to TSC. According + * to the spec, this should set L1's TSC; The offset that L1 + * set for L2 remains unchanged, and still needs to be added + * to the newly set TSC to get L2's TSC. */ - get_vmcs12(vcpu)->tsc_offset = offset - - to_vmx(vcpu)->nested.vmcs01_tsc_offset; + struct vmcs12 *vmcs12; + to_vmx(vcpu)->nested.vmcs01_tsc_offset = offset; + /* recalculate vmcs02.TSC_OFFSET: */ + vmcs12 = get_vmcs12(vcpu); + vmcs_write64(TSC_OFFSET, offset + + (nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ? + vmcs12->tsc_offset : 0)); + } else { + vmcs_write64(TSC_OFFSET, offset); + } } static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment) @@ -6514,8 +6522,11 @@ static void prepare_vmcs02(struct kvm_vc set_cr4_guest_host_mask(vmx); - vmcs_write64(TSC_OFFSET, - vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset); + if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING) + vmcs_write64(TSC_OFFSET, + vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset); + else + vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset); if (enable_vpid) { /* @@ -6922,7 +6933,7 @@ static void nested_vmx_vmexit(struct kvm load_vmcs12_host_state(vcpu, vmcs12); - /* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */ + /* Update TSC_OFFSET if TSC was changed while L2 ran */ vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset); /* This is needed for same reason as it was needed in prepare_vmcs02 */ -- Nadav Har'El | Thursday, Jul 28 2011, 26 Tammuz 5771 nyh@xxxxxxxxxxxxxxxxxxx |----------------------------------------- Phone +972-523-790466, ICQ 13349191 |Just remember that if the world didn't http://nadav.harel.org.il |suck, we would all fall off. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html