Failing VMRUNs (immediate #VMEXIT with error code VMEXIT_INVALID) for KVM
guests on top of Hyper-V are observed when KVM does SMM emulation. The root
cause of the problem appears to be an overly strict CR3 VMCB check done by
Hyper-V. Here's an example of a CR state which triggers the failure:

  kvm_amd: vmpl: 0 cpl: 0 efer: 0000000000001000
  kvm_amd: cr0: 0000000000050032 cr2: ffff92dcf8601000
  kvm_amd: cr3: 0000000100232003 cr4: 0000000000000040

The CR3 value may look a bit weird as it has non-zero PCID bits set as well
as non-zero bits in the upper half while the processor is not in long mode.
This, however, is a valid state upon entering SMM from a long mode context
with PCID enabled and should not be causing VMEXIT_INVALID. APM says that
VMEXIT_INVALID is triggered when "Any MBZ bit of CR3 is set.". In the CR3
format the only MBZ bits are those above MAXPHYADDR, the rest are just
"Reserved".

Place a temporary workaround in KVM to avoid putting problematic CR3 values
into the VMCB when KVM runs on top of Hyper-V. Enable CR3 READ/WRITE
intercepts to make sure the guest is not observing side-effects of the
mangling. Also, do not overwrite 'vcpu->arch.cr3' with the mangled
'save.cr3' value when CR3 intercepts are enabled (and thus a possible CR3
update from the guest would change 'vcpu->arch.cr3' instantly).

The workaround is only needed until Hyper-V gets fixed.

Reported-by: Daan De Meyer <daan.j.demeyer@xxxxxxxxx>
Signed-off-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
---
- The patch serves mostly documentation purposes, I don't expect it to be
  merged to the mainline. Hyper-V *is* supposed to get fixed but the
  timeline is unclear at this point. As Azure is a fairly popular platform
  for running nested KVM, it is possible that the bug will get discovered
  again (running an OVMF based guest is a good starting point!).
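- To make the CR3 mangling described above easier to check by hand, here is
  a minimal userspace sketch of the masking step only. It is not the kernel
  code: 'bit_range()', 'sanitize_cr3()' and 'CR3_PCID_MASK' are hypothetical
  names chosen for illustration, and the long mode, PCIDE and MAXPHYADDR
  inputs are passed in explicitly instead of being read from vCPU state.
  Intercept handling is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name: the low 12 bits of CR3 carry the PCID. */
#define CR3_PCID_MASK 0xFFFULL

/* Mask with bits s..e (inclusive) set; returns 0 for an empty range. */
static uint64_t bit_range(unsigned int s, unsigned int e)
{
	if (e < s)
		return 0;
	return ((2ULL << (e - s)) - 1) << s;
}

/*
 * Illustrative only: clear the CR3 bits that are merely 'reserved' in the
 * current mode, so that an overly strict MBZ check would not reject them.
 */
static uint64_t sanitize_cr3(uint64_t cr3, bool long_mode, bool pcide,
			     unsigned int maxphyaddr)
{
	/* Assumption: physical address width is between 32 and 52 bits. */
	assert(maxphyaddr >= 32 && maxphyaddr <= 52);

	/* Outside long mode, bits 32..MAXPHYADDR-1 are not meaningful. */
	if (!long_mode)
		cr3 &= ~bit_range(32, maxphyaddr - 1);

	/* Without CR4.PCIDE, the PCID field is not meaningful. */
	if (!pcide)
		cr3 &= ~CR3_PCID_MASK;

	return cr3;
}
```

  With the CR state from the log above (EFER.LMA and CR4.PCIDE both clear)
  and an assumed MAXPHYADDR of 48, sanitize_cr3(0x0000000100232003, false,
  false, 48) gives 0x232000, while the same value passes through unchanged
  when both long mode and PCIDE are set.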
---
 arch/x86/kvm/svm/svm.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 272d5ed37ce7..6ff7cbcb5cac 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -41,6 +41,7 @@
 #include <asm/traps.h>
 #include <asm/reboot.h>
 #include <asm/fpu/api.h>
+#include <asm/hypervisor.h>
 
 #include <trace/events/ipi.h>
 
@@ -3497,7 +3498,7 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 	if (!sev_es_guest(vcpu->kvm)) {
 		if (!svm_is_intercept(svm, INTERCEPT_CR0_WRITE))
 			vcpu->arch.cr0 = svm->vmcb->save.cr0;
-		if (npt_enabled)
+		if (npt_enabled && !svm_is_intercept(svm, INTERCEPT_CR3_WRITE))
 			vcpu->arch.cr3 = svm->vmcb->save.cr3;
 	}
 
@@ -4264,6 +4265,33 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 		cr3 = root_hpa;
 	}
 
+#if IS_ENABLED(CONFIG_HYPERV)
+	/*
+	 * Workaround an issue in Hyper-V hypervisor where 'reserved' bits are treated
+	 * as MBZ failing VMRUN.
+	 */
+	if (hypervisor_is_type(X86_HYPER_MS_HYPERV) && likely(npt_enabled)) {
+		unsigned long cr3_unmod = cr3;
+
+		/*
+		 * Bits MAXPHYADDR:63 are MBZ but bits 32:MAXPHYADDR-1 are just 'reserved'
+		 * in !long mode.
+		 */
+		if (!is_long_mode(vcpu))
+			cr3 &= ~rsvd_bits(32, cpuid_maxphyaddr(vcpu) - 1);
+
+		if (!kvm_is_cr4_bit_set(vcpu, X86_CR4_PCIDE))
+			cr3 &= ~X86_CR3_PCID_MASK;
+
+		if (cr3 != cr3_unmod && !svm_is_intercept(svm, INTERCEPT_CR3_READ)) {
+			svm_set_intercept(svm, INTERCEPT_CR3_READ);
+			svm_set_intercept(svm, INTERCEPT_CR3_WRITE);
+		} else if (cr3 == cr3_unmod && svm_is_intercept(svm, INTERCEPT_CR3_READ)) {
+			svm_clr_intercept(svm, INTERCEPT_CR3_READ);
+			svm_clr_intercept(svm, INTERCEPT_CR3_WRITE);
+		}
+	}
+#endif
 	svm->vmcb->save.cr3 = cr3;
 	vmcb_mark_dirty(svm->vmcb, VMCB_CR);
 }
-- 
2.44.0