On Wed, Oct 02, 2024, Markku Ahvenjärvi wrote:
> Hi Sean,
>
> > On Fri, Sep 20, 2024, Markku Ahvenjärvi wrote:
> > > Running certain hypervisors under KVM on VMX suffered L1 hangs after
> > > launching a nested guest. The external interrupts were not processed
> > > on vmlaunch/vmresume due to a stale VPPR, and the L2 guest would
> > > resume without allowing the L1 hypervisor to process the events.
> > >
> > > The patch ensures VPPR is updated when checking for pending
> > > interrupts.
> >
> > This is architecturally incorrect, PPR isn't refreshed at VM-Enter.
>
> I looked into this and found the following in the Intel manual:
>
> "30.1.3 PPR Virtualization
>
> The processor performs PPR virtualization in response to the following
> operations: (1) VM entry; (2) TPR virtualization; and (3) EOI
> virtualization.
>
> ..."
>
> The section "27.3.2.5 Updating Non-Register State" further explains VM
> entry:
>
> "If the “virtual-interrupt delivery” VM-execution control is 1, VM entry
> loads the values of RVI and SVI from the guest interrupt-status field in
> the VMCS (see Section 25.4.2). After doing so, the logical processor first
> causes PPR virtualization (Section 30.1.3) and then evaluates pending
> virtual interrupts (Section 30.2.1). If a virtual interrupt is recognized,
> it may be delivered in VMX non-root operation immediately after VM entry
> (including any specified event injection) completes; ..."
>
> According to that, PPR is supposed to be refreshed at VM-Enter, or am I
> missing something here?

Huh, I missed that.  It makes sense, I guess; VM-Enter processes pending
virtual interrupts, so it stands to reason that VM-Enter would refresh PPR
as well.
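To make the failure mode concrete, the two SDM operations boil down to
something like the below (a rough model of SDM 30.1.3 and 30.2.1, purely
for illustration; these helpers aren't KVM identifiers):

	/* SDM 30.1.3: VPPR := VTPR if VTPR[7:4] >= SVI[7:4], else SVI & 0xf0. */
	static u8 virtualize_ppr(u8 vtpr, u8 svi)
	{
		if ((vtpr >> 4) >= (svi >> 4))
			return vtpr;
		return svi & 0xf0;
	}

	/* SDM 30.2.1: a virtual interrupt is recognized iff RVI[7:4] > VPPR[7:4]. */
	static bool virtual_interrupt_recognized(u8 rvi, u8 vppr)
	{
		return (rvi >> 4) > (vppr >> 4);
	}

If VPPR is left stale at or above the pending vector's priority class, the
recognition check fails and L2 resumes with the virtual interrupt still
pending, which lines up with the hang described above.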
Ugh, and looking again, KVM refreshes PPR every time it checks for a pending
interrupt, including the VM-Enter case (via kvm_apic_has_interrupt()) when
nested posted interrupts are in use:

	/* Emulate processing of posted interrupts on VM-Enter. */
	if (nested_cpu_has_posted_intr(vmcs12) &&
	    kvm_apic_has_interrupt(vcpu) == vmx->nested.posted_intr_nv) {
		vmx->nested.pi_pending = true;
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		kvm_apic_clear_irr(vcpu, vmx->nested.posted_intr_nv);
	}

I'm still curious as to what's different about your setup, but certainly not
curious enough to hold up a fix.

Anyways, back to the code, I think we can and should shoot for a more
complete cleanup (on top of a minimal fix).  As Chao suggested[*], the above
nested posted interrupt code shouldn't exist, as KVM should handle nested
posted interrupts as part of vmx_check_nested_events(), which honors event
priority.  And I see a way, albeit a bit of an ugly one, to avoid regressing
performance when there's a pending nested posted interrupt at VM-Enter.

The other aspect of this code is that I don't think we need to limit the
check to APICv, i.e. KVM can simply check kvm_apic_has_interrupt() after
VM-Enter succeeds (the funky pre-check is necessary to read RVI from vmcs01,
with the event request deferred until KVM knows VM-Enter will be
successful).  Arguably, that's probably more correct, as PPR virtualization
should only occur if VM-Enter is successful (or at least gets past the
VM-Fail checks).

So, for an immediate fix, I _think_ we can do:

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index a8e7bc04d9bf..784b61c9810b 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3593,7 +3593,8 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 	 * effectively unblock various events, e.g. INIT/SIPI cause VM-Exit
 	 * unconditionally.
 	 */
-	if (unlikely(evaluate_pending_interrupts))
+	if (unlikely(evaluate_pending_interrupts) ||
+	    kvm_apic_has_interrupt(vcpu))
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
 
 	/*

and then eventually make nested_vmx_enter_non_root_mode() look like the
below.  Can you verify that the above fixes your setup?  If it does, I'll
put together a small series with that change and the cleanups I have in
mind.

Thanks much!

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index a8e7bc04d9bf..77f0695784d8 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3483,7 +3483,6 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 	enum vm_entry_failure_code entry_failure_code;
-	bool evaluate_pending_interrupts;
 	union vmx_exit_reason exit_reason = {
 		.basic = EXIT_REASON_INVALID_STATE,
 		.failed_vmentry = 1,
@@ -3502,13 +3501,6 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 
 	kvm_service_local_tlb_flush_requests(vcpu);
 
-	evaluate_pending_interrupts = exec_controls_get(vmx) &
-		(CPU_BASED_INTR_WINDOW_EXITING | CPU_BASED_NMI_WINDOW_EXITING);
-	if (likely(!evaluate_pending_interrupts) && kvm_vcpu_apicv_active(vcpu))
-		evaluate_pending_interrupts |= vmx_has_apicv_interrupt(vcpu);
-	if (!evaluate_pending_interrupts)
-		evaluate_pending_interrupts |= kvm_apic_has_pending_init_or_sipi(vcpu);
-
 	if (!vmx->nested.nested_run_pending ||
 	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS))
 		vmx->nested.pre_vmenter_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
@@ -3591,9 +3583,13 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 	 * Re-evaluate pending events if L1 had a pending IRQ/NMI/INIT/SIPI
 	 * when it executed VMLAUNCH/VMRESUME, as entering non-root mode can
 	 * effectively unblock various events, e.g. INIT/SIPI cause VM-Exit
-	 * unconditionally.
+	 * unconditionally.  Take care to pull data from vmcs01 as appropriate,
+	 * e.g. when checking for interrupt windows, as vmcs02 is now loaded.
 	 */
-	if (unlikely(evaluate_pending_interrupts))
+	if ((__exec_controls_get(&vmx->vmcs01) & (CPU_BASED_INTR_WINDOW_EXITING |
+						  CPU_BASED_NMI_WINDOW_EXITING)) ||
+	    kvm_apic_has_pending_init_or_sipi(vcpu) ||
+	    kvm_apic_has_interrupt(vcpu))
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
 
 	/*

[*] https://lore.kernel.org/all/Zp%2FC5IlwfzC5DCsl@chao-email
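P.S. For anyone unfamiliar with the controls-shadow helpers referenced in
the last hunk: the cached controls are per-VMCS, and by the time the event
request is made, vmcs02 is the loaded VMCS, which is why the check has to be
pinned to &vmx->vmcs01 to read L1's interrupt/NMI window exiting controls.
Roughly (heavily simplified from BUILD_CONTROLS_SHADOW in vmx.h, for
illustration only):

	/* Read the cached execution controls of a specific VMCS. */
	static inline u32 __exec_controls_get(struct loaded_vmcs *vmcs)
	{
		return vmcs->controls_shadow.exec;
	}

	/* Read whatever VMCS is loaded, i.e. vmcs02 after nested VM-Enter. */
	static inline u32 exec_controls_get(struct vcpu_vmx *vmx)
	{
		return __exec_controls_get(vmx->loaded_vmcs);
	}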