On Thu, Aug 11, 2022, Paolo Bonzini wrote:
> kvm_vcpu_check_block() is called while not in TASK_RUNNING, and therefore
> it cannot sleep.  Writing to guest memory is therefore forbidden, but it
> can happen on AMD processors if kvm_check_nested_events() causes a vmexit.
>
> Fortunately, all events that are caught by kvm_check_nested_events() are
> also recognized by kvm_vcpu_has_events() through vendor callbacks such as
> kvm_x86_interrupt_allowed() or kvm_x86_ops.nested_ops->has_events(), so
> remove the call and postpone the actual processing to vcpu_block().
>
> Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> ---
>  arch/x86/kvm/x86.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5e9358ea112b..9226fd536783 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10639,6 +10639,17 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
>  			return 1;
>  	}
>
> +	if (is_guest_mode(vcpu)) {
> +		/*
> +		 * Evaluate nested events before exiting the halted state.
> +		 * This allows the halt state to be recorded properly in
> +		 * the VMCS12's activity state field (AMD does not have
> +		 * a similar field and a vmexit always causes a spurious
> +		 * wakeup from HLT).
> +		 */
> +		kvm_check_nested_events(vcpu);
> +	}
> +
>  	if (kvm_apic_accept_events(vcpu) < 0)
>  		return 0;

Oof, this ends up yielding a really confusing code sequence.
kvm_apic_accept_events() has its own kvm_check_nested_events(), but also
has code to snapshot pending INITs/SIPIs _before_ that call.  Unpacked,
KVM ends up with:

	if (is_guest_mode(vcpu))
		kvm_check_nested_events(vcpu);

	/*
	 * Read pending events before calling the check_events
	 * callback.
	 */
	pe = smp_load_acquire(&apic->pending_events);
	if (!pe)
		return 0;

	if (is_guest_mode(vcpu)) {
		r = kvm_check_nested_events(vcpu);
		if (r < 0)
			return r == -EBUSY ? 0 : r;
	}

	if (kvm_vcpu_latch_init(vcpu)) {
		WARN_ON_ONCE(vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED);
		if (test_bit(KVM_APIC_SIPI, &pe))
			clear_bit(KVM_APIC_SIPI, &apic->pending_events);
		return 0;
	}

	if (test_bit(KVM_APIC_INIT, &pe)) {
		clear_bit(KVM_APIC_INIT, &apic->pending_events);
		kvm_vcpu_reset(vcpu, true);
		if (kvm_vcpu_is_bsp(apic->vcpu))
			vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
		else
			vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
	}
	if (test_bit(KVM_APIC_SIPI, &pe)) {
		clear_bit(KVM_APIC_SIPI, &apic->pending_events);
		if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
			/* evaluate pending_events before reading the vector */
			smp_rmb();
			sipi_vector = apic->sipi_vector;
			static_call(kvm_x86_vcpu_deliver_sipi_vector)(vcpu,
								      sipi_vector);
			vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
		}
	}

which on the surface makes this code look broken, e.g. if
kvm_check_nested_events() _needs_ to be called after the pending_events
snapshot is taken, why is it safe to add a kvm_check_nested_events()
call immediately before the snapshot?

In reality, it's just a bunch of noise because the pending events
snapshot is completely unnecessary, and it subtly relies on INIT+SIPI
being blocked after VM-Exit on VMX (and SVM, but it's more important
for VMX).

In particular, testing "pe" after VM-Exit is nonsensical.  On VMX,
events are consumed if they trigger VM-Exit, i.e. processing INIT/SIPI
is flat out wrong if the INIT/SIPI was the direct cause of VM-Exit.  On
SVM, events are left pending, so any pending INIT/SIPI will still be
there.

The VMX code works because kvm_vcpu_latch_init(), a.k.a. "is INIT
blocked", is always true after VM-Exit since INIT is always blocked in
VMX root mode.  Ditto for the conditional clearing of SIPI; the CPU
can't be in wait-for-SIPI immediately after VM-Exit, and so dropping
SIPIs ends up being architecturally ok.
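For reference, kvm_vcpu_latch_init() is a thin wrapper around the vendor
hook (quoting both from memory, so double-check me against x86.h and
vmx.c):

	static inline bool kvm_vcpu_latch_init(struct kvm_vcpu *vcpu)
	{
		return is_smm(vcpu) ||
		       static_call(kvm_x86_apic_init_signal_blocked)(vcpu);
	}

and on VMX the hook is simply:

	static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
	{
		return to_vmx(vcpu)->nested.vmxon;
	}

i.e. it returns true the entire time the vCPU is post-VMXON, which
trivially covers VMX root mode immediately after a nested VM-Exit.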
"is INIT blocked", is always true after VM-Exit since INIT is always blocked in VMX root mode. Ditto for the conditional clearing of SIPI; the CPU can't be in wait-for-SIPI immediately after VM-Exit and so dropping SIPIs ends up being architecturally ok. I'll add a patch to drop the snapshot code, assuming I'm not missing something even more subtle...