+Suravee On Mon, Sep 04, 2023, Tao Su wrote: > When IPI virtualization is enabled, a WARN is triggered if bit12 of ICR > MSR is set after APIC-write VM-exit. The reason is kvm_apic_send_ipi() > thinks the APIC_ICR_BUSY bit should be cleared because KVM has no delay, > but kvm_apic_write_nodecode() doesn't clear the APIC_ICR_BUSY bit. > > Since bit12 of ICR is no longer BUSY bit but UNUSED bit in x2APIC mode, > and SDM has no detail about how hardware will handle the UNUSED bit12 > set, we tested on Intel CPU (SRF/GNR) with IPI virtualization and found > the UNUSED bit12 was also cleared by hardware without #GP. Therefore, > the clearing of bit12 should be still kept being consistent with the > hardware behavior. I'm confused. If hardware clears the bit, then why is it set in the vAPIC page after a trap-like APIC-write VM-Exit? In other words, how is this not a ucode or hardware bug? Suravee, can you confirm what happens on AMD with x2AVIC? Does hardware *always* clear the busy bit if it's set by the guest? If so, then we could "optimize" avic_incomplete_ipi_interception() to skip the busy check, e.g. diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c index cfc8ab773025..4bf0bb250147 100644 --- a/arch/x86/kvm/svm/avic.c +++ b/arch/x86/kvm/svm/avic.c @@ -513,7 +513,7 @@ int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu) * in which case KVM needs to emulate the ICR write as well in * order to clear the BUSY flag. */ - if (icrl & APIC_ICR_BUSY) + if (!apic_x2apic_mode(apic) && (icrl & APIC_ICR_BUSY)) kvm_apic_write_nodecode(vcpu, APIC_ICR); else kvm_apic_send_ipi(apic, icrl, icrh); > Fixes: 5413bcba7ed5 ("KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode") > Signed-off-by: Tao Su <tao1.su@xxxxxxxxxxxxxxx> > Tested-by: Yi Lai <yi1.lai@xxxxxxxxx> > --- > arch/x86/kvm/lapic.c | 27 ++++++++++++++++++++------- > 1 file changed, 20 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c > index a983a16163b1..09a376aeb4a0 100644 > --- a/arch/x86/kvm/lapic.c > +++ b/arch/x86/kvm/lapic.c > @@ -1482,8 +1482,17 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high) > { > struct kvm_lapic_irq irq; > > - /* KVM has no delay and should always clear the BUSY/PENDING flag. */ > - WARN_ON_ONCE(icr_low & APIC_ICR_BUSY); > + /* > + * In non-x2apic mode, KVM has no delay and should always clear the > + * BUSY/PENDING flag. In x2apic mode, KVM should clear the unused bit12 > + * of ICR since hardware will also clear this bit. Although > + * APIC_ICR_BUSY and X2APIC_ICR_UNUSED_12 are same, they mean different > + * things in different modes. > + */ > + if (!apic_x2apic_mode(apic)) > + WARN_ON_ONCE(icr_low & APIC_ICR_BUSY); > + else > + WARN_ON_ONCE(icr_low & X2APIC_ICR_UNUSED_12); NAK to the new name, KVM is absolutely not going to zero an arbitrary "unused" bit. If Intel wants to reclaim bit 12 for something useful in the future, then Intel can ship CPUs that don't touch the "reserved" bit, and deal with all the fun of finding and updating all software that unnecessarily sets the busy bit in x2apic mode. If we really want to pretend that Intel has more than a snowball's chance in hell of doing something useful with bit 12, then the right thing to do in KVM is to ignore the bit entirely and let the guest keep the pieces, e.g. diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 113ca9661ab2..36ec195a3339 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1473,8 +1473,13 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high) { struct kvm_lapic_irq irq; - /* KVM has no delay and should always clear the BUSY/PENDING flag. */ - WARN_ON_ONCE(icr_low & APIC_ICR_BUSY); + /* + * KVM has no delay and should always clear the BUSY/PENDING flag. + * The flag doesn't exist in x2APIC mode; both the SDM and APM state + * that the flag "Must Be Zero", but neither Intel nor AMD enforces + * that (or any other reserved bits in ICR). + */ + WARN_ON_ONCE(!apic_x2apic_mode(apic) && (icr_low & APIC_ICR_BUSY)); irq.vector = icr_low & APIC_VECTOR_MASK; irq.delivery_mode = icr_low & APIC_MODE_MASK; @@ -3113,8 +3118,6 @@ int kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr) int kvm_x2apic_icr_write(struct kvm_lapic *apic, u64 data) { - data &= ~APIC_ICR_BUSY; - kvm_apic_send_ipi(apic, (u32)data, (u32)(data >> 32)); kvm_lapic_set_reg64(apic, APIC_ICR, data); trace_kvm_apic_write(APIC_ICR, data); diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c index cfc8ab773025..4bf0bb250147 100644 --- a/arch/x86/kvm/svm/avic.c +++ b/arch/x86/kvm/svm/avic.c @@ -513,7 +513,7 @@ int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu) * in which case KVM needs to emulate the ICR write as well in * order to clear the BUSY flag. */ - if (icrl & APIC_ICR_BUSY) + if (!apic_x2apic_mode(apic) && (icrl & APIC_ICR_BUSY)) kvm_apic_write_nodecode(vcpu, APIC_ICR); else kvm_apic_send_ipi(apic, icrl, icrh);