Re: [PATCH 2/2] KVM: x86: Clear X2APIC_ICR_UNUSED_12 after APIC-write VM-exit

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Sep 06, 2023, Tao Su wrote:
> On Tue, Sep 05, 2023 at 04:03:36PM -0700, Sean Christopherson wrote:
> > +Suravee
> > 
> > On Mon, Sep 04, 2023, Tao Su wrote:
> > > When IPI virtualization is enabled, a WARN is triggered if bit12 of ICR
> > > MSR is set after APIC-write VM-exit. The reason is kvm_apic_send_ipi()
> > > thinks the APIC_ICR_BUSY bit should be cleared because KVM has no delay,
> > > but kvm_apic_write_nodecode() doesn't clear the APIC_ICR_BUSY bit.
> > > 
> > > Since bit12 of ICR is no longer BUSY bit but UNUSED bit in x2APIC mode,
> > > and SDM has no detail about how hardware will handle the UNUSED bit12
> > > set, we tested on Intel CPU (SRF/GNR) with IPI virtualization and found
> > > the UNUSED bit12 was also cleared by hardware without #GP. Therefore,
> > > the clearing of bit12 should be still kept being consistent with the
> > > hardware behavior.
> > 
> > I'm confused.  If hardware clears the bit, then why is it set in the vAPIC page
> > after a trap-like APIC-write VM-Exit?  In other words, how is this not a ucode
> > or hardware bug?
> 
> Sorry, I didn't describe it clearly.
> 
> On bare-metal, bit12 of ICR MSR will be cleared after setting this bit.
> 
> If bit12 is set in guest, the bit is not cleared in the vAPIC page after APIC-write
> VM-Exit. So whether to clear bit12 in vAPIC page needs to be considered.

I got that, the behavior just seems odd to me.  And I'm grumpy that Intel punted
the problem to software.  But the SDM specifically calls out that this is the
correct behavior :-/

Specifically, in the context of IPI virtualization:

  If ECX contains 830H, a general-protection fault occurs if any of bits 31:20,
  17:16, or 13 of EAX is non-zero.

and

  If ECX contains 830H, the processor then checks the value of VICR to determine
  whether the following are all true:

  Bits 19:18 (destination shorthand) are 00B (no shorthand).
  Bit 15 (trigger mode) is 0 (edge).
  Bit 12 (unused) is 0.
  Bit 11 (destination mode) is 0 (physical).
  Bits 10:8 (delivery mode) are 000B (fixed).
  
  If all of the items above are true, the processor performs IPI virtualization
  using the 8-bit vector in byte 0 of VICR and the 32-bit APIC ID in VICR[63:32]
  (see Section 30.1.6). Otherwise, the logical processor causes an APIC-write VM
  exit (see Section 30.4.3.3).  If ECX contains 830H, the processor then checks
  the value of VICR to determine whether the following are all true:

I.e. the "unused" busy bit must be zero.  The part that makes me grumpy is that
hardware does check that other reserved bits are actually zero:

  If special processing applies, no general-protection exception is produced due
  to the fact that the local APIC is in xAPIC mode. However, WRMSR does perform
  the normal reserved-bit checking:
   - If ECX contains 808H or 83FH, a general-protection fault occurs if either EDX or EAX[31:8] is non-zero.
   - If ECX contains 80BH, a general-protection fault occurs if either EDX or EAX is non-zero.
   - If ECX contains 830H, a general-protection fault occurs if any of bits 31:20, 17:16, or 13 of EAX is non-zero.

Which implies that the hardware *does* enforce all the other reserved bits, but
punted bit 12 to the hypervisor :-(

That said, I think we have an "out".  Under the x2APIC section, regarding ICR,
the SDM also says:

  It remains readable only to aid in debugging; however, software should not
  assume the value returned by reading the ICR is the last written value.

I.e. KVM basically has free reign to do whatever it wants, so long as it doesn't
confuse userspace or break KVM's ABI.  As much as I want to say "do nothing",
clearing bit 12 so that it reads back as '0' is the safer approach.  Just please
don't add a new #define for, it's far easier to understand what's going on if we
just use APIC_ICR_BUSY, especially given that I highly doubt the bit will actually
be repurposed for something new.

FWIW, I also suspect that hardware isn't clearing the busy bit per se, I suspect
that hardware simply reads the bit as zero.

Side topic, unless I'm blind, KVM is missing the reserved bits #GP checks for ICR
bits bits 31:20, 17:16, and 13.



[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux