On Thu, Jan 16, 2025, Kai Huang wrote:
> On Mon, 2025-01-13 at 10:09 +0800, Binbin Wu wrote:
> > Lazy check for pending APIC EOI when In-kernel IOAPIC
> > -----------------------------------------------------
> > In-kernel IOAPIC does not receive EOI with AMD SVM AVIC since the processor
> > accelerates write to APIC EOI register and does not trap if the interrupt
> > is edge-triggered. So there is a workaround by lazy check for pending APIC
> > EOI at the time when setting new IOAPIC irq, and update IOAPIC EOI if no
> > pending APIC EOI.
> > KVM is also not able to intercept EOI for TDX guests.
> > - When APICv is enabled
> >   The code of lazy check for pending APIC EOI doesn't work for TDX because
> >   KVM can't get the status of real IRR and ISR, and the values are 0s in
> >   vIRR and vISR in apic->regs[], kvm_apic_pending_eoi() will always return
> >   false. So the RTC pending EOI will always be cleared when ioapic_set_irq()
> >   is called for RTC. Then userspace may miss the coalesced RTC interrupts.
> > - When APICv is disabled
> >   ioapic_lazy_update_eoi() will not be called, then pending EOI status for
> >   RTC will not be cleared after setting and this will mislead userspace to
> >   see coalesced RTC interrupts.
> > Options:
> > - Force irqchip split for TDX guests to eliminate the use of in-kernel IOAPIC.
> > - Leave it as it is, but the use of RTC may not be accurate.
>
> Looking at the code, it seems KVM only traps EOI for level-triggered interrupt
> for in-kernel IOAPIC chip, but IIUC IOAPIC in userspace also needs to be told
> upon EOI for level-triggered interrupt. I don't know how KVM works with
> userspace IOAPIC w/o trapping EOI for level-triggered interrupt, but "force
> irqchip split for TDX guest" seems not right.

Forcing a "split" IRQ chip is correct, in the sense that TDX doesn't support an
I/O APIC and the "split" model is the way to concoct such a setup.
With a "full" IRQ chip, KVM is responsible for emulating the I/O APIC, which is
more or less nonsensical on TDX because it's a fully virtual world, i.e. there's
no reason to emulate legacy devices that only know how to talk to the I/O APIC
(or PIC, etc.). Disallowing an in-kernel I/O APIC is ideal from KVM's
perspective, because level-triggered interrupts and thus the I/O APIC as a whole
can't be faithfully emulated (see below).

> I think the problem is level-triggered interrupt,

Yes, because the TDX Module doesn't allow the hypervisor to modify the
EOI-bitmap, i.e. all EOIs are accelerated and never trigger exits.

> so I think another option is to reject level-triggered interrupt for TDX guest.

This is a "don't do that, it will hurt" situation. With a sane VMM, the
level-ness of GSIs is controlled by the guest. For GSIs that are routed through
the I/O APIC, the level-ness is determined by the corresponding Redirection
Table entry. For "GSIs" that are actually MSIs (KVM piggybacks legacy GSI
routing to let userspace wire up MSIs), and for direct MSI injection
(KVM_SIGNAL_MSI), the level-ness is dictated by the MSI itself, which again is
guest controlled.

If the guest induces generation of a level-triggered interrupt, the VMM is left
with the choice of dropping the interrupt, sending it as-is, or converting it to
an edge-triggered interrupt. Ditto for KVM. All of those options will make the
guest unhappy.

So while it _might_ make debugging broken guests easier, I don't think it's
worth the complexity to try and prevent the VMM/guest from sending
level-triggered GSI-routed interrupts. It'd be a bit of a whack-a-mole and
there's no architectural behavior KVM can provide that's better than sending the
interrupt and hoping for the best.
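
To make the "guest controlled" point concrete, here is a rough standalone
sketch of where the trigger mode lives in each case. This is not KVM code: the
macro and helper names are made up for illustration. Only the bit positions
are taken from the architectural layouts (bit 15 of an I/O APIC Redirection
Table entry, and bit 15 of the x86 MSI data register, both with 1 = level).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only, not KVM code.  All names here are hypothetical; the bit
 * positions follow the I/O APIC Redirection Table Entry layout and the x86
 * MSI data register layout (bit 15 = trigger mode, 1 = level-triggered).
 */
#define IOAPIC_RTE_TRIGGER_LEVEL	(1ULL << 15)
#define MSI_DATA_TRIGGER_LEVEL		(1U << 15)

/* The guest programs the RTE, so the guest picks edge vs. level. */
static inline bool ioapic_rte_is_level(uint64_t rte)
{
	return (rte & IOAPIC_RTE_TRIGGER_LEVEL) != 0;
}

/* Same story for MSIs: the trigger mode is carried in the MSI data itself. */
static inline bool msi_data_is_level(uint32_t data)
{
	return (data & MSI_DATA_TRIGGER_LEVEL) != 0;
}
```

In both cases the bit is written by the guest, so by the time the VMM or KVM
sees it, the only choices left are the ones listed above: drop, send as-is, or
convert to edge.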