On 1/16/2025 10:50 PM, Sean Christopherson wrote:
On Thu, Jan 16, 2025, Kai Huang wrote:
On Mon, 2025-01-13 at 10:09 +0800, Binbin Wu wrote:
Lazy check for pending APIC EOI with in-kernel IOAPIC
-----------------------------------------------------
The in-kernel IOAPIC does not receive EOIs with AMD SVM AVIC because the
processor accelerates writes to the APIC EOI register and does not trap if the
interrupt is edge-triggered. As a workaround, KVM lazily checks for a pending
APIC EOI when a new IOAPIC irq is set, and updates the IOAPIC EOI state if no
APIC EOI is pending.
KVM is also not able to intercept EOI for TDX guests.
- When APICv is enabled
The lazy check for pending APIC EOI doesn't work for TDX because KVM can't
read the real IRR and ISR, and vIRR and vISR in apic->regs[] are all 0s, so
kvm_apic_pending_eoi() will always return false. As a result, the RTC pending
EOI will always be cleared when ioapic_set_irq() is called for RTC, and
userspace may miss coalesced RTC interrupts.
- When APICv is disabled
ioapic_lazy_update_eoi() will not be called, so the pending EOI status for
RTC will not be cleared after being set, which will mislead userspace into
seeing coalesced RTC interrupts.
Options:
- Force irqchip split for TDX guests to eliminate the use of in-kernel IOAPIC.
- Leave it as it is, but the use of RTC may not be accurate.
Looking at the code, it seems KVM only traps EOI for level-triggered interrupt
for in-kernel IOAPIC chip, but IIUC IOAPIC in userspace also needs to be told
upon EOI for level-triggered interrupt. I don't know how KVM works with
userspace IOAPIC w/o trapping EOI for level-triggered interrupt, but "force
irqchip split for TDX guest" seems not right.
Forcing a "split" IRQ chip is correct, in the sense that TDX doesn't support an
I/O APIC and the "split" model is the way to concoct such a setup. With a "full"
IRQ chip, KVM is responsible for emulating the I/O APIC, which is more or less
nonsensical on TDX because it's a fully virtual world, i.e. there's no reason to
emulate legacy devices that only know how to talk to the I/O APIC (or PIC, etc.).
Disallowing an in-kernel I/O APIC is ideal from KVM's perspective, because
level-triggered interrupts and thus the I/O APIC as a whole can't be faithfully
emulated (see below).
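
For reference, a VMM selects the "split" model through KVM_ENABLE_CAP with
KVM_CAP_SPLIT_IRQCHIP before creating any vCPUs. The following is a minimal
sketch, assuming a valid VM file descriptor; the helper name
enable_split_irqchip() and the pin count passed by the caller are illustrative
only and are not taken from any patch in this thread.

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper: ask KVM for a split irqchip on vm_fd.
 * args[0] is the number of GSI routes reserved for the userspace IOAPIC.
 * Returns 0 on success, -1 on error (errno is set by ioctl()).
 */
static int enable_split_irqchip(int vm_fd, unsigned int ioapic_pins)
{
	struct kvm_enable_cap cap;

	memset(&cap, 0, sizeof(cap));
	cap.cap = KVM_CAP_SPLIT_IRQCHIP;
	cap.args[0] = ioapic_pins;

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}
```

A VMM would call something like enable_split_irqchip(vm_fd, 24) instead of
KVM_CREATE_IRQCHIP, so that the local APIC stays in the kernel while the
I/O APIC, if any, lives in userspace.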
I think the problem is level-triggered interrupt,
Yes, because the TDX Module doesn't allow the hypervisor to modify the EOI-bitmap,
i.e. all EOIs are accelerated and never trigger exits.
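
As background, with virtual-interrupt delivery an EOI write only causes a VM
exit when the bit for the vector being EOI'd is set in the EOI-exit bitmap.
The toy model below uses hypothetical names and is neither KVM nor TDX Module
code; it only sketches why an all-zero bitmap that the hypervisor cannot
change means no EOI ever reaches KVM.

```c
#include <stdbool.h>
#include <stdint.h>

/* 256-bit EOI-exit bitmap, modeled as four 64-bit words. */
struct toy_eoi_exit_bitmap {
	uint64_t word[4];
};

/* What a hypervisor would do for a level-triggered vector on plain VMX. */
static void toy_request_eoi_exit(struct toy_eoi_exit_bitmap *bm, uint8_t vector)
{
	bm->word[vector / 64] |= 1ull << (vector % 64);
}

/* True if an EOI for this vector would exit to the hypervisor. */
static bool toy_eoi_causes_exit(const struct toy_eoi_exit_bitmap *bm,
				uint8_t vector)
{
	return (bm->word[vector / 64] >> (vector % 64)) & 1;
}
```

In this model a TDX guest corresponds to a bitmap on which
toy_request_eoi_exit() can never be called, so toy_eoi_causes_exit() is false
for every vector.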
Yes, and I think a description of this, and of level-triggered interrupts,
should be added to the commit message of one of the patches.
so I think another option is to reject level-triggered interrupt for TDX guest.
This is a "don't do that, it will hurt" situation. With a sane VMM, the level-ness
of GSIs is controlled by the guest. For GSIs that are routed through the I/O APIC,
the level-ness is determined by the corresponding Redirection Table entry. For
"GSIs" that are actually MSIs (KVM piggybacks legacy GSI routing to let userspace
wire up MSIs), and for direct MSIs injection (KVM_SIGNAL_MSI), the level-ness is
dictated by the MSI itself, which again is guest controlled.
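
To make "guest controlled" concrete: in both cases the trigger mode is a
single bit that the guest programs. Below is a small sketch of extracting it,
assuming the architectural layouts (bit 15 of an I/O APIC Redirection Table
entry, and bit 15 of the MSI data value). The helper names are made up for
illustration and do not exist in KVM.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_TRIGGER_MODE_BIT	15	/* 0 = edge, 1 = level */

/* Trigger mode as programmed by the guest in a 64-bit Redirection Table entry. */
static bool toy_rte_is_level(uint64_t rte)
{
	return (rte >> TOY_TRIGGER_MODE_BIT) & 1;
}

/* Trigger mode as encoded by the guest in the 32-bit MSI data value. */
static bool toy_msi_is_level(uint32_t msi_data)
{
	return (msi_data >> TOY_TRIGGER_MODE_BIT) & 1;
}
```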
If the guest induces generation of a level-triggered interrupt, the VMM is left
with the choice of dropping the interrupt, sending it as-is, or converting it to
an edge-triggered interrupt. Ditto for KVM. All of those options will make the
guest unhappy.
So while it _might_ make debugging broken guests easier, I don't think it's worth
the complexity to try and prevent the VMM/guest from sending level-triggered
GSI-routed interrupts. It'd be a bit of a whack-a-mole and there's no architectural
behavior KVM can provide that's better than sending the interrupt and hoping for
the best.
Currently, KVM doesn't do anything special for TDX guests if the guest sends
level-triggered interrupts.
QEMU has a patch to set eoi_intercept_unsupported to true for TDX guests.
https://lore.kernel.org/kvm/20241105062408.3533704-41-xiaoyao.li@xxxxxxxxx/
And it seems that the level_trigger_unsupported info will be passed to the
guest via an ACPI table. I didn't dig deep into it; I suppose that with this
information, guests will not send level-triggered GSI interrupts?