Hi Paolo, Marc, Eric and others, Did you have a chance to take a look at this v3 patchset? Just to remind the context: this patchset implements a solution I proposed in [1], only that it doesn't introduce the additional "pending oneshot" state (which was Marc's concern in [2]): I realized that this extra state is not really needed, as we can just unconditionally notify the resamplefd once again at unmask, with the same end result. [1] https://lore.kernel.org/kvm/72e40c17-e5cd-1ffd-9a38-00b47e1cbd8e@xxxxxxxxxxxx/ [2] https://lore.kernel.org/kvm/87o7wrug0w.wl-maz@xxxxxxxxxx/ On 8/18/22 22:26, Dmytro Maluka wrote: > KVM irqfd based emulation of level-triggered interrupts doesn't work > quite correctly in some cases, particularly in the case of interrupts > that are handled in a Linux guest as oneshot interrupts (IRQF_ONESHOT). > Such an interrupt is acked to the device in its threaded irq handler, > i.e. later than it is acked to the interrupt controller (EOI at the end > of hardirq), not earlier. > > Linux keeps such interrupt masked until its threaded handler finishes, > to prevent the EOI from re-asserting an unacknowledged interrupt. > However, with KVM + vfio (or whatever is listening on the resamplefd) > we always notify resamplefd at the EOI, so vfio prematurely unmasks the > host physical IRQ, thus a new physical interrupt is fired in the host. > This extra interrupt in the host is not a problem per se. The problem is > that it is unconditionally queued for injection into the guest, so the > guest sees an extra bogus interrupt. [*] > > There are observed at least 2 user-visible issues caused by those > extra erroneous interrupts for a oneshot irq in the guest: > > 1. System suspend aborted due to a pending wakeup interrupt from > ChromeOS EC (drivers/platform/chrome/cros_ec.c). > 2. Annoying "invalid report id data" errors from ELAN0000 touchpad > (drivers/input/mouse/elan_i2c_core.c), flooding the guest dmesg > every time the touchpad is touched. > > The core issue here is that by the time when the guest unmasks the IRQ, > the physical IRQ line is no longer asserted (since the guest has > acked the interrupt to the device in the meantime), yet we > unconditionally inject the interrupt queued into the guest by the > previous resampling. So to fix the issue, we need a way to detect that > the IRQ is no longer pending, and cancel the queued interrupt in this > case. > > With IOAPIC we are not able to probe the physical IRQ line state > directly (at least not if the underlying physical interrupt controller > is an IOAPIC too), so in this patch series we use irqfd resampler for > that. Namely, instead of injecting the queued interrupt, we just notify > the resampler that this interrupt is done. If the IRQ line is actually > already deasserted, we are done. If it is still asserted, a new > interrupt will be shortly triggered through irqfd and injected into the > guest. > > In the case if there is no irqfd resampler registered for this IRQ, we > cannot fix the issue, so we keep the existing behavior: immediately > unconditionally inject the queued interrupt. > > This patch series fixes the issue for x86 IOAPIC only. In the long run, > we can fix it for other irqchips and other architectures too, possibly > taking advantage of reading the physical state of the IRQ line, which is > possible with some other irqchips (e.g. with arm64 GIC, maybe even with > the legacy x86 PIC). > > [*] In this description we assume that the interrupt is a physical host > interrupt forwarded to the guest e.g. by vfio. Potentially the same > issue may occur also with a purely virtual interrupt from an > emulated device, e.g. if the guest handles this interrupt, again, as > a oneshot interrupt. > > > v3: > - Completely reworked: instead of postponing resamplefd notify until > unmask (to avoid extra interrupts in the host), resample the pending > status at unmask to avoid erroneous propagation of those extra > interrupts to the guest. > Thanks to Marc Zyngier for helping to identify the core issue, which > resulted in a simpler and probably more sensible implementation > (even though Marc's concern about presenting inaccurate pending > status to the guest is a non-issue in the case of IOAPIC, since > IOAPIC doesn't present this information anyway). > > v2: > - Fixed compilation failure on non-x86: mask_notifier_list moved from > x86 "struct kvm_arch" to generic "struct kvm". > - kvm_fire_mask_notifiers() also moved from x86 to generic code, even > though it is not called on other architectures for now. > - Instead of kvm_irq_is_masked() implemented > kvm_register_and_fire_irq_mask_notifier() to fix potential race > when reading the initial IRQ mask state. > - Renamed for clarity: > - irqfd_resampler_mask() -> irqfd_resampler_mask_notify() > - kvm_irq_has_notifier() -> kvm_irq_has_ack_notifier() > - resampler->notifier -> resampler->ack_notifier > - Reorganized code in irqfd_resampler_ack() and > irqfd_resampler_mask_notify() to make it easier to follow. > - Don't follow unwanted "return type on separate line" style for > irqfd_resampler_mask_notify(). > > Dmytro Maluka (2): > KVM: irqfd: Make resampler_list an RCU list > KVM: x86/ioapic: Resample the pending state of an IRQ when unmasking > > arch/x86/kvm/ioapic.c | 36 ++++++++++++++++++++++++++++-- > include/linux/kvm_host.h | 9 ++++++++ > include/linux/kvm_irqfd.h | 2 +- > virt/kvm/eventfd.c | 47 ++++++++++++++++++++++++++++++++------- > 4 files changed, 83 insertions(+), 11 deletions(-) >