+other KVM maintainers On Fri, Jan 12, 2024, Yi Wang wrote: > From: Yi Wang <foxywang@xxxxxxxxxxx> > > We found that it may cost more than 20 milliseconds very accidentally > to enable cap of KVM_CAP_SPLIT_IRQCHIP on a host which has many vms > already. > > The reason is that when vmm(qemu/CloudHypervisor) invokes > KVM_CAP_SPLIT_IRQCHIP kvm will call synchronize_srcu_expedited() and > might_sleep and kworker of srcu may cost some delay during this period. might_sleep() yielding is not justification for changing KVM. That's more or less saying "my task got preempted and took longer to run". Well, yeah. > Since this happens during creating vm, it's no need to synchronize srcu > now 'cause everything is not ready(vcpu/irqfd) and none uses irq_srcu now. > > Signed-off-by: Yi Wang <foxywang@xxxxxxxxxxx> > --- > arch/x86/kvm/irq_comm.c | 2 +- > include/linux/kvm_host.h | 2 ++ > virt/kvm/irqchip.c | 3 ++- > 3 files changed, 5 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c > index 16d076a1b91a..37c92b7486c7 100644 > --- a/arch/x86/kvm/irq_comm.c > +++ b/arch/x86/kvm/irq_comm.c > @@ -394,7 +394,7 @@ static const struct kvm_irq_routing_entry empty_routing[] = {}; > > int kvm_setup_empty_irq_routing(struct kvm *kvm) > { > - return kvm_set_irq_routing(kvm, empty_routing, 0, 0); > + return kvm_set_irq_routing(kvm, empty_routing, 0, NONEED_SYNC_SRCU); > } > > void kvm_arch_post_irq_routing_update(struct kvm *kvm) > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 4944136efaa2..a46370cca355 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -1995,6 +1995,8 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm, > > #define KVM_MAX_IRQ_ROUTES 4096 /* might need extension/rework in the future */ > > +#define NONEED_SYNC_SRCU (1U << 0) > + > bool kvm_arch_can_set_irq_routing(struct kvm *kvm); > int kvm_set_irq_routing(struct kvm *kvm, > const struct kvm_irq_routing_entry *entries, > diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c > index 1e567d1f6d3d..cea5c43c1a49 100644 > --- a/virt/kvm/irqchip.c > +++ b/virt/kvm/irqchip.c > @@ -224,7 +224,8 @@ int kvm_set_irq_routing(struct kvm *kvm, > > kvm_arch_post_irq_routing_update(kvm); > > - synchronize_srcu_expedited(&kvm->irq_srcu); > + if (!(flags & NONEED_SYNC_SRCU)) > + synchronize_srcu_expedited(&kvm->irq_srcu); I'm not a fan of x86 passing in a magic flag. It's not immediately clear why skipping synchronization is safe. Piecing things together, _on x86_, I believe the answer is that vCPU can't have yet been created, kvm->lock is held, _and_ kvm_arch_irqfd_allowed() will subtly reject attempts to assign irqfds if the local APIC isn't in-kernel. But AFAICT, s390's implementation of KVM_CREATE_IRQCHIP, which sets up identical dummy/empty routing, doesn't provide the same guarantees. case KVM_CREATE_IRQCHIP: { struct kvm_irq_routing_entry routing; r = -EINVAL; if (kvm->arch.use_irqchip) { /* Set up dummy routing. */ memset(&routing, 0, sizeof(routing)); r = kvm_set_irq_routing(kvm, &routing, 0, 0); } break; } It's entirely possible that someday, kvm_setup_empty_irq_routing() is moved to common KVM and used for s390, at which point we have a mess on our hands because it's not at all obvious whether or not it's safe for s390 to also skip synchronization. Rather than hack in a workaround for x86, I would rather we try and clean up this mess. Except for kvm_irq_map_gsi(), it looks like all flows assume irq_routing is non-NULL. But I'm not remotely confident that that holds true on all architectures, e.g. the only reason kvm_irq_map_gsi() checks for a NULL irq_routing is because syzkaller generated a splat (commit c622a3c21ede ("KVM: irqfd: fix NULL pointer dereference in kvm_irq_map_gsi")). And on x86, I'm pretty sure as of commit 654f1f13ea56 ("kvm: Check irqchip mode before assign irqfd"), which added kvm_arch_irqfd_allowed(), it's impossible for kvm_irq_map_gsi() to encounter a NULL irq_routing _on x86_. But I strongly suspect other architectures can reach kvm_irq_map_gsi() with a NULL irq_routing, e.g. RISC-V dynamically configures its interrupt controller, yet doesn't implement kvm_arch_intc_initialized(). So instead of special casing x86, what if we instead have KVM setup an empty IRQ routing table during kvm_create_vm(), and then avoid this mess entirely? That way x86 and s390 no longer need to set empty/dummy routing when creating an IRQCHIP, and the worst case scenario of userspace misusing an ioctl() is no longer a NULL pointer deref.