Hi, On 5/7/21 1:02 PM, Marc Zyngier wrote: > On Fri, 07 May 2021 10:58:23 +0100, > Shaokun Zhang <zhangshaokun@xxxxxxxxxxxxx> wrote: >> >> Hi Marc, >> >> Thanks for your quick reply. >> >> On 2021/5/7 17:03, Marc Zyngier wrote: >>> On Fri, 07 May 2021 06:57:04 +0100, >>> Shaokun Zhang <zhangshaokun@xxxxxxxxxxxxx> wrote: >>>> >>>> [This letter comes from Nianyao Tang] >>>> >>>> Hi, >>>> >>>> Using GICv4/4.1 and msi capability, guest vf driver requires 3 >>>> vectors and enable msi, will lead to guest stuck. >>> >>> Stuck how? >> >> Guest serial does not response anymore and guest network shutdown. >> >>> >>>> Qemu gets number of interrupts from Multiple Message Capable field >>>> set by guest. This field is aligned to a power of 2(if a function >>>> requires 3 vectors, it initializes it to 2). >>> >>> So I guess this is a MultiMSI device with 4 vectors, right? >>> >> >> Yes, it can support maximum of 32 msi interrupts, and vf driver only use 3 msi. >> >>>> However, guest driver just sends 3 mapi-cmd to vits and 3 ite >>>> entries is recorded in host. Vfio initializes msi interrupts using >>>> the number of interrupts 4 provide by qemu. When it comes to the >>>> 4th msi without ite in vits, in irq_bypass_register_producer, >>>> producer and consumer will __connect fail, due to find_ite fail, and >>>> do not resume guest. >>> >>> Let me rephrase this to check that I understand it: >>> - The device has 4 vectors >>> - The guest only create mappings for 3 of them >>> - VFIO calls kvm_vgic_v4_set_forwarding() for each vector >>> - KVM doesn't have a mapping for the 4th vector and returns an error >>> - VFIO disable this 4th vector >>> >>> Is that correct? If yes, I don't understand why that impacts the guest >>> at all. From what I can see, vfio_msi_set_vector_signal() just prints >>> a message on the console and carries on. >>> >> >> function calls: >> --> vfio_msi_set_vector_signal >> --> irq_bypass_register_producer >> -->__connect >> >> in __connect, add_producer finally calls kvm_vgic_v4_set_forwarding >> and fails to get the 4th mapping. When add_producer fail, it does >> not call cons->start, calls kvm_arch_irq_bypass_start and then >> kvm_arm_resume_guest. > > [+Eric, who wrote the irq_bypass infrastructure.] > > Ah, so the guest is actually paused, not in a livelock situation > (which is how I interpreted "stuck"). > > I think we should handle this case gracefully, as there should be no > expectation that the guest will be using this interrupt. Given that > VFIO seems to be pretty unfazed when a producer fails, I'm temped to > do the same thing and restart the guest. > > Also, __disconnect doesn't care about errors, so why should __connect > have this odd behaviour? _disconnect() does not care as we should always succeed tearing off things. del_* ops are void functions. On the opposite we can fail setting up the bypass. Effectively a979a6aa009f ("irqbypass: do not start cons/prod when failed connect") needs to be reverted. I agree the kerneldoc comments in linux/irqbypass.h may be improved to better explain the role of stop/start cbs and warn about their potential global impact. wrt the case above, "in __connect, add_producer finally calls kvm_vgic_v4_set_forwarding and fails to get the 4th mapping", shouldn't we succeed in that case? Thanks Eric > > Can you please try this? It is completely untested (and I think the > del_consumer call is odd, which is why I've also dropped it). > > Eric, what do you think? > > Thanks, > > M. > > diff --git a/virt/lib/irqbypass.c b/virt/lib/irqbypass.c > index c9bb3957f58a..7e1865e15668 100644 > --- a/virt/lib/irqbypass.c > +++ b/virt/lib/irqbypass.c > @@ -40,21 +40,14 @@ static int __connect(struct irq_bypass_producer *prod, > if (prod->add_consumer) > ret = prod->add_consumer(prod, cons); > > - if (ret) > - goto err_add_consumer; > - > - ret = cons->add_producer(cons, prod); > - if (ret) > - goto err_add_producer; > + if (!ret) > + ret = cons->add_producer(cons, prod); > > if (cons->start) > cons->start(cons); > if (prod->start) > prod->start(prod); > -err_add_producer: > - if (prod->del_consumer) > - prod->del_consumer(prod, cons); > -err_add_consumer: > + > return ret; > } > >