On 2021/5/8 1:36 AM, Marc Zyngier wrote:
On Fri, 07 May 2021 12:02:57 +0100,
Marc Zyngier <maz@xxxxxxxxxx> wrote:
On Fri, 07 May 2021 10:58:23 +0100,
Shaokun Zhang <zhangshaokun@xxxxxxxxxxxxx> wrote:
Hi Marc,
Thanks for your quick reply.
On 2021/5/7 17:03, Marc Zyngier wrote:
On Fri, 07 May 2021 06:57:04 +0100,
Shaokun Zhang <zhangshaokun@xxxxxxxxxxxxx> wrote:
[This letter comes from Nianyao Tang]
Hi,
With GICv4/4.1 and the MSI capability, when a guest VF driver requests
3 vectors and enables MSI, the guest gets stuck.
Stuck how?
The guest serial console no longer responds and the guest network goes
down.
QEMU gets the number of interrupts from the Multiple Message Capable
field set by the guest. This field encodes a power of 2 (if a function
requires 3 vectors, it initializes it to 2, i.e. 4 vectors).
So I guess this is a MultiMSI device with 4 vectors, right?
Yes, it can support a maximum of 32 MSI interrupts, and the VF driver only uses 3 of them.
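
To make the 3-versus-4 arithmetic concrete, here is a minimal sketch of
the power-of-two encoding discussed above. The helper names are
hypothetical and this is not QEMU or kernel code; it only assumes that
the vector count is rounded up to a power of two (capped at 32) and
stored as its log2.

```c
/* Hypothetical helpers for illustration; not QEMU or kernel code. */

/* Round a requested vector count up to the next power of two, max 32. */
static unsigned int msi_round_up_vectors(unsigned int requested)
{
    unsigned int n = 1;

    while (n < requested && n < 32)
        n <<= 1;
    return n;
}

/* Encode the rounded vector count as log2 (1 -> 0, 2 -> 1, 4 -> 2, ...). */
static unsigned int msi_encode_vectors(unsigned int requested)
{
    unsigned int n = msi_round_up_vectors(requested);
    unsigned int enc = 0;

    while ((1u << enc) < n)
        enc++;
    return enc;
}
```

With these helpers, a request for 3 vectors rounds up to 4 and is
encoded as 2, which is why the host side sets up one more vector than
the guest driver maps.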
However, the guest driver only sends 3 MAPI commands to the vITS, so
only 3 ITEs are recorded in the host. VFIO initializes the MSI
interrupts using the interrupt count of 4 provided by QEMU. When it
reaches the 4th MSI, which has no ITE in the vITS, __connect fails for
the producer and consumer in irq_bypass_register_producer because
find_ite fails, and the guest is not resumed.
Let me rephrase this to check that I understand it:
- The device has 4 vectors
- The guest only creates mappings for 3 of them
- VFIO calls kvm_vgic_v4_set_forwarding() for each vector
- KVM doesn't have a mapping for the 4th vector and returns an error
- VFIO disables this 4th vector
Is that correct? If yes, I don't understand why that impacts the guest
at all. From what I can see, vfio_msi_set_vector_signal() just prints
a message on the console and carries on.
function calls:
--> vfio_msi_set_vector_signal
--> irq_bypass_register_producer
-->__connect
In __connect, add_producer finally calls kvm_vgic_v4_set_forwarding
and fails to get the 4th mapping. When add_producer fails, __connect
does not call cons->start, which would call kvm_arch_irq_bypass_start
and then kvm_arm_resume_guest.
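
The sequence described above can be sketched as a small userspace
model. This is not the kernel's __connect: struct vm_model and all
model_* functions are hypothetical names, and the only assumptions are
the ones stated in this thread (the consumer's stop/start pause and
resume the whole guest, and add_producer fails for a vector without a
mapping).

```c
#include <stdbool.h>

/*
 * Simplified userspace model, NOT the kernel code. struct vm_model and
 * every model_* function are hypothetical names used for illustration.
 */
struct vm_model {
    bool paused;         /* true while the guest is halted */
    int  mapped_vectors; /* vectors the guest has mapped in the vITS */
};

/* Models cons->stop: on KVM/arm64 this halts the whole guest. */
static void model_cons_stop(struct vm_model *vm)
{
    vm->paused = true;
}

/* Models cons->start: on KVM/arm64 this resumes the whole guest. */
static void model_cons_start(struct vm_model *vm)
{
    vm->paused = false;
}

/* Models cons->add_producer: fails when the vector has no mapping. */
static int model_add_producer(const struct vm_model *vm, int vector)
{
    return (vector < vm->mapped_vectors) ? 0 : -1;
}

/*
 * Models the connect step for one vector. With start_on_failure set to
 * false the consumer is restarted only on success; with true it is
 * restarted whether or not add_producer succeeded.
 */
static int model_connect(struct vm_model *vm, int vector,
                         bool start_on_failure)
{
    int ret;

    model_cons_stop(vm);
    ret = model_add_producer(vm, vector);
    if (ret == 0 || start_on_failure)
        model_cons_start(vm);
    return ret;
}

/* Connects vectors 0..nvec-1; returns true if the guest ends up paused. */
static bool model_guest_left_paused(int mapped, int nvec,
                                    bool start_on_failure)
{
    struct vm_model vm = { .paused = false, .mapped_vectors = mapped };

    for (int v = 0; v < nvec; v++)
        (void)model_connect(&vm, v, start_on_failure);
    return vm.paused;
}
```

In this model, model_guest_left_paused(3, 4, false) reports a paused
guest, matching the hang reported here, while passing true for
start_on_failure leaves the guest running even though the 4th connect
still returns an error.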
[+Eric, who wrote the irq_bypass infrastructure.]
Ah, so the guest is actually paused, not in a livelock situation
(which is how I interpreted "stuck").
I think we should handle this case gracefully, as there should be no
expectation that the guest will be using this interrupt. Given that
VFIO seems to be pretty unfazed when a producer fails, I'm tempted to
do the same thing and restart the guest.
Also, __disconnect doesn't care about errors, so why should __connect
have this odd behaviour?
Can you please try this? It is completely untested (and I think the
del_consumer call is odd, which is why I've also dropped it).
Eric, what do you think?
Adding Zhu, Jason, MST to the party. It all seems to be caused by this
commit:
commit a979a6aa009f3c99689432e0cdb5402a4463fb88
Author: Zhu Lingshan <lingshan.zhu@xxxxxxxxx>
Date: Fri Jul 31 14:55:33 2020 +0800
irqbypass: do not start cons/prod when failed connect
If failed to connect, there is no need to start consumer nor
producer.
Signed-off-by: Zhu Lingshan <lingshan.zhu@xxxxxxxxx>
Suggested-by: Jason Wang <jasowang@xxxxxxxxxx>
Link: https://lore.kernel.org/r/20200731065533.4144-7-lingshan.zhu@xxxxxxxxx
Signed-off-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
Zhu, I'd really like to understand why you think it is OK not to
restart consumer and producers when a connection has failed to be
established between the two?
My bad, I didn't check the ARM code, but it is not easy to infer that
cons->start/stop is not a per-consumer operation but a global one,
such as halting/resuming the VM.
In the case of KVM/arm64, this results in the guest being forever
suspended and never resumed. That's obviously not an acceptable
regression, as there are a number of benign reasons for a connect to
fail.
Let's revert this commit.
Thanks
Thanks,
M.