On Wed, Aug 24, 2022 at 3:52 PM Alvaro Karsz <alvaro.karsz@xxxxxxxxxxxxx> wrote:
>
> I think that we should add a timeout to the control virtqueue commands.
> If the hypervisor crashes while handling a control command, the guest
> will spin forever.
> This may not be necessary for a virtual environment, when both the
> hypervisor and the guest OS run in the same bare metal, but this
> is needed for a physical network device compatible with VirtIO.
>
> (In these cases, the network device acts as the hypervisor, and the
> server acts as the guest OS).
>
> The network device may fail to answer a control command, or may crash, leading
> to a stall in the server.
>
> My idea is to add a big enough timeout, to allow the slow devices to
> complete the command.
>
> I wrote a simple patch that returns false from virtnet_send_command in
> case of timeouts.
>
> The timeout approach introduces some serious problems in cases when
> the network device does answer the control command, but after the
> timeout.
>
> * The device will think that the command succeeded, while the server won't.
> This may be serious with the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command.
> The server may receive packets in an unexpected queue.
>
> * virtqueue_get_buf will return the previous response for the next
> control command.
>
> Addressing this case by adding a timeout to the spec won't be easy,
> since the network device and the server have different clocks, and the
> server won't know when exactly the network device noticed the kick.
>
> So maybe we should call virtnet_remove if we reach a timeout.

Or reset. But can we simply use an interrupt instead of busy waiting?

Thanks

>
> Or maybe we can just assume that the network device crashed after a
> long timeout, and nothing should be done.
>
> What do you guys think?
>
> Alvaro
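
To make the second bullet in the quoted text concrete (a late answer being picked up as the response to the next command), here is a minimal userspace sketch in C. It is not driver code. `fake_vq`, `fake_device_complete`, `fake_get_buf`, `poll_completion` and `demo_stale_response` are hypothetical names invented for illustration; `fake_get_buf` only loosely mimics what virtqueue_get_buf does, and the timeout is counted in poll iterations rather than wall-clock time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the control virtqueue: the "device" posts the
 * id of each command it has completed into a small FIFO. */
#define FAKE_VQ_DEPTH 8

struct fake_vq {
    int used[FAKE_VQ_DEPTH]; /* ids of completed commands */
    size_t head;             /* next entry the driver consumes */
    size_t tail;             /* next entry the device fills */
};

/* Device side: complete command `id`, possibly long after it was sent. */
static void fake_device_complete(struct fake_vq *vq, int id)
{
    assert(vq->tail - vq->head < FAKE_VQ_DEPTH); /* FIFO must not overflow */
    vq->used[vq->tail++ % FAKE_VQ_DEPTH] = id;
}

/* Driver side: returns true and stores the id if a completion is pending. */
static bool fake_get_buf(struct fake_vq *vq, int *id)
{
    if (vq->head == vq->tail)
        return false;
    *id = vq->used[vq->head++ % FAKE_VQ_DEPTH];
    return true;
}

/* Busy-poll for a completion, giving up after max_polls attempts.
 * Kernel code would typically relax the CPU between attempts. */
static bool poll_completion(struct fake_vq *vq, unsigned int max_polls, int *id)
{
    for (unsigned int i = 0; i < max_polls; i++) {
        if (fake_get_buf(vq, id))
            return true;
    }
    return false; /* timed out */
}

/* Replays the late-answer scenario and returns the id that the driver
 * receives while it is waiting for command 2 (or -1 if the scenario did
 * not play out as described). */
static int demo_stale_response(void)
{
    struct fake_vq vq = { .head = 0, .tail = 0 };
    int id = -1;

    /* Command 1 is sent; the device stays silent, so the poll times out. */
    if (poll_completion(&vq, 1000, &id))
        return -1;

    /* The device answers command 1 only after the driver has given up. */
    fake_device_complete(&vq, 1);

    /* Command 2 is sent and not yet answered, but the next poll picks up
     * the leftover completion that belongs to command 1. */
    if (!poll_completion(&vq, 1000, &id))
        return -1;
    return id;
}
```

In this sketch `demo_stale_response()` returns 1: the driver is waiting for command 2 but is handed the completion of command 1. That mismatch is why a timeout alone is not enough, and why the thread discusses removing or resetting the device once the timeout fires.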