Hi list,

I hope this is the right place to ask - if not, please point me to a better place to submit this.

I am facing a problem with the virtio network driver which we have not been able to resolve yet, so I'm now pretty convinced that this really is a bug. The thing is - I cannot "make it fail" in a reproducible way in our environment, but it happens pretty regularly, always with the same effect. Also, since it happens on production VMs, the pressure to get them working again is pretty high. I recently set up a new cluster where it happens more often than in the "standard" environment, so I was at least able to dig a bit deeper into it and do some more tests.

The environment is a Ganeti cluster running qemu-kvm as the hypervisor. The underlying hardware is 64-bit x86 Intel with Mellanox network adapters (ConnectX-4 and ConnectX-5). The VMs are connected via tap interfaces to a Linux bridge on the host nodes. I am using both regular VLANs and VXLAN (it happens in both environments). The host and guest OS is Debian 10 Buster, running the same kernel from backports (currently 5.10.0-0.bpo.12-amd64).

After a seemingly random time, VMs suddenly become inaccessible over the network. A reboot fixes it (soft or hard, doesn't matter), and a VM migration also fixes it. When looking at the traffic with tcpdump, the problem seems to be that the incoming queue of the eth0 adapter inside the VM doesn't receive any packets anymore. All other hops see all the packets, including the ARP requests from the broken VM. The replies reach the tap interface but get dropped on their way to eth0. Since even ARP no longer works, the VM is then dead in the water. The interface still has a link, and there are no log entries or anything useful in dmesg.

Things I have already verified and tried:

- made sure it's not any kind of MAC address conflict
- there are no iptables/ebtables rules involved anywhere
- ifdown/ifup eth0 inside the VM has no effect
- there is no qos/traffic control involved on the host nodes
- we upgraded the kernel on the host nodes and VMs (from Debian 10's standard 4.19 to the current 5.10) - no change
- we upgraded qemu-kvm from 5.4 to 7.0 - no change
- we tested self-compiled kernels 5.15 and even 5.18-rc7 on the VMs - no change
- we upgraded the firmware on the Mellanox adapters - no change
- we disabled all offloads on the virtual interface - no change
- I even considered a very far-fetched idea involving pause frames, but I am pretty sure that is not the case here
- besides a reboot or migration, what also fixes it is to do ifdown eth0, rmmod virtio_net, modprobe virtio_net, ifup eth0 (exact sequence below) - this finally convinced me that it probably is something in the virtio driver
- it seems to happen only on high-traffic VMs, although it's possible it would happen on others too once they reached some combination of uptime and amount of network traffic
- the amount of traffic that has passed over the eth interface differs quite a bit between the similar VMs that were hit

I am now at a dead end and have no real idea what I could do next, besides asking here and maybe filing a bug report. I would be especially interested in any ideas or tips on which counters or proc entries I could watch, or what else could be done to debug more deeply when the issue occurs. I can provide many more details if needed.

Any hints or help would be greatly appreciated.

G
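
P.S. For completeness, the exact in-guest recovery sequence mentioned above (no reboot or migration needed) is nothing more than:

    ifdown eth0
    rmmod virtio_net
    modprobe virtio_net
    ifup eth0

And the tcpdump comparison was simply run on the host's tap device versus eth0 inside the guest, roughly like this (tapX standing in for the VM's actual tap interface name):

    # on the host node
    tcpdump -ni tapX arp
    # inside the VM
    tcpdump -ni eth0 arp

The ARP replies are visible on the tap interface but never show up on eth0. I'm happy to collect whatever is useful the next time it happens - e.g. ethtool -S eth0 inside the guest or anything under /proc or /sys - if someone can point me at the right counters to look at.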