Hi list,

I hope this is the right place to ask - if not, please point me to a better place to submit this.

I am facing a problem with the virtio network driver which we have not been able to resolve yet, so I'm now pretty convinced that this really is a bug. The thing is - I cannot "make it fail" in a reproducible way in our environment, but it happens pretty regularly, always with the same effect. Also, since it happens on production VMs, the pressure to get them working again is pretty high. I recently set up a new cluster where it happens more often than in the "standard" environment, so I was at least able to dig a bit deeper into it and do some more tests.

The environment is a Ganeti cluster running qemu-kvm as the hypervisor. The underlying hardware is 64-bit x86 Intel with Mellanox network adapters (ConnectX-4 and ConnectX-5). The VMs are connected via tap interfaces to a Linux bridge on the host nodes. I am using both regular VLANs and VXLAN (it happens in both environments). The host and guest OS is Debian 10 Buster, running the same kernel from backports (currently 5.10.0-0.bpo.12-amd64).

After a seemingly random time, VMs suddenly become inaccessible over the network. A reboot fixes it (soft or hard, doesn't matter), and a VM migration also fixes it. When looking at the traffic with tcpdump, the problem seems to be that the incoming queue of the eth0 adapter inside the VM doesn't receive any packets anymore. All other hops see all the packets, including the ARP requests from the broken VM. The replies reach the tap interface but get dropped on their way to eth0. Since even ARP no longer works, the VM is then dead in the water. The interface still has a link, and there are no log entries or anything useful in dmesg.

Things I have already verified and tried:

- made sure it's not any kind of MAC address conflict
- there are no iptables/ebtables rules involved anywhere
- ifdown/ifup eth0 inside the VM has no effect
- there is no qos/traffic control involved on the host nodes
- we upgraded the kernel on the host nodes and VMs (from Debian 10's standard 4.19 to the current 5.10) - no change
- we upgraded qemu-kvm from 5.4 to 7.0 - no change
- we tested self-compiled kernels 5.15 and even 5.18-rc7 on the VMs - no change
- we upgraded the firmware on the Mellanox adapters - no change
- we disabled all offloads on the virtual interface - no change
- I even considered a very far-fetched idea involving pause frames, but I am pretty sure that is not the case here
- besides a reboot or migration, what also fixes it is to do ifdown eth0, rmmod virtio_net, modprobe virtio_net, ifup eth0 (exact sequence below) - this finally convinced me that it probably is something in the virtio driver
- it seems to happen only on high-traffic VMs, although it's possible it would happen on others too once they reached some combination of uptime and amount of network traffic
- the amount of traffic that has passed over the eth interface differs quite a bit between the similar VMs that were hit

I am now at a dead end and have no real idea what I could do next, besides asking here and maybe filing a bug report. I would be especially interested in any ideas or tips on which counters or proc entries I could watch, or what else could be done to debug more deeply when the issue occurs. I can provide many more details if needed.

Any hints or help would be greatly appreciated.

G
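
P.S. For completeness, the exact in-guest recovery sequence mentioned above (no reboot or migration needed) is nothing more than:

    ifdown eth0
    rmmod virtio_net
    modprobe virtio_net
    ifup eth0

And the tcpdump comparison was simply run on the host's tap device versus eth0 inside the guest, roughly like this (tapX standing in for the VM's actual tap interface name):

    # on the host node
    tcpdump -ni tapX arp
    # inside the VM
    tcpdump -ni eth0 arp

The ARP replies are visible on the tap interface but never show up on eth0. I'm happy to collect whatever is useful the next time it happens - e.g. ethtool -S eth0 inside the guest or anything under /proc or /sys - if someone can point me at the right counters to look at.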