netvsc device errors followed by unkillable Linux guest VM lockup?

We've been seeing something quite odd recently on some of our Windows systems with Linux guests.  Once every day or two, with no obvious immediate cause, the Linux kernel will emit a stream of errors:
       "hv_netvsc <UUID elided> eth2: unable to close device (ret -11)".

The VM then quickly becomes unkillable: attempts to "Turn Off" or "Reset" from the Hyper-V GUI, or to force-stop from PowerShell, hang indefinitely, and subsequent efforts throw errors about the VM failing to enter the requested state.
Within the Linux guest (before attempting to "Turn Off" etc. via Hyper-V) network-related commands like "ifconfig" hang, and an attempt to reboot will result in the same hung-unkillable guest symptom.

Only a Windows reboot seems to cure the issue.
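
For anyone wanting to check their own guests for the same symptom, here is a minimal sketch of a scanner for the "unable to close device" message quoted above.  It assumes the kernel log has been saved as plain text lines (for example from dmesg or the journal), and the function and table names are made up for illustration.  The small errno table uses the standard Linux values, where 11 is EAGAIN.

```python
import re
from collections import Counter

# Standard Linux errno values; only a few that seem relevant here.
LINUX_ERRNO = {11: "EAGAIN", 16: "EBUSY", 19: "ENODEV", 110: "ETIMEDOUT"}

# Matches lines shaped like the message quoted in this post:
#   hv_netvsc <device id> <ifname>: unable to close device (ret -11)
PATTERN = re.compile(
    r"hv_netvsc (?P<dev>.+?) (?P<ifname>\S+): "
    r"unable to close device \(ret (?P<ret>-?\d+)\)"
)

def scan(lines):
    """Count 'unable to close device' messages per (interface, ret, errno name)."""
    counts = Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m is None:
            continue
        ret = int(m.group("ret"))
        # The kernel reports errors as negative errno values.
        name = LINUX_ERRNO.get(-ret, "unknown")
        counts[(m.group("ifname"), ret, name)] += 1
    return counts
```

Feeding it the saved log lines gives a count per interface and return code, which makes it easier to see whether the errors are confined to one interface and whether they always carry the same code.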

These VMs have three virtual interfaces.  "eth2" is connected to a virtual switch which is connected to the physical ethernet interface of the Windows host.  We've never seen this problem on the other interfaces, which are, respectively, connected to the Windows host itself (eth1) and the Windows host's wireless interface (eth0).

The Linux kernel is the one currently shipped with Debian 12 (Bookworm): 6.1.0-11-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.38-4 (2023-08-08) x86_64 GNU/Linux.  Windows is 21H2 (OS Build 19044.3208).

These Linux guests were recently upgraded from Debian Stretch, which had older kernels (4.x - 5.x over the past few years).  On a few systems, after the Bookworm upgrade, we saw some curious data-corruption problems on eth1 (the interface plumbed through to Windows, as distinct from eth2, where we are seeing the netvsc errors reported above).  After a great deal of experimentation we worked around these with "ethtool --offload eth1 scatter-gather off".  We'd never seen any issues like this (nor the issue we are currently seeing with eth2!) with the older kernels used with Stretch.  However, I should note that we do routinely apply Windows updates, so it's possible that whatever has gone wrong is on the Windows side, not with the Linux guest's netvsc driver.

There is one unusual fact about the configuration of these systems.  Because their users routinely plug and unplug different physical Ethernet adapters from them, but we want to present only a single Ethernet to the Linux guest, we run a small Windows service which connects the virtual switch that is attached to eth2 of this guest to whichever physical interface of the host Windows system has most recently presented carrier (according to WMI).  However, the service never monkeys with the connection of the guest VM itself to the vswitch.  We've been running this service in one form or another since 2016 and it hasn't changed lately; it seems unlikely to me it's involved in what's suddenly gone wrong here; but I figured I should mention it for completeness.
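
To make the selection rule concrete, here is a rough sketch of the "most recently presented carrier" logic.  This is not the service's actual code (the real thing is a Windows service driven by WMI); the CarrierEvent type, the pick_uplink name, and the choice to ignore adapters that have since lost carrier are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass(frozen=True)
class CarrierEvent:
    timestamp: float   # when the carrier change was observed
    adapter: str       # hypothetical physical adapter name
    carrier: bool      # True if the link came up, False if it went down

def pick_uplink(events: Iterable[CarrierEvent]) -> Optional[str]:
    """Return the adapter that most recently gained carrier and still has it."""
    last_up = {}      # adapter -> time of its latest carrier-up event
    has_carrier = {}  # adapter -> current carrier state
    for ev in sorted(events, key=lambda e: e.timestamp):
        has_carrier[ev.adapter] = ev.carrier
        if ev.carrier:
            last_up[ev.adapter] = ev.timestamp
    # Assumption: an adapter that has lost carrier is no longer a candidate.
    candidates = [a for a, up in has_carrier.items() if up]
    if not candidates:
        return None
    return max(candidates, key=lambda a: last_up[a])
```

In this sketch only the vswitch's physical uplink would ever be switched to the returned adapter; the guest's own attachment to the vswitch is left alone, matching the behaviour described above.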

Does anyone have any idea what might be causing what looks like the sudden disappearance of eth2's netvsc device, or how to debug it?

Thanks!

Thor




