[Bug 200101] random freeze under load

https://bugzilla.kernel.org/show_bug.cgi?id=200101

--- Comment #3 from Garry Filakhtov (filakhtov@xxxxxxxxx) ---
Struggling with the same issue. Also coming from Gentoo 👋 lekto!

This report was a long time coming: I needed a lot of time to rule out
hardware issues and any kind of misconfiguration on my end before reporting
here.

I have an Intel X299 platform and use it to run a Windows 10 virtual machine
with PCI pass-through. I pass through an NVMe SSD (Samsung 970 EVO Plus), a
PCIe USB 3.0 adapter (StarTech PEXUSB3S3GE) and a GPU (nVidia GeForce GTX 1650)
to get the best possible performance and isolation from the host OS.

I had been running the 4.19 LTS kernel without any issues, but when 5.4 LTS was
promoted to stable for the AMD64 architecture, I switched. Since then, I have
been experiencing random guest freezes, occurring anywhere from immediately
after boot to after multiple hours of usage. When a freeze occurs, the guest
completely stops responding to input, ping, etc. The host keeps working fine
and I can connect to the QEMU monitor socket without any problems. I am running
QEMU 4.2.0.

A freeze lasts anywhere from 1 to 5 minutes; eventually the VM recovers and
works properly until the next freeze. Inspecting dmesg and journalctl on the
host reveals no relevant entries.

The problem appears regardless of the workload: the guest can freeze on the
desktop, in the web browser or in a GPU benchmark. When I was playing music in
the guest, the sound started to drop/glitch just before a freeze and then went
completely silent.

Windows Event Viewer is of course as useful as a fridge at the North Pole
before climate change :D (pardon my pun), meaning no entries are produced
during a freeze, and there is actually a gap between written entries for
however long the freeze lasted.

So far, I have tested a good variety of kernel versions:

  [1]   linux-4.19.120-gentoo <- works fine
  [2]   linux-4.20.17-gentoo <- works fine
  [3]   linux-5.0.0-gentoo <- randomly freezes as described
  [4]   linux-5.0.21-gentoo <- randomly freezes as described
  [5]   linux-5.1.21-gentoo <- can't even boot guest, getting freeze during
        very early boot
  [6]   linux-5.2.20-gentoo <- qemu won't even start, complaining about KVM
        suberror 1
  [7]   linux-5.3.18-gentoo <- randomly freezes as described
  [8]   linux-5.4.38-gentoo <- randomly freezes as described

My takeaway here is that something went wrong between 4.20 and 5.0.0 and has
not been fixed since.

I have not yet tried to bisect the Git source, but might give it a go, time
permitting.
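
For reference, a bisect between the last working and the first broken release
could be driven roughly as sketched below. This is only a sketch under
assumptions that are not part of my testing: it expects to run inside a clone
of the mainline kernel tree, and it uses the mainline tags v4.20 and v5.0 in
place of the Gentoo kernels 4.20.17 (good) and 5.0.0 (bad). The helper names
run, bisect_begin and bisect_mark are made up for illustration. By default it
only prints the git commands; set DRY_RUN=0 to actually execute them.

```shell
#!/bin/sh
# Sketch only. Assumptions (not from the report): the current directory is a
# clone of the mainline kernel tree, and the mainline tags v4.20 and v5.0 stand
# in for the tested Gentoo kernels 4.20.17 (good) and 5.0.0 (bad).
GOOD="${GOOD:-v4.20}"
BAD="${BAD:-v5.0}"

# run (hypothetical helper): print a command, execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# bisect_begin (hypothetical helper): start bisecting between BAD and GOOD.
bisect_begin() {
    run git bisect start "$BAD" "$GOOD"
}

# bisect_mark (hypothetical helper): record the verdict for the commit that
# is currently checked out. Usage: bisect_mark good|bad|skip
bisect_mark() {
    case "$1" in
        good|bad|skip) run git bisect "$1" ;;
        *) echo "usage: bisect_mark good|bad|skip" >&2; return 1 ;;
    esac
}

bisect_begin
```

After each step the kernel has to be built, the host rebooted into it and the
guest started. Because a freeze can take hours to show up, a commit should
only be marked good (bisect_mark good) after a long enough run without a
freeze; kernels where the guest cannot start at all are candidates for
bisect_mark skip.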

I am using a bare qemu-system-x86_64 command to rule out virt-manager problems.
PCIe devices are attached via separate pcie-root-port devices. I boot using
OVMF UEFI (sys-firmware/edk2-ovmf-201905) with Secure Boot enabled (disabling
Secure Boot makes no difference). I also did a clean Windows 10 install to
rule out any issues with the guest OS itself, but the problem persisted. I have
tried the Windows-provided GPU drivers as well as the latest from nVidia.
I am using the "host" CPU model for qemu.

A similar problem has been reported on Reddit too; the solution there was to
downgrade:
https://www.reddit.com/r/VFIO/comments/b1xx0g/windows_10_qemukvm_freezes_after_50x_kernel_update/

Host hardware:
Motherboard: ASUS WS X299 SAGE
CPU: Intel Core i9-9940X
Guest GPU: nVidia GTX 1650
Host GPU: AMD Radeon PRO WX 3100
RAM: 64 GB (4x16 GB) DDR4 2666MHz
SSD: Samsung 970 EVO Plus
PCIe adapter: StarTech PEXUSB3S3GE 3xUSB3.0 + USB Realtek Gigabit network combo
adapter
Guest OS: Windows 10 Professional (1909)
QEMU version: 4.2.0

qemu options used:
-name Microsoft Windows 10 Professional
-M q35,kernel_irqchip=on,vmport=off,accel=kvm,mem-merge=off
-nodefaults
-display none
-vga none
-net none
-nographic
-monitor unix:/run/qemu/win10.sock,server,nowait
-pidfile /run/qemu/win10.pid
-cpu host,kvm=off
-smp sockets=1,cores=6,threads=2
-m size=16G
-drive if=pflash,format=raw,readonly,file=/usr/share/edk2-ovmf/OVMF_CODE.secboot.fd
-drive if=pflash,format=raw,file=/usr/share/edk2-ovmf/OVMF_VARS.secboot.fd
-rtc base=localtime
-device pcie-root-port,id=port0.0,bus=pcie.0,chassis=0,slot=0,addr=1.0
-device vfio-pci,host=19:0.0,multifunction=on,bus=port0.0,addr=0.0
-device vfio-pci,host=19:0.1,bus=pcie.0,bus=port0.0,addr=0.1
-device pcie-root-port,id=port0.2,bus=pcie.0,chassis=0,slot=2
-device vfio-pci,host=1a:0.0,bus=port0.2
-device pcie-root-port,id=port0.5,bus=pcie.0,chassis=0,slot=5
-device vfio-pci,host=b3:0.0,bus=port0.5

I will try lekto's suggestion and report back any progress.
