I realized I don't belong here when I started getting the rest of your postings. I've updated my situation and taken it over to the qemu mailing list, where it may be more appropriate. For posterity, here is what I posted there:

Hello everyone,

It seems the only way I can get multi-seat to work is by having the OS and the VMs on a single disk, and after weeks of futility I'm starting to wonder if I can even replicate that.

I have two VMs which work surprisingly well with VFIO/IOMMU, unless I run them concurrently. If I do, the display driver will crash on one VM, followed shortly by the other. I've replicated this problem with multiple kernels from 4.2.1 to 4.7.x, and on two X58/LGA1366 motherboards, so I suspect it affects most or all of them, at least when used with Debian / Proxmox. There is nothing in the system logs to indicate why.

Here are the specs on the system I'm currently working on:

Distro:    Debian 8 / Proxmox 4.2
MB:        Asus Rampage III
CPU:       Xeon X5670
RAM:       24 GB
DISK1:     OS - XFS/LVM
DISK2-4:   VMs - ZFS RAIDZ-1

I've also seen the same on a GA-EX58 motherboard, set up identically. I've tried ZFS, and MDADM with and without LVM; I've tried MDADM RAID 5, 1, and even 0. I thought for sure that in the worst case I would at least be able to assign a VM per disk. Not so. Oddly, it has actually gotten worse: before, I needed to start something 3D on both VMs to reliably crash them (usually within seconds of each other). Now all I have to do is start the second one, and the display driver will crash on the first. (The fact that both VMs always crash has to be indicative of something, but I'm not sure what.)

I'm pretty much back at the drawing board. I'm actually starting to doubt that my 'single disk test' really worked. Maybe I just didn't run it long enough? So I will try that again. Unfortunately, the only disks I have on hand that are large enough to hold everything are spindle disks, so it won't be an exact replica. Beyond that, I really don't know.

I currently have the system set up in almost the most basic way I can and still have something acceptable:

- OS on a single 120 GB SSD
- VM root pool on 3x 240 GB SSD, RAIDZ-1

Soft-rebooting a VM will always cause that VM's display to get garbled on POST. I don't even have to get into Windows; if that happens, I know the VM is beyond salvation, and the second one is going down too.

I'm beginning to think this is somehow tied to my X58-chipset motherboards (it happens identically on both a Gigabyte and an Asus board with that chipset), or to the qemu/kvm that comes with Proxmox. A third possibility may be some server-oriented tuning cooked into Proxmox. (Maybe I'll do the single-disk test with regular Debian this time, and see if anything changes.)
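As a sanity check, here is the sort of thing I run to confirm that each GPU and its audio function sit in their own IOMMU group (just a sketch using the standard sysfs layout; 04:00.x and 05:00.x are the addresses I pass to vfio-pci below, so adjust to taste):

# Each symlink is /sys/kernel/iommu_groups/<group>/devices/<pci-address>,
# so the group number is visible in the path.
find /sys/kernel/iommu_groups/ -type l | grep -E '0[45]:00\.[01]'

The two cards should appear under different group numbers, with nothing else sharing those groups.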
Here's how I invoke KVM:

VM1:

# sed -e 's/#.*$//' -e '/^$/d' /root/src/brian.1
/usr/bin/systemd-run \
    --scope \
    --slice qemu \
    --unit 110 \
    -p KillMode=none \
    -p CPUShares=250000 \
    /usr/bin/kvm -id 110 \
    -chardev socket,id=qmp,path=/var/run/qemu-server/110.qmp,server,nowait \
    -mon chardev=qmp,mode=control \
    -pidfile /var/run/qemu-server/110.pid \
    -daemonize \
    -smbios type=1,uuid=6a9ea4a2-48bd-415e-95fb-adf8c9db44f7 \
    -drive if=pflash,format=raw,readonly,file=/usr/share/kvm/OVMF-pure-efi.fd \
    -drive if=pflash,format=raw,file=/root/sbin/110-OVMF_VARS-pure-efi.fd \
    -name Brian-PC \
    -smp 12,sockets=1,cores=12,maxcpus=12 \
    -nodefaults \
    -boot menu=on,strict=on,reboot-timeout=1000 \
    -vga none \
    -nographic \
    -no-hpet \
    -cpu host,hv_vendor_id=Nvidia43FIX,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_relaxed,+kvm_pv_unhalt,+kvm_pv_eoi,kvm=off \
    -m 8192 \
    -object memory-backend-ram,size=8192M,id=ram-node0 \
    -numa node,nodeid=0,cpus=0-11,memdev=ram-node0 \
    -k en-us \
    -readconfig /usr/share/qemu-server/pve-q35.cfg \
    -device usb-tablet,id=tablet,bus=ehci.0,port=1 \
    -device vfio-pci,host=04:00.0,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0 \
    -device vfio-pci,host=04:00.1,id=hostpci1,bus=ich9-pcie-port-2,addr=0x0 \
    -device usb-host,hostbus=1,hostport=6.1 \
    -device usb-host,hostbus=1,hostport=6.2.1 \
    -device usb-host,hostbus=1,hostport=6.2.2 \
    -device usb-host,hostbus=1,hostport=6.2.3 \
    -device usb-host,hostbus=1,hostport=6.2 \
    -device usb-host,hostbus=1,hostport=6.3 \
    -device usb-host,hostbus=1,hostport=6.4 \
    -device usb-host,hostbus=1,hostport=6.5 \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 \
    -drive file=/dev/zvol/SSD-pool/vm-110-disk-1,if=none,id=drive-virtio0,cache=writeback,format=raw,aio=threads,detect-zeroes=on \
    -device virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100 \
    -netdev type=tap,id=net0,ifname=tap110i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on \
    -device virtio-net-pci,mac=32:61:36:63:37:64,netdev=net0,bus=pci.0,addr=0x12,id=net0 \
    -rtc driftfix=slew,base=localtime \
    -machine type=q35 \
    -global kvm-pit.lost_tick_policy=discard

VM2:

# sed -e 's/#.*$//' -e '/^$/d' /root/src/madzia.2
/usr/bin/systemd-run \
    --scope \
    --slice qemu \
    --unit 111 \
    -p KillMode=none \
    -p CPUShares=250000 \
    /usr/bin/kvm \
    -id 111 \
    -chardev socket,id=qmp,path=/var/run/qemu-server/111.qmp,server,nowait \
    -mon chardev=qmp,mode=control \
    -pidfile /var/run/qemu-server/111.pid \
    -daemonize \
    -smbios type=1,uuid=55d862f4-d9b9-40ab-9b0a-e1eadf874750 \
    -drive if=pflash,format=raw,readonly,file=/usr/share/kvm/OVMF-pure-efi.fd \
    -drive if=pflash,format=raw,file=/root/sbin/111-OVMF_VARS-pure-efi.fd \
    -name Madzia-PC \
    -smp 12,sockets=1,cores=12,maxcpus=12 \
    -nodefaults \
    -boot menu=on,strict=on,reboot-timeout=1000 \
    -vga none \
    -nographic \
    -no-hpet \
    -cpu host,hv_vendor_id=Nvidia43FIX,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_relaxed,+kvm_pv_unhalt,+kvm_pv_eoi,kvm=off \
    -m 8192 \
    -object memory-backend-ram,size=8192M,id=ram-node0 \
    -numa node,nodeid=0,cpus=0-11,memdev=ram-node0 \
    -k en-us \
    -readconfig /usr/share/qemu-server/pve-q35.cfg \
    -device usb-tablet,id=tablet,bus=ehci.0,port=1 \
    -device vfio-pci,host=05:00.0,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0 \
    -device vfio-pci,host=05:00.1,id=hostpci1,bus=ich9-pcie-port-2,addr=0x0 \
    -device usb-host,hostbus=2,hostport=2.1 \
    -device usb-host,hostbus=2,hostport=2.2 \
    -device usb-host,hostbus=2,hostport=2.3 \
    -device usb-host,hostbus=2,hostport=2.4 \
    -device usb-host,hostbus=2,hostport=2.5 \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 \
    -iscsi initiator-name=iqn.1993-08.org.debian:01:1530d013b944 \
    -drive file=/dev/zvol/SSD-pool/vm-111-disk-1,if=none,id=drive-virtio0,cache=writeback,format=raw,aio=threads,detect-zeroes=on \
    -device virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100 \
    -netdev type=tap,id=net0,ifname=tap111i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on \
    -device virtio-net-pci,mac=4E:F0:DD:90:DB:2D,netdev=net0,bus=pci.0,addr=0x12,id=net0 \
    -rtc driftfix=slew,base=localtime \
    -machine type=q35 \
    -global kvm-pit.lost_tick_policy=discard

However, I've tried many invocations of KVM without success.

Here is how I load my modules:

# cat /etc/modprobe.d/iommu_unsafe_interrupts.conf
options vfio_iommu_type1 allow_unsafe_interrupts=1

# cat /etc/modprobe.d/vfio_pci.conf
options vfio_pci disable_vga=1
#install vfio_pci /root/sbin/vfio-pci-override-vga.sh
options vfio-pci ids=10de:13c2,10de:0fbb,10de:11c0,10de:0e0b

# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=4299967296

# cat /etc/modprobe.d/kvm.conf
options kvm ignore_msrs=1

... I believe grub is set up correctly ...

# sed -e 's/#.*$//' -e '/^$/d' /etc/default/grub
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="Proxmox Virtual Environment"
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1 quiet"
GRUB_CMDLINE_LINUX=""
GRUB_DISABLE_OS_PROBER=true
GRUB_DISABLE_RECOVERY="true"

... I believe I have all the correct modules loaded on boot ...

# sed -e 's/#.*$//' -e '/^$/d' /etc/modules
coretemp
it87
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

... Here's the Q35 config file ...

# sed -e 's/#.*$//' -e '/^$/d' /usr/share/qemu-server/pve-q35.cfg
[device "ehci"]
driver = "ich9-usb-ehci1"
multifunction = "on"
bus = "pcie.0"
addr = "1d.7"

[device "uhci-1"]
driver = "ich9-usb-uhci1"
multifunction = "on"
bus = "pcie.0"
addr = "1d.0"
masterbus = "ehci.0"
firstport = "0"

[device "uhci-2"]
driver = "ich9-usb-uhci2"
multifunction = "on"
bus = "pcie.0"
addr = "1d.1"
masterbus = "ehci.0"
firstport = "2"

[device "uhci-3"]
driver = "ich9-usb-uhci3"
multifunction = "on"
bus = "pcie.0"
addr = "1d.2"
masterbus = "ehci.0"
firstport = "4"

[device "ehci-2"]
driver = "ich9-usb-ehci2"
multifunction = "on"
bus = "pcie.0"
addr = "1a.7"

[device "uhci-4"]
driver = "ich9-usb-uhci4"
multifunction = "on"
bus = "pcie.0"
addr = "1a.0"
masterbus = "ehci-2.0"
firstport = "0"

[device "uhci-5"]
driver = "ich9-usb-uhci5"
multifunction = "on"
bus = "pcie.0"
addr = "1a.1"
masterbus = "ehci-2.0"
firstport = "2"

[device "uhci-6"]
driver = "ich9-usb-uhci6"
multifunction = "on"
bus = "pcie.0"
addr = "1a.2"
masterbus = "ehci-2.0"
firstport = "4"

[device "audio0"]
driver = "ich9-intel-hda"
bus = "pcie.0"
addr = "1b.0"

[device "ich9-pcie-port-1"]
driver = "ioh3420"
multifunction = "on"
bus = "pcie.0"
addr = "1c.0"
port = "1"
chassis = "1"

[device "ich9-pcie-port-2"]
driver = "ioh3420"
multifunction = "on"
bus = "pcie.0"
addr = "1c.1"
port = "2"
chassis = "2"

[device "ich9-pcie-port-3"]
driver = "ioh3420"
multifunction = "on"
bus = "pcie.0"
addr = "1c.2"
port = "3"
chassis = "3"

[device "ich9-pcie-port-4"]
driver = "ioh3420"
multifunction = "on"
bus = "pcie.0"
addr = "1c.3"
port = "4"
chassis = "4"

[device "pcidmi"]
driver = "i82801b11-bridge"
bus = "pcie.0"
addr = "1e.0"

[device "pci.0"]
driver = "pci-bridge"
bus = "pcidmi"
addr = "1.0"
chassis_nr = "1"

[device "pci.1"]
driver = "pci-bridge"
bus = "pcidmi"
addr = "2.0"
chassis_nr = "2"
[device "pci.2"]
driver = "pci-bridge"
bus = "pcidmi"
addr = "3.0"
chassis_nr = "3"

... and plenty of CPU ...

# cat /proc/cpuinfo | grep -A 4 'processor.*: 11'
processor       : 11
vendor_id       : GenuineIntel
cpu family      : 6
model           : 44
model name      : Intel(R) Xeon(R) CPU X5670 @ 2.93GHz

If anyone has any suggestions, I would greatly appreciate it.

----- Original Message -----
From: "Brian Yglesias" <brian@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>
To: "kvm" <kvm@xxxxxxxxxxxxxxx>
Sent: Monday, August 1, 2016 3:53:41 AM
Subject: Windows guests crash when using IOMMU and more than one Windows 10 VM on software raid (zfs or mdadm/lvm)

Passing a GPU to a VM and running a second VM will cause both VMs to crash, if the root file system of the VM is on software raid.

OS: Debian 8 / Proxmox 4.2
Kernel: 4.x
qemu-server: 4.0-85
zfs module: 0.6.5.7-10_g27f3ec9
mdadm: 3.3.2-5+deb8u1
lvm: 2.02.116-pve2
Motherboard 1: Asus Rampage III (latest BIOS)
Motherboard 2: Gigabyte GA-EX58 (latest BIOS)
Chipset 1 and 2: X58
CPU 1 and 2: Xeon X5670
RAM 1 and 2: 24 GB
GPU 1: GeForce 660
GPU 2: GeForce 970

The problem manifests slightly differently, depending on the software raid.

Steps to reproduce:

Universal:
* Install Proxmox 4
* Select EXT4 or XFS for the root FS
* Continue with sane settings for the OS install
* apt-get update && apt-get dist-upgrade
* Set up IOMMU for Intel as per: https://pve.proxmox.com/wiki/Pci_passthrough
* Set up 2 VMs with 8 GB RAM each
* Pass one GPU and one set of HID devices to each VM
* Verify functional multi-seat

ZFS:
* Set up ZFS as per: https://pve.proxmox.com/wiki/Storage:_ZFS
* Limit ARC to 4 GB
* Set up a pool for the VM root disks
* Create VMs in the pool, same as above
* Start VM 1, then start VM 2
* The VMs will likely not crash immediately, although they might
* To reliably cause a GPU driver crash, run 3D-accelerated programs on both
* The nvidia driver will crash on one VM, followed shortly by the other

MDADM/LVM:
* Set up an mdadm raid array
* Create PV/VG/LV for the VM root disks
* Create VMs there, same as above
* Start VM 1, then start VM 2
* The first VM's display will become scrambled, followed by the 2nd one shortly after, with no message of a GPU driver crash

There is a difference of degree depending on the software raid. In the case of ZFS there is a good deal of variability in when the VMs will crash. On some occasions both VMs will run for extended periods of time without issue, provided only one is doing anything requiring significant 3D hardware acceleration. In the case of MDADM/LVM, simply starting a second VM, even with no attached PCI or USB devices, will cause the 1st VM to crash before the 2nd has booted, and then the 2nd will crash.

This is only the case (thus far) when the VMs are Windows 10 and on software raid. LXC containers or BSD-based KVM guests do not cause any problems, although I have not tried passing hardware to them. One VM at a time, even with GPU passthrough, always works well, almost surprisingly so. Similarly, so does running both VMs concurrently, even when both are performing 3D acceleration, provided the VMs' root disks are not on software RAID.

I have not tried earlier versions of Windows, or Linux kernel versions prior to 4. I have not tried with the root disks of the VMs on non-raid and other disks on raid. I have not tried "BIOS RAID" yet, though that is probably my next step, pending a possible response from the list.
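One more thing I intend to try before then, on the host side (just a sketch of generic diagnostics, nothing Proxmox-specific): since this X58 platform forces me to run with allow_unsafe_interrupts=1, I want to watch the host kernel log the moment the second VM starts, in case the host records a DMAR fault or a vfio/MSI error that never shows up inside the guests:

# What the platform actually reports for the IOMMU and interrupt remapping
dmesg | grep -i -e DMAR -e IOMMU -e remapping

# Follow host kernel messages live while starting the second VM,
# watching for vfio, DMAR fault, or MSI-related errors
dmesg -w | grep --line-buffered -i -e vfio -e dmar -e msi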
Thanks in advance,
Brian