On Tue, Oct 6, 2020 at 10:21 PM Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
>
> On Mon, 5 Oct 2020 11:05:05 -0400
> Marc Smith <msmith626@xxxxxxxxx> wrote:
>
> > Hi,
> >
> > I'm using QEMU/KVM on RHEL (CentOS) 7.8.2003:
> > # cat /etc/redhat-release
> > CentOS Linux release 7.8.2003
> >
> > I'm passing an NVMe drive into a Linux KVM virtual machine (<type
> > arch='x86_64' machine='pc-i440fx-rhel7.0.0'>hvm</type>) which has the
> > following 'hostdev' entry:
> >     <hostdev mode='subsystem' type='pci' managed='yes'>
> >       <driver name='vfio'/>
> >       <source>
> >         <address domain='0x0000' bus='0x42' slot='0x00' function='0x0'/>
> >       </source>
> >       <alias name='hostdev5'/>
> >       <rom bar='off'/>
> >       <address type='pci' domain='0x0000' bus='0x01' slot='0x0f' function='0x0'/>
> >     </hostdev>
> >
> > This all works fine during normal operation, but I noticed when we
> > remove the NVMe drive (surprise hotplug event), the PCIe EP then seems
> > "stuck"... here we see the link-down event on the host (when the drive
> > is removed):
> > [67720.177959] pciehp 0000:40:01.2:pcie004: Slot(238-1): Link Down
> > [67720.178027] vfio-pci 0000:42:00.0: Relaying device request to user (#0)
> >
> > And naturally, inside of the Linux VM, we see the NVMe controller drop:
> > [ 1203.491536] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> > [ 1203.522759] blk_update_request: I/O error, dev nvme1n2, sector 33554304 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
> > [ 1203.560505] nvme 0000:01:0f.0: Refused to change power state, currently in D3
> > [ 1203.561104] nvme nvme1: Removing after probe failure status: -19
> > [ 1203.583506] Buffer I/O error on dev nvme1n2, logical block 4194288, async page read
> > [ 1203.583514] blk_update_request: I/O error, dev nvme1n1, sector 33554304 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
> >
> > We see this EP is found at IOMMU group '76':
> > # readlink /sys/bus/pci/devices/0000\:42\:00.0/iommu_group
> > ../../../../kernel/iommu_groups/76
> >
> > And it is no longer bound to the 'vfio-pci' driver (expected) on the
> > host.
> > I was expecting all of the FDs for the /dev/vfio/NN character
> > devices to be closed, but it seems they are still open:
> > # lsof | grep "vfio/76"
> > qemu-kvm  242364          qemu  70u  CHR  235,4  0t0  3925324  /dev/vfio/76
> > qemu-kvm  242364  242502  qemu  70u  CHR  235,4  0t0  3925324  /dev/vfio/76
> > qemu-kvm  242364  242511  qemu  70u  CHR  235,4  0t0  3925324  /dev/vfio/76
> > qemu-kvm  242364  242518  qemu  70u  CHR  235,4  0t0  3925324  /dev/vfio/76
> > qemu-kvm  242364  242531  qemu  70u  CHR  235,4  0t0  3925324  /dev/vfio/76
> > qemu-kvm  242364  242533  qemu  70u  CHR  235,4  0t0  3925324  /dev/vfio/76
> > qemu-kvm  242364  242542  qemu  70u  CHR  235,4  0t0  3925324  /dev/vfio/76
> > qemu-kvm  242364  242550  qemu  70u  CHR  235,4  0t0  3925324  /dev/vfio/76
> > qemu-kvm  242364  242554  qemu  70u  CHR  235,4  0t0  3925324  /dev/vfio/76
> > SPICE     242364  242559  qemu  70u  CHR  235,4  0t0  3925324  /dev/vfio/76
> >
> > 100 seconds after the NVMe drive was removed, we see the following
> > kernel messages on the host:
> > [67820.179749] vfio-pci 0000:42:00.0: Relaying device request to user (#10)
> > [67900.272468] vfio_bar_restore: 0000:42:00.0 reset recovery - restoring bars
> > [67900.272652] vfio_bar_restore: 0000:42:00.0 reset recovery - restoring bars
> > [67900.319284] vfio_bar_restore: 0000:42:00.0 reset recovery - restoring bars
> >
> > I also noticed these messages related to the EP that is currently down;
> > they seem to continue indefinitely on the host (every 100 seconds):
> > [67920.181882] vfio-pci 0000:42:00.0: Relaying device request to user (#20)
> > [68020.184945] vfio-pci 0000:42:00.0: Relaying device request to user (#30)
> > [68120.188209] vfio-pci 0000:42:00.0: Relaying device request to user (#40)
> > [68220.190397] vfio-pci 0000:42:00.0: Relaying device request to user (#50)
> > [68320.192575] vfio-pci 0000:42:00.0: Relaying device request to user (#60)
> >
> > But perhaps that is expected behavior. In any case, the problem comes
> > when I re-insert the NVMe drive into the system... on the host, we see
> > the link-up event:
> > [68418.595101] pciehp 0000:40:01.2:pcie004: Slot(238-1): Link Up
> >
> > But the device is not bound to the 'vfio-pci' driver:
> > # ls -ltr /sys/bus/pci/devices/0000\:42\:00.0/driver
> > ls: cannot access /sys/bus/pci/devices/0000:42:00.0/driver: No such file or directory
> >
> > And attempting to bind it manually appears to fail:
> > # echo "0000:42:00.0" > /sys/bus/pci/drivers/vfio-pci/bind
> > -bash: echo: write error: No such device
> >
> > The device is enabled:
> > # cat /sys/bus/pci/devices/0000\:42\:00.0/enable
> > 1
> >
> > So, I'm wondering if this is expected behavior. Stopping the VM and
> > starting it (virsh destroy/start) allows the device to work in the VM
> > again, but for my particular use case this is not an option. I need the
> > surprise hotplug functionality to work with the PCIe EP passed into
> > the VM. And perhaps this is an issue elsewhere (e.g., vfio-pci). Any
> > tips/suggestions on where to dig more would be appreciated.
>
> Sorry, but nothing about what you're trying to accomplish is supported.
> vfio-pci only supports cooperative hotplug, and that's what it's trying
> to implement here. The internal kernel PCI object still needs to be torn
> down even after the device has been physically removed: the PCI core is
> trying to unbind it from the driver, which is where you're seeing the
> device requests being relayed to the user.
> The user (QEMU or guest) is probably hung up trying to access the
> device that no longer exists to respond to these unplug requests.
>
> Finally, you've added the device back, but there's an entire chain of
> policy decisions that needs to decide to bind that new device to
> vfio-pci, decide that this guest should have access to that device, and
> initiate a hot-add to the VM. That simply doesn't exist. Should this
> guest still have access to the device at that bus address? Why? What
> if it's an entirely new and different device? Who decides?

Understood, not supported currently.

>
> Someone needs to decide that this is a worthwhile feature to implement
> and invest time to work out all these details before it "just works".
> Perhaps you could share your use case to add weight to whether this is
> something that should be pursued. The behavior you see is expected and
> there is currently no ETA (or active development that I'm aware of) for
> the behavior you desire. Thanks,

In this case, I'm passing NVMe drives into a KVM virtual machine -- the
VM is then the "application" that uses these NVMe storage devices. Why?
Good question. =) Knowing how the current implementation works, I may
rethink this a bit.

Thanks for your time and information.

--Marc

>
> Alex
>
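
For readers who hit the same wall: the "chain of policy decisions" Alex
describes would, done by hand, boil down to roughly the sequence sketched
below once the guest has actually released the device. This is only a
sketch under assumptions -- the domain name "myvm" and the hostdev XML
file "nvme-hostdev.xml" are placeholders, the sysfs writes and virsh
commands are standard interfaces, and none of this is a supported or
tested recovery path for the surprise-removal situation in this thread.

# Hypothetical recovery sketch; "myvm" and nvme-hostdev.xml are placeholders.

# 1. Have libvirt/QEMU release the device so the open /dev/vfio/NN FDs close.
virsh detach-device myvm nvme-hostdev.xml --live

# 2. Drop the stale pci_dev left over from the surprise removal, then rescan
#    so the re-inserted drive is enumerated as a fresh device.
echo 1 > /sys/bus/pci/devices/0000:42:00.0/remove
echo 1 > /sys/bus/pci/rescan

# 3. If the host nvme driver claimed the new device on rescan, detach it,
#    then steer the device to vfio-pci and bind it.
echo 0000:42:00.0 > /sys/bus/pci/devices/0000:42:00.0/driver/unbind
echo vfio-pci > /sys/bus/pci/devices/0000:42:00.0/driver_override
echo 0000:42:00.0 > /sys/bus/pci/drivers/vfio-pci/bind

# 4. Hot-add the device back into the running guest.
virsh attach-device myvm nvme-hostdev.xml --live

With managed='yes' in the hostdev definition, libvirt normally performs the
vfio-pci (re)bind itself on attach, so step 3 may be redundant; it is shown
only to make explicit the decisions an automated policy layer would have to
take, and the detach in step 1 may itself fail while QEMU is hung relaying
device requests, as seen above.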