On Tue, Oct 6, 2020 at 10:21 PM Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
>
> On Mon, 5 Oct 2020 11:05:05 -0400
> Marc Smith <msmith626@xxxxxxxxx> wrote:
>
> > Hi,
> >
> > I'm using QEMU/KVM on RHEL (CentOS) 7.8.2003:
> > # cat /etc/redhat-release
> > CentOS Linux release 7.8.2003
> >
> > I'm passing an NVMe drive into a Linux KVM virtual machine (<type
> > arch='x86_64' machine='pc-i440fx-rhel7.0.0'>hvm</type>) which has the
> > following 'hostdev' entry:
> >     <hostdev mode='subsystem' type='pci' managed='yes'>
> >       <driver name='vfio'/>
> >       <source>
> >         <address domain='0x0000' bus='0x42' slot='0x00' function='0x0'/>
> >       </source>
> >       <alias name='hostdev5'/>
> >       <rom bar='off'/>
> >       <address type='pci' domain='0x0000' bus='0x01' slot='0x0f' function='0x0'/>
> >     </hostdev>
> >
> > This all works fine during normal operation, but I noticed when we
> > remove the NVMe drive (surprise hotplug event), the PCIe EP then seems
> > "stuck"... here we see the link-down event on the host (when the drive
> > is removed):
> > [67720.177959] pciehp 0000:40:01.2:pcie004: Slot(238-1): Link Down
> > [67720.178027] vfio-pci 0000:42:00.0: Relaying device request to user (#0)
> >
> > And naturally, inside of the Linux VM, we see the NVMe controller drop:
> > [ 1203.491536] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> > [ 1203.522759] blk_update_request: I/O error, dev nvme1n2, sector 33554304 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
> > [ 1203.560505] nvme 0000:01:0f.0: Refused to change power state, currently in D3
> > [ 1203.561104] nvme nvme1: Removing after probe failure status: -19
> > [ 1203.583506] Buffer I/O error on dev nvme1n2, logical block 4194288, async page read
> > [ 1203.583514] blk_update_request: I/O error, dev nvme1n1, sector 33554304 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
> >
> > We see this EP is found at IOMMU group '76':
> > # readlink /sys/bus/pci/devices/0000\:42\:00.0/iommu_group
> > ../../../../kernel/iommu_groups/76
> >
> > And it is no longer bound to the 'vfio-pci' driver (expected) on the
> > host.
> > I was expecting all of the FDs for the /dev/vfio/NN character
> > devices to be closed, but it seems they are still open:
> > # lsof | grep "vfio/76"
> > qemu-kvm  242364          qemu  70u  CHR  235,4  0t0  3925324  /dev/vfio/76
> > qemu-kvm  242364  242502  qemu  70u  CHR  235,4  0t0  3925324  /dev/vfio/76
> > qemu-kvm  242364  242511  qemu  70u  CHR  235,4  0t0  3925324  /dev/vfio/76
> > qemu-kvm  242364  242518  qemu  70u  CHR  235,4  0t0  3925324  /dev/vfio/76
> > qemu-kvm  242364  242531  qemu  70u  CHR  235,4  0t0  3925324  /dev/vfio/76
> > qemu-kvm  242364  242533  qemu  70u  CHR  235,4  0t0  3925324  /dev/vfio/76
> > qemu-kvm  242364  242542  qemu  70u  CHR  235,4  0t0  3925324  /dev/vfio/76
> > qemu-kvm  242364  242550  qemu  70u  CHR  235,4  0t0  3925324  /dev/vfio/76
> > qemu-kvm  242364  242554  qemu  70u  CHR  235,4  0t0  3925324  /dev/vfio/76
> > SPICE     242364  242559  qemu  70u  CHR  235,4  0t0  3925324  /dev/vfio/76
> >
> > 100 seconds after the NVMe drive was removed, we see the following
> > kernel messages on the host:
> > [67820.179749] vfio-pci 0000:42:00.0: Relaying device request to user (#10)
> > [67900.272468] vfio_bar_restore: 0000:42:00.0 reset recovery - restoring bars
> > [67900.272652] vfio_bar_restore: 0000:42:00.0 reset recovery - restoring bars
> > [67900.319284] vfio_bar_restore: 0000:42:00.0 reset recovery - restoring bars
> >
> > I also noticed these messages related to the EP that is currently down;
> > they seem to continue indefinitely on the host (every 100 seconds):
> > [67920.181882] vfio-pci 0000:42:00.0: Relaying device request to user (#20)
> > [68020.184945] vfio-pci 0000:42:00.0: Relaying device request to user (#30)
> > [68120.188209] vfio-pci 0000:42:00.0: Relaying device request to user (#40)
> > [68220.190397] vfio-pci 0000:42:00.0: Relaying device request to user (#50)
> > [68320.192575] vfio-pci 0000:42:00.0: Relaying device request to user (#60)
> >
> > But perhaps that is expected behavior. In any case, the problem comes
> > when I re-insert the NVMe drive into the system... on the host, we see
> > the link-up event:
> > [68418.595101] pciehp 0000:40:01.2:pcie004: Slot(238-1): Link Up
> >
> > But the device is not bound to the 'vfio-pci' driver:
> > # ls -ltr /sys/bus/pci/devices/0000\:42\:00.0/driver
> > ls: cannot access /sys/bus/pci/devices/0000:42:00.0/driver: No such file or directory
> >
> > And attempting to bind it manually appears to fail:
> > # echo "0000:42:00.0" > /sys/bus/pci/drivers/vfio-pci/bind
> > -bash: echo: write error: No such device
> >
> > The device is enabled:
> > # cat /sys/bus/pci/devices/0000\:42\:00.0/enable
> > 1
> >
> > So, I'm wondering if this is expected behavior. Stopping the VM and
> > starting it (virsh destroy/start) allows the device to work in the VM
> > again, but for my particular use case this is not an option. I need the
> > surprise hotplug functionality to work with the PCIe EP passed into
> > the VM. And perhaps this is an issue elsewhere (e.g., vfio-pci). Any
> > tips/suggestions on where to dig more would be appreciated.
>
> Sorry, but nothing about what you're trying to accomplish is supported.
> vfio-pci only supports cooperative hotplug, and that's what it's trying
> to implement here. The internal kernel PCI object still needs to be torn
> down even after the device has been physically removed: the PCI core is
> trying to unbind it from the driver, which is where you're seeing the
> device requests being relayed to the user.
> The user (QEMU or guest) is probably hung up trying to access the
> device that no longer exists to respond to these unplug requests.
>
> Finally, you've added the device back, but there's an entire chain of
> policy decisions that needs to decide to bind that new device to
> vfio-pci, decide that this guest should have access to that device, and
> initiate a hot-add to the VM. That simply doesn't exist. Should this
> guest still have access to the device at that bus address? Why? What
> if it's an entirely new and different device? Who decides?

Understood, not supported currently.

>
> Someone needs to decide that this is a worthwhile feature to implement
> and invest time to work out all these details before it "just works".
> Perhaps you could share your use case to add weight to whether this is
> something that should be pursued. The behavior you see is expected and
> there is currently no ETA (or active development that I'm aware of) for
> the behavior you desire. Thanks,

In this case, I'm passing NVMe drives into a KVM virtual machine -- the
VM is then the "application" that uses these NVMe storage devices. Why?
Good question. =) Knowing how the current implementation works, I may
rethink this a bit.

Thanks for your time and information.

--Marc

>
> Alex
>
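
For readers who hit the same wall: the "chain of policy decisions" Alex
describes would, done by hand, boil down to roughly the sequence sketched
below once the guest has actually released the device. This is only a
sketch under assumptions -- the domain name "myvm" and the hostdev XML
file "nvme-hostdev.xml" are placeholders, the sysfs writes and virsh
commands are standard interfaces, and none of this is a supported or
tested recovery path for the surprise-removal situation in this thread.

# Hypothetical recovery sketch; "myvm" and nvme-hostdev.xml are placeholders.

# 1. Have libvirt/QEMU release the device so the open /dev/vfio/NN FDs close.
virsh detach-device myvm nvme-hostdev.xml --live

# 2. Drop the stale pci_dev left over from the surprise removal, then rescan
#    so the re-inserted drive is enumerated as a fresh device.
echo 1 > /sys/bus/pci/devices/0000:42:00.0/remove
echo 1 > /sys/bus/pci/rescan

# 3. If the host nvme driver claimed the new device on rescan, detach it,
#    then steer the device to vfio-pci and bind it.
echo 0000:42:00.0 > /sys/bus/pci/devices/0000:42:00.0/driver/unbind
echo vfio-pci > /sys/bus/pci/devices/0000:42:00.0/driver_override
echo 0000:42:00.0 > /sys/bus/pci/drivers/vfio-pci/bind

# 4. Hot-add the device back into the running guest.
virsh attach-device myvm nvme-hostdev.xml --live

With managed='yes' in the hostdev definition, libvirt normally performs the
vfio-pci (re)bind itself on attach, so step 3 may be redundant; it is shown
only to make explicit the decisions an automated policy layer would have to
take, and the detach in step 1 may itself fail while QEMU is hung relaying
device requests, as seen above.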