(--- long post warning ---) This is a work that derived from the discussions I had with Laine Stump and Alex Williamson in [1]. I'll provide a quick gist below. ---------- Today, Libvirt does not have proper support for partial assignment of functions of passed-through PCI multifunction devices (hostdev with VFIO-PCI). By partial assignment I mean the guest being able to use just some, not all, virtual functions of the device. Even if the functions itself became useless in the host, the some functions might not be safe to be used by the guest, thus the user should be able to limit it. I mentioned 'proper' because today it is possible to get this done in Libvirt if we use 'managed=no' in the hostdevs. If the user makes the proper setup (i.e. detaching all IOMMU devices), and use managed='no', Libvirt will launch the guest just with the functions declared in the XML. The technical reason for this is simple: in virHostdevPreparePCIDevices() we do not take into account that multifunction PCI devices requires the whole IOMMU to be detached, not just the devices being declared in def->hostdevs. In this case, managed='yes' will not work in this scenario, causing errors in QEMU launch. The discussion I've started in [1] was motivated by my attempt of automatically detaching the IOMMU inside the prepare function with managed='yes' devices. Laine discarded this idea, arguing that the concept of partial assignment will cause user confusion if Libvirt starts to handle things without the user being fully aware. In [1] it was discussed the possibility of declaring the functions that won't be assigned to the guest in the XML, forcing the user to be aware that these functions will be lost in the host, as a possible approach for a solution. ----------- These series tries to solve the partial assignment of multifunction hostdev PCI devices by introducing a new hostdev attribute called 'assigned'. This is how it works: - it is a boolean value that will be efffective just for multifunction hostdev PCI devices, since there's no other occurrence for this kind of use in Libvirt. Trying to declare assign='yes|no' in any other PCI hostdev device will cause parse errors; - default value if the attribute is not present is 'assigned=yes'; - <address> element will be forbidden if the hostdev is declared with assigned='no'. This is to make more evident to the user that this is a function that the guest will NOT be using, with a bonus that we will not need to calculate an address that won't be used; - with this new mechanic, all the functions of the PCI multifunction device must be declared in the XML, even if they're not going to be assigned to the guest. This is being enforced inside the Prepare() function to avoid XML parsing errors and existing domains being erased. For consistency, this is being enforced regardless of the 'managed' flag - even the managed='no' case is going to adhere with this restriction. This is a XML snippet of this new attribute, applied to a hostdev passthrough of a multifunction Broadcom BCM5719 PCIe card with 4 functions: <hostdev mode='subsystem' type='pci' managed='yes' assigned='yes'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x09' slot='0x00' function='0x0'/> </source> <address type='pci' domain='0x0000' bus='0x01' slot='0x01' function='0x0'/> </hostdev> <hostdev mode='subsystem' type='pci' managed='yes' assigned='no'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x09' slot='0x00' function='0x1'/> </source> </hostdev> <hostdev mode='subsystem' type='pci' managed='yes' assigned='no'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x09' slot='0x00' function='0x2'/> </source> </hostdev> <hostdev mode='subsystem' type='pci' managed='yes' assigned='yes'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x09' slot='0x00' function='0x3'/> </source> <address type='pci' domain='0x0000' bus='0x01' slot='0x01' function='0x3'/> </hostdev> lspci inside the guest: $ lspci 0000:00:01.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon 0000:00:03.0 USB controller: Red Hat, Inc. QEMU XHCI Host Controller (rev 01) 0000:00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device 0001:00:01.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01) 0001:00:01.3 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01) $ The intent of this design is to forbid a hostdev with 'assigned=no' to be attached to QEMU, to have an alias and to have a guest address. For all other purposes, including the processing done in virHostdevPreparePCIDevices(), the device is like any other hostdev - it will be added to activePCIHostdevs and, most important, it will be detached from the host IOMMU during guest start-up and will be re-attach back to the host with guest shutdown. In this very first interaction of this idea I most definitely missed/forgot to consider something, so any feedback is appreciated. This has direct implications into the multifunction pci hotplug/hot-unplug work (in fact, it was the hotplug work that revealed this gap in Libvirt support - I first noticed it seeing failed partial hotplug scenarios) and I would like to see it figured out, with this or any other approach, before continuing with the hotplug work. Thanks, DHB [1] https://www.redhat.com/archives/libvir-list/2019-July/msg01175.html Daniel Henrique Barboza (4): utils: introducing PCI multifunction detection helpers Introducing assigned='yes|no' hostdev attribute qemu_command: ignore non-assigned PCI multifunction hostdevs virhostdev: all functions of a PCI multifunction dev must be added docs/schemas/domaincommon.rng | 5 + src/conf/domain_conf.c | 36 ++++++- src/conf/domain_conf.h | 1 + src/libvirt_private.syms | 2 + src/qemu/qemu_alias.c | 8 ++ src/qemu/qemu_command.c | 5 + src/qemu/qemu_domain_address.c | 7 ++ src/util/virhostdev.c | 88 +++++++++++++++++- src/util/virhostdev.h | 3 + src/util/virpci.c | 15 +++ src/util/virpci.h | 2 + ...ostdev-pci-multifunction-partial-fail.args | 31 ++++++ ...hostdev-pci-multifunction-partial-fail.xml | 35 +++++++ .../hostdev-pci-multifunction-partial.args | 31 ++++++ .../hostdev-pci-multifunction-partial.xml | 41 ++++++++ .../hostdev-pci-multifunction.args | 7 +- .../hostdev-pci-multifunction.xml | 6 ++ tests/qemuxml2argvtest.c | 8 ++ .../hostdev-pci-multifunction-partial.xml | 57 ++++++++++++ .../hostdev-pci-multifunction.xml | 23 +++-- .../qemuxml2xmloutdata/pseries-hostdevs-1.xml | 4 +- .../qemuxml2xmloutdata/pseries-hostdevs-2.xml | 4 +- .../qemuxml2xmloutdata/pseries-hostdevs-3.xml | 4 +- tests/qemuxml2xmltest.c | 1 + tests/virpcitestdata/0005-90-01.1.config | Bin 256 -> 256 bytes tests/virpcitestdata/0005-90-01.2.config | Bin 256 -> 256 bytes tests/virpcitestdata/0005-90-01.3.config | Bin 0 -> 256 bytes tools/virsh-domain.c | 5 + 28 files changed, 408 insertions(+), 21 deletions(-) create mode 100644 tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.args create mode 100644 tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.xml create mode 100644 tests/qemuxml2argvdata/hostdev-pci-multifunction-partial.args create mode 100644 tests/qemuxml2argvdata/hostdev-pci-multifunction-partial.xml create mode 100644 tests/qemuxml2xmloutdata/hostdev-pci-multifunction-partial.xml create mode 100644 tests/virpcitestdata/0005-90-01.3.config -- 2.21.0 -- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list