Re: Disabling PCI "hot-unplug" for a guest (and/or a single PCI device)

Laine Stump <laine@xxxxxxxxxx> · Wed, 5 Feb 2020 11:29:58 -0500

On 2/4/20 1:43 PM, Daniel P. Berrangé wrote:
On Mon, Feb 03, 2020 at 05:19:51PM -0500, Laine Stump wrote:
Although I've never experienced it, due to not running Windows guests, I've
recently learned that a Windows guest permits a user (hopefully only one
with local admin privileges??!) to "hot-unplug" any PCI device. I've also
learned that some hypervisor admins don't want to permit admins of the
virtual machines they're managing to unplug PCI devices. I believe this is
impossible to prevent on an i440fx-based machinetype, and can only be done
on a q35-based machinetype by assigning the devices to the root bus (so that
they are seen as integrated devices) rather than to a pcie-root-port. But
when libvirt is assigning PCI addresses to devices in a q35-based guest, it
will *always* assign a PCIe device to a pcie-root-port specifically so that
hotplug is possible (this was done to maintain functional parity with i440fx
guests, where all PCI slots support hotplug).

After speaking with Alex & Laine on IRC, I learnt some further relevant
points

  - In a "typical" physical machine PCI slots will not be marked
    as hotpluggable - /sys/bus/pci/slots only has 2 entries on
    my HP DL180, corresponding to unused physical PCI slots

  - QEMU is hardcoded to report all pci & pcie-root-ports as hotpluggable.
    So /sys/bus/pci/slots on i440fx has 31 entries, one for every
    device, while on q35 it has one entry for every pcie-root-port
    IIUC.

  - It is conceptually possible to enhance pcie-root-port device
    to allow its hotplug capability to be toggled. Alternatively
    a parallel pcie-root-port-nohotplug device could be created.
    The end result would be the same from guest POV

  - The vfio-pci device has a companion vfio-pci-nohotplug
    device. The difference is simply whether the QEMU DeviceClass
    has the "hotpluggable" attribute set, and is separate from
    whether the PCI(e) root port has hotplug enabled

The last point here about vfio-pci is particularly important,
as it shows libvirt needs to be capable of tracking hotpluggability
independently for the PCI port and for the PCI device attached to
the port.
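
The slot counts mentioned above can be checked directly on a host or
inside a guest. A minimal Python sketch, assuming a Linux system where
each entry under /sys/bus/pci/slots contains an "address" file; the
function name list_pci_slots is made up for illustration:

```python
import os

def list_pci_slots(root="/sys/bus/pci/slots"):
    """Return {slot_name: address} for every slot the kernel exposes.

    Returns an empty dict when the directory does not exist (for
    example when not running on Linux).
    """
    slots = {}
    if not os.path.isdir(root):
        return slots
    for name in sorted(os.listdir(root)):
        addr_file = os.path.join(root, name, "address")
        try:
            with open(addr_file) as f:
                slots[name] = f.read().strip()
        except OSError:
            # Slot directory exists but has no readable address file
            slots[name] = None
    return slots

if __name__ == "__main__":
    for name, addr in list_pci_slots().items():
        print(name, addr)
```

Comparing the number of entries on a physical host with the number
inside an i440fx or q35 guest should show the difference described in
the list above.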

1) Rather than leaving the task of assigning the PCI addresses of devices to
libvirt (which is what essentially *all* management apps that use libvirt
currently do), the management application could itself directly assign the
PCI addresses of all devices to be slots on pcie.0.

[snip]

This is essentially a hack to work around the fact that the pcie-root-port
is hardcoded to report itself as hotpluggable.

As such I don't consider this a serious long term solution. If you
absolutely cannot wait for a newer libvirt/QEMU, this solution could
be used as a quick hack for mgmt apps, but long term we need to do
better.

2) libvirt could gain a knob "somewhere" in the domain XML to force a single
device, or all devices, to be assigned to a PCI address on pcie.0 rather
than on a pcie-root-port. This could be thought of as a "hint" about device
placement, as well as extra validation in the case that a PCI address has
been manually assigned. So, for example, let's say a "hotplug='disable'"
option is added somewhere at the top level of the domain (maybe "<hotplug
enable='no'/>" inside <features> or something like that); when PCI addresses
are assigned by libvirt, it would attempt to find a slot on a controller
that didn't support hotplug. And/or a similar knob could be added to each
device. In both cases, the setting would be used both when assigning PCI
addresses and also to validate user-provided PCI addresses to assure that
the desired criterion was met (otherwise someone would manually select a PCI
address on a controller that supported hotplug, but then set
"hotplug='disabled'" and expect hotplug to be magically disabled on the
slot).

Essentially this is using "hotpluggable=yes|no" on the device as
a policy knob to control device placement, to again work around the
fact that pcie-root-ports are hardcoded to always report themselves
as hotpluggable.

So I'd think this should be ruled out for the same reason as
option 1.

It has a second downside though. As we see from vfio-pci-nohotplug,
there is a valid use case for a "hotpluggable=yes|no" attribute on
a device for controlling a specific hardware config choice in QEMU.

We don't want to overload this attribute to both control use of
vfio-pci-nohotplug, and also be a policy knob for device placement.

3) qemu could add a "hotpluggable=no" commandline option to all PCI devices
(including vfio-pci) and then do whatever is necessary to make sure this is
honored in the emulated hardware (is it possible to set this on a per-slot
basis in a PCI controller? Or must it be done for an entire controller? I
suppose it's not as much of an issue for pcie-root-port, as long as you're
not using multiple functions). libvirt would then need to add this option to
the XML for each device, and management applications would need to set it -
it would essentially look the same to the management application, but it
would be implemented differently - instead of libvirt using that flag to
make a choice about which slot to assign, it would assign PCI addresses in
the same manner as before, and use the libvirt XML flag to set a QEMU
commandline flag for the device.

I think this, or something close to it is the desirable way forward
here, as it gives us more explicit control over what the emulated
hardware actually advertises. So instead of trying to work around
limitations of QEMU, we'd be working /with/ QEMU to improve its
feature offerings.

In QEMU pcie-root-port either needs to gain a hotpluggable=yes|no
attribute, or a second pcie-root-port-nohotplug needs adding.

Whichever of these two approaches is taken in QEMU, this can be
controlled from libvirt via an attribute on the controller, e.g.

    <controller type='pci' model='pcie-root-port' hotpluggable="no|yes"/>

This hotpluggable attribute can be mapped to whichever CLI syntax
QEMU wants to support.

This alone is probably sufficient for the Windows problem motivating
this thread.

There already exists the vfio-pci-nohotplug device, but this is not
exposed by libvirt. So we can add an attribute to <hostdev> to
control its use.

The remaining question is whether there's any compelling reason to
add non-hotpluggable variants of other devices, e.g. virtio-net-pci-nohotplug?

I'd probably /not/ do this, unless there's a clear compelling benefit
it gives which can't be achieved already via the pcie-root-port
hotpluggability controls.

For management applications, with Q35 we already recommend that they
explicitly add *many*

    <controller type='pci' model='pcie-root-port'/>

to new guests. Enough to cover all the initial cold-plugged devices,
and enough spare ports to enable future hotplug of extra devices.
OpenStack for example will add 32 pcie-root-ports, so that Q35 has
approximately the same hotplug capacity as i440fx would have offered.

To control hotplug, management apps simply need to tweak what they're
doing with pcie-root-ports with an extra attribute

eg, consider there were 4 devices on the initially booted VM which
need hotplug disabled, and we still want freedom to hotplug 2
extra devices at runtime.

    <controller type='pci' model='pcie-root-port' hotplug="no"/>
    <controller type='pci' model='pcie-root-port' hotplug="no"/>
    <controller type='pci' model='pcie-root-port' hotplug="no"/>
    <controller type='pci' model='pcie-root-port' hotplug="no"/>
    <controller type='pci' model='pcie-root-port' hotplug="yes"/>
    <controller type='pci' model='pcie-root-port' hotplug="yes"/>

This is quite easy,

Not quite so easy as you might think :-). The problem is that, in our 
fervor to make Q35 guests "as similar as possible" to 440fx guests, we 
have made the PCI address assignment code search out a hotplug-capable 
slot for each unassigned device, and if no hotpluggable slot is 
available it automatically adds a new root port so that there is a 
hotplug-capable slot. With the current set of VIR_PCI_CONNECT_TYPE 
flags, unless we lie to that code and tell it that these new root ports 
support hotplug, the devices will be assigned first to the two
available hotpluggable root ports (in your example above), and then when 
those are both used, it will start adding new root ports - the ports 
with hotplug='no' will never be used.
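
To make that concrete, here is a toy Python model of the behaviour just
described. It is only a sketch and not libvirt's actual assignment
code: RootPort and assign_requiring_hotplug are hypothetical names, and
the model ignores the root bus and everything other than root ports.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RootPort:
    index: int
    hotplug: bool
    device: Optional[str] = None

def assign_requiring_hotplug(ports: List[RootPort],
                             devices: List[str]) -> List[RootPort]:
    """Place each device on the first free hotplug-capable port,
    auto-adding a new hotplug-capable port when none is free."""
    for dev in devices:
        free = next((p for p in ports
                     if p.hotplug and p.device is None), None)
        if free is None:
            # No hotpluggable port left: add a new root port
            free = RootPort(index=len(ports), hotplug=True)
            ports.append(free)
        free.device = dev
    return ports

# The six controllers from the example: four hotplug="no", two "yes"
ports = ([RootPort(i, hotplug=False) for i in range(4)] +
         [RootPort(i, hotplug=True) for i in range(4, 6)])
assign_requiring_hotplug(ports, ["dev0", "dev1", "dev2", "dev3"])
for p in ports:
    print(p)
```

In this model the first two devices land on the two hotpluggable
ports, the next two each trigger a newly added port, and the four
hotplug="no" ports stay empty.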

On the other hand, if we change the code to *not* require a 
hotplug-capable port, then the new devices will just be assigned to 
slots on the root bus.

So the address assignment code is going to need to be re-jiggered. I 
guess we would need to create yet another PCI controller capability for 
use by the address assignment code - "Not 
Hot-Pluggable-But-Still-Okay-For-Auto-Assign-When-The-Guest-Is-Powered-Off".

So, a mental exercise - let's say we make a new virDomainPCIConnectFlag 
called VIR_PCI_CONNECT_AUTO_ASSIGNABLE (that's a bit easier to say than 
what I had in the previous paragraph). That connect type would be added 
to all the controllers that currently have CONNECT_HOTPLUGGABLE set:

    pci-root
    pci-bridge
    pcie-root
    pcie-pci-bridge
    pcie-root-port
    pcie-downstream-port

We then add a new controller model 
VIR_DOMAIN_CONTROLLER_PCIE_ROOT_PORT_NO_HOTPLUG and set it as 
AUTO_ASSIGNABLE, but *not* HOTPLUGGABLE.

Now when a device is added to "cold" config, instead of requiring 
HOTPLUGGABLE, we require AUTO_ASSIGNABLE, which will get the same 
assignment as before the change (except that the NO_HOTPLUG root ports 
will also be used). If we actually hotplug a device, then we continue to 
require the HOTPLUGGABLE connect flag.

So that much works, but we still have the problem that in order to add a 
device to the "cold" config that can later be hotplugged, we have to 
make sure all of the hotplug='no' controllers are in use.
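
The same mental exercise as a toy Python model. This is a sketch under
the assumptions of the paragraphs above, not real libvirt code: the
Connect flags, make_port and assign are hypothetical names, and only
root ports are modelled.

```python
from dataclasses import dataclass
from enum import IntFlag
from typing import List, Optional

class Connect(IntFlag):
    HOTPLUGGABLE = 1
    AUTO_ASSIGNABLE = 2  # the hypothetical new flag

@dataclass
class RootPort:
    index: int
    flags: Connect
    device: Optional[str] = None

def make_port(index: int, hotplug: bool) -> RootPort:
    """Every root port is AUTO_ASSIGNABLE; only hotplug ones are
    also HOTPLUGGABLE (the NO_HOTPLUG model lacks that flag)."""
    flags = Connect.AUTO_ASSIGNABLE
    if hotplug:
        flags |= Connect.HOTPLUGGABLE
    return RootPort(index, flags)

def assign(ports: List[RootPort], device: str,
           required: Connect) -> RootPort:
    """Use the first free port carrying all required flags, or
    auto-add an ordinary hotpluggable port if none matches."""
    for p in ports:
        if p.device is None and (p.flags & required) == required:
            p.device = device
            return p
    p = make_port(len(ports), hotplug=True)
    p.device = device
    ports.append(p)
    return p

ports = ([make_port(i, False) for i in range(4)] +
         [make_port(i, True) for i in range(4, 6)])
# Cold config only requires AUTO_ASSIGNABLE
for dev in ["dev0", "dev1", "dev2", "dev3"]:
    print(dev, "->", assign(ports, dev, Connect.AUTO_ASSIGNABLE).index)
# A live hotplug still requires HOTPLUGGABLE
print("hot ->", assign(ports, "hot0", Connect.HOTPLUGGABLE).index)
```

In this model the four cold devices fill the NO_HOTPLUG ports in order
and the hotplugged device takes the first hotpluggable port. It also
shows the remaining problem: a cold device that is meant to be
unpluggable later would still be given a NO_HOTPLUG port as long as
one of those is free.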

I guess that will work, but it's getting pretty obtuse and complicated 
just to avoid adding a single flag that says "this device shouldn't be 
put in a hotplug-capable slot" directly to the XML for the device that 
you don't want hotplugged. It would be *much* simpler if qemu could 
provide an attribute for endpoint devices that would disable hotplug for 
that device, rather than requiring the option to be set on the 
controller that the device is plugged into. Then libvirt could just have 
that attribute added to the XML, and the management app would just 
simply add "hotplug='no'" to each device. It's too bad that the method 
vfio-pci-nohotplug uses apparently doesn't work for PCIe...

as applications still do *not* have to take on
responsibility for full PCI device addressing. They merely need to
be able to count how many PCI devices they're using. The only "gotcha"
is if they forget about the auto-added USB, VGA and Balloon devices,
but that's not a big deal.
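
A minimal Python sketch of what "counting devices" could look like on
the management application side. The hotplug attribute is the one
proposed in this thread and may not match whatever libvirt ends up
accepting; root_port_controllers is a made-up helper name.

```python
import xml.etree.ElementTree as ET

def root_port_controllers(fixed_devices, spare_hotplug_ports):
    """Build one hotplug="no" root port per fixed device, followed by
    the requested number of hotplug="yes" spares.

    fixed_devices must include the auto-added devices (USB, VGA,
    balloon) if those should not be unpluggable either.
    """
    elems = []
    for hotplug, count in (("no", fixed_devices),
                           ("yes", spare_hotplug_ports)):
        for _ in range(count):
            elems.append(ET.Element("controller", {
                "type": "pci",
                "model": "pcie-root-port",
                "hotplug": hotplug,  # proposed attribute, an assumption
            }))
    return elems

for elem in root_port_controllers(4, 2):
    print(ET.tostring(elem, encoding="unicode"))
```

With (4, 2) this produces the same six controllers as the example
above, in the same order.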

Regards,
Daniel