On 1/18/21 2:53 PM, Daniel Henrique Barboza wrote:
Gitlab issue #72 [1] reports that removing SR-IOVs VFs before
removing the devices from the running domains can have strange
consequences. QEMU might be able to hotunplug the device inside the
guest, but Libvirt will not be aware of that, and then the guest is
now inconsistent with the domain definition.
There's also the possibility of the VFs removal not succeeding
while the domain is running but then, as soon as the domain
is shutdown, all the VFs are removed. Libvirt can't handle
the removal of the PCI devices while trying to reattach the
hostdevs, and the Libvirt daemon can be left in an inconsistent
state (see [2]).
This patch starts to address the issue related in Gitlab #72, most
notably the issue described in [2]. When shutting down a domain
with SR-IOV hostdevs that got missing, virHostdevReAttachPCIDevices()
is failing the whole process and failing to reattach all the
PCI devices, including the ones that aren't related to the VFs that
went missing. Let's make it more resilient with host changes by
changing virHostdevGetPCIHostDevice() to return an exclusive error
code '-2' for this case. virHostdevGetPCIHostDeviceList() can then
tell when virHostdevGetPCIHostDevice() failed to find the PCI
device of a hostdev and continue to make the list of PCI devices.
virHostdevReAttachPCIDevices() will now be able to proceed reattaching
all other valid PCI devices, at least. The 'ghost hostdevs' will be
handled later on.
[1] https://gitlab.com/libvirt/libvirt/-/issues/72
[2] https://gitlab.com/libvirt/libvirt/-/issues/72#note_459032148
Signed-off-by: Daniel Henrique Barboza <danielhb413@xxxxxxxxx>
---
src/hypervisor/virhostdev.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/src/hypervisor/virhostdev.c b/src/hypervisor/virhostdev.c
index bd35397f2c..dbba36193b 100644
--- a/src/hypervisor/virhostdev.c
+++ b/src/hypervisor/virhostdev.c
@@ -220,7 +220,8 @@ virHostdevManagerGetDefault(void)
* is returned.
*
* Returns: 0 on success (@pci might be NULL though),
- * -1 otherwise (with error reported).
+ * -1 otherwise (with error reported),
+ * -2 PCI device not found. @pci will be NULL
*/
static int
virHostdevGetPCIHostDevice(const virDomainHostdevDef *hostdev,
@@ -235,6 +236,9 @@ virHostdevGetPCIHostDevice(const virDomainHostdevDef *hostdev,
hostdev->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI)
return 0;
+ if (!virPCIDeviceExists(&pcisrc->addr))
+ return -2;
You've created a special return code to mean "This is a PCI device, but
it isn't present on the host"...
+
actual = virPCIDeviceNew(&pcisrc->addr);
if (!actual)
@@ -270,7 +274,7 @@ virHostdevGetPCIHostDeviceList(virDomainHostdevDefPtr *hostdevs, int nhostdevs)
virDomainHostdevDefPtr hostdev = hostdevs[i];
g_autoptr(virPCIDevice) pci = NULL;
- if (virHostdevGetPCIHostDevice(hostdev, &pci) < 0)
+ if (virHostdevGetPCIHostDevice(hostdev, &pci) == -1)
return NULL;
...But you haven't actually used it. Will it become necessary later to
have this special value? right now a missing device is treated exactly
the same as if it isn't a PCI device.
I guess I can understand the conceptual desire to return an error of
some type in this case though, and there are other places where
something similar is done (-2 indicates some type of "odd" error), so
I'll let it pass :-)
Reviewed-by: Laine Stump <laine@xxxxxxxxxx>
if (!pci)