On Thu, 29 Jun 2017 09:33:19 +0100
"Daniel P. Berrange" <berrange@xxxxxxxxxx> wrote:

> On Wed, Jun 28, 2017 at 07:24:55PM -0500, Michael Roth wrote:
> > Hi everyone. Hoping to get some feedback on this approach, or some
> > alternatives proposed below, to the following issue:
> >
> > Currently libvirt immediately attempts to rebind a managed device back to
> > the host driver when it receives a DEVICE_DELETED event from QEMU. This is
> > problematic for 2 reasons:
> >
> > 1) If multiple devices from a group are attached to a guest, this can move
> >    the group into a "non-viable" state where some devices are assigned to
> >    the host and some to the guest.
> >
> > 2) When QEMU emits the DEVICE_DELETED event, there's still a "finalize"
> >    phase where additional cleanup occurs. In most cases libvirt can ignore
> >    this cleanup, but in the case of VFIO devices this is where the closing
> >    of a VFIO group FD occurs, and failing to wait before rebinding the
> >    device to the host driver can result in unexpected behavior. On powernv
> >    hosts at least, this can lead to a host driver crash due to the default
> >    DMA windows not having been fully restored yet. The window between this
> >    cleanup and the initial DEVICE_DELETED seems to be ~6 seconds in
> >    practice. We've seen host dumps with Mellanox CX4 VFs being rebound to
> >    the host driver during this period (on powernv hosts).

I've been trying to tackle this from the kernel side too. Currently Linux's
driver model allows vfio bus drivers neither to NAK unbinding a device from an
in-use group, nor to NAK binding a device from an in-use group to an
incompatible driver. The issue you identify in QEMU/libvirt exacerbates the
problem, as QEMU has not yet finalized the device/group references before
libvirt tries to unbind the device from the vfio bus driver and attach it to a
host driver.
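[For readers following the thread: the event in question is QEMU's QMP
DEVICE_DELETED notification, which currently arrives before VFIO finalization
completes. It looks roughly like the following; the device id, QOM path, and
timestamp values here are made up for illustration:]

```json
{
  "event": "DEVICE_DELETED",
  "data": {
    "device": "hostdev0",
    "path": "/machine/peripheral/hostdev0"
  },
  "timestamp": { "seconds": 1498695295, "microseconds": 123456 }
}
```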
I'd love to solve this from both sides by allowing the kernel to prevent
driver binds that we'd consider compromising, and also by introducing a bit of
patience in the QEMU/libvirt path to avoid the kernel needing to impose that
driver blocking.

> Why on earth does QEMU's device finalization take 6 seconds to complete?
> That feels very broken to me, unless QEMU is not being scheduled due to
> the host being overcommitted. If that's not the case, then we have a bug to
> investigate in QEMU to find out why cleanup is delayed so long.

I wouldn't necessarily jump to the conclusion that this is a bug; if it's
related to tearing down the IOMMU mappings for the device, gigs of mappings
can take non-trivial time to unmap. Is that the scenario here? Is that 6s
somehow proportional to guest memory size?

> From libvirt's POV, we consider 'DEVICE_DELETED' as meaning both that the
> frontend has gone *and* the corresponding backend has gone. Aside from
> cleaning up the VFIO group, we use this as a trigger for all other
> device-related cleanup like SELinux labelling, cgroup device ACLs, etc. If
> the backend is not guaranteed to be closed in QEMU when this event is
> emitted, then either QEMU needs to delay the event until it is really
> cleaned up, or QEMU needs to add a further event to emit when the backend
> is clean.

Clearly libvirt's and QEMU's ideas of what DEVICE_DELETED means don't align.

> > Patches 1-4 address 1) by deferring rebinding of a hostdev to the host
> > driver until all the devices in the group have been detached, at which
> > point all the hostdevs are rebound as a group. Until that point, the
> > devices are tracked by the drvManager's inactiveList in a similar manner
> > to hostdevs that are assigned to VFIO via the nodedev-detach interface.
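[Editorial aside: the group membership that patches 1-4 defer on is visible to
any process via sysfs, since each PCI device's node carries an iommu_group
symlink and the group directory lists every member. A minimal sketch, not
libvirt's actual (C) code; the helper name and the sysfs root parameter are
mine, the sysfs layout is the standard Linux one:]

```python
import os

def iommu_group_devices(pci_addr, sysfs="/sys"):
    """Return the PCI addresses of every device in pci_addr's IOMMU group.

    The sysfs parameter exists only so the lookup can be pointed at a test
    tree; on a real host the default "/sys" is what you want.
    """
    # Each PCI device directory has an iommu_group symlink pointing at
    # /sys/kernel/iommu_groups/<N>; resolve it to find the group directory.
    link = os.path.join(sysfs, "bus", "pci", "devices", pci_addr, "iommu_group")
    group_dir = os.path.realpath(link)
    # The group's devices/ subdirectory names every member device.
    return sorted(os.listdir(os.path.join(group_dir, "devices")))
```

Deferring the rebind until every address returned here has been detached is
what keeps the group out of the "non-viable" half-host/half-guest state.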
There are certainly some benefits to group-awareness here. Currently an admin
user like libvirt can trigger a BUG_ON by trying to bind a device back to a
host driver while the group is still in use; at best we might improve that to
rejecting the compromising bind.

> > Patch 5 addresses 2) by adding an additional check that, when the last
> > device from a group is detached, polls /proc for open FDs referencing the
> > VFIO group path in /dev/vfio/<iommu_group> and waits for the FD to be
> > closed. If we time out, we abandon rebinding the hostdevs back to the
> > host.
>
> That is just gross - it is tying libvirt to details of the QEMU internal
> implementation. I really don't think we should be doing that. So NACK to
> this from my POV.

It seems a little silly for QEMU to emit the event while the device is still
in use; clearly emitting the event at the right point would negate any need
for snooping around in /proc.

> > There are a couple of alternatives to Patch 5 that might be worth
> > considering:
> >
> > a) Add a DEVICE_FINALIZED event to QEMU and wait for that instead of
> >    DEVICE_DELETED. Paired with patches 1-4, this would let us drop patch 5
> >    in favor of minimal changes to libvirt's event handlers.
> >
> >    The downsides are:
> >    - we'd incur some latency for all device-detach calls, but it's not
> >      apparent to me whether this delay is significant for anything outside
> >      of VFIO.
> >    - there may be cases where finalization after DEVICE_DELETED/unparent
> >      is not guaranteed, and I'm not sure QOM would encourage such
> >      expectations even if that's currently the case.
> >
> > b) Add a GROUP_DELETED event to VFIO's finalize callback. This is the most
> >    direct solution. With this we could completely separate out the
> >    handling of rebinding to the host driver based on receipt of this
> >    event.
> >
> >    The downsides are:
> >    - this would only work for newer versions of QEMU, though we could use
> >      the poll-wait in patch 5 as a fallback.
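[Editorial aside: the /proc poll-wait that Patch 5 describes amounts to the
following. This is a sketch in Python rather than libvirt's actual C code, and
the helper names are mine; it scans every /proc/<pid>/fd symlink for one that
resolves to the group's /dev/vfio node, then waits a bounded time for the last
holder to close it before rebinding:]

```python
import os
import time

def path_held_open(target):
    """Return True if any process has an open FD that resolves to target."""
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = os.path.join("/proc", pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:        # process exited, or we lack permission
            continue
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    return True
            except OSError:    # FD closed between listdir and readlink
                continue
    return False

def wait_group_closed(iommu_group, timeout=10.0, interval=0.5):
    """Poll until /dev/vfio/<iommu_group> has no open FDs; False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not path_held_open("/dev/vfio/%d" % iommu_group):
            return True        # safe to rebind the group's devices
        time.sleep(interval)
    return False               # timed out: abandon the rebind, per patch 5
```

The ~6s finalize window reported above is why the timeout has to be generous,
and the timeout's very existence is the fragility Daniel is objecting to.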
> >    - synchronizing sync/async device-detach threads with sync/async
> >      handlers for this would be a bit hairy, but I have a WIP in progress
> >      that seems *fairly reasonable*
> >
> > c) Take the approach in Patch 5, either as a precursor to implementing b)
> >    or something else, or just stick with that for now.
> >
> > d) ???
>
> Fix DEVICE_DELETED so it's only emitted when the backend associated with
> the device is fully cleaned up.

Adding a FINALIZED event seems to require a two-step fix - fix QEMU, then fix
libvirt - whereas moving DEVICE_DELETED to the correct location automatically
fixes the behavior with existing libvirt. I don't know that a GROUP_DELETED
event makes much sense either; libvirt can learn about groups on its own, and
it just leads to a vfio-specific path. Thanks,

Alex

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list