On Tue, 2 Feb 2021 08:28:52 +0100
Erik Skultety <eskultet@xxxxxxxxxx> wrote:

> On Mon, Feb 01, 2021 at 04:38:56PM -0700, Alex Williamson wrote:
> > On Mon, 1 Feb 2021 16:57:44 -0600
> > Jonathon Jongsma <jjongsma@xxxxxxxxxx> wrote:
> >
> > > On Mon, 1 Feb 2021 11:33:08 +0100
> > > Erik Skultety <eskultet@xxxxxxxxxx> wrote:
> > >
> > > > On Mon, Feb 01, 2021 at 09:48:11AM +0000, Daniel P. Berrangé wrote:
> > > > > On Mon, Feb 01, 2021 at 10:45:43AM +0100, Erik Skultety wrote:
> > > > > > On Mon, Feb 01, 2021 at 09:42:32AM +0000, Daniel P. Berrangé wrote:
> > > > > > > On Fri, Jan 29, 2021 at 05:34:36PM -0600, Jonathon Jongsma wrote:
> > > > > > > > On Thu, 7 Jan 2021 17:43:54 +0100
> > > > > > > > Erik Skultety <eskultet@xxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > > > Tested with v6.10.0-283-g1948d4e61e.
> > > > > > > > > >
> > > > > > > > > > 1. Can define/start/destroy mdev device successfully;
> > > > > > > > > >
> > > > > > > > > > 2. 'virsh nodedev-list' has no '--active' option, which is
> > > > > > > > > >    inconsistent with the description in the patch:
> > > > > > > > > >    # virsh nodedev-list --active
> > > > > > > > > >    error: command 'nodedev-list' doesn't support option --active
> > > > > > > > > >
> > > > > > > > > > 3. The virsh client hangs when trying to destroy an mdev device
> > > > > > > > > >    which is in use by a VM, and after that all 'virsh nodedev*'
> > > > > > > > > >    commands hang. If libvirtd is restarted after that, libvirtd
> > > > > > > > > >    will hang.
> > > > > > > > >
> > > > > > > > > It hangs because, underneath, the write to the 'remove' sysfs
> > > > > > > > > attribute now blocks for some reason, and since we're relying
> > > > > > > > > on mdevctl to do it for us, virsh hangs as well. I'm not
> > > > > > > > > trying to make an excuse; it's plain wrong.
> > > > > > > > > I'd love to rely on such basic functionality, but it looks
> > > > > > > > > like we'll have to go with an extremely ugly workaround: get
> > > > > > > > > the list of active domains from the nodedev driver and see
> > > > > > > > > whether any of them has the device assigned before we try to
> > > > > > > > > destroy the mdev via the nodedev driver.
> > > > > > > >
> > > > > > > > So, I've been trying to figure out a way to do this, but as
> > > > > > > > far as I know there's no way to get a list of active domains
> > > > > > > > from within the nodedev driver, and I can't think of any
> > > > > > > > better way to handle it. Any ideas?
> > > > > > >
> > > > > > > Correct, the nodedev driver isn't permitted to talk to any of
> > > > > > > the virt drivers.
> > > > > >
> > > > > > Oh, not even via a secondary connection? What makes nodedev so
> > > > > > special, since we can open a secondary connection from e.g. the
> > > > > > storage driver?
> > > > >
> > > > > It is technically possible, but it should never be done, because
> > > > > it introduces a bi-directional dependency between the daemons,
> > > > > which brings the danger of deadlocking them. None of the secondary
> > > > > drivers should connect to the hypervisor drivers.
> > > > >
> > > > > > > Is there anything in sysfs which reports whether the device is
> > > > > > > in use?
> > > > > >
> > > > > > Nothing that I know of. The way it used to work was that you
> > > > > > tried to write to sysfs and the kernel returned a write error
> > > > > > with "device in use" or something like that, but that has
> > > > > > changed since :(.
> > > >
> > > > Without having tried this, and since mdevctl is just a Bash script,
> > > > can we bypass mdevctl on destroy a little bit by constructing the
> > > > path to the sysfs attribute ourselves and performing a non-blocking
> > > > write of zero bytes to see whether we get an error? If so, we'll
> > > > skip mdevctl and report an error.
> > > > If we don't, we'll invoke mdevctl to remove the device in order to
> > > > remain consistent. Would that be an acceptable workaround (provided
> > > > it works)?
> > >
> > > As far as I can tell, this doesn't work. According to my tests,
> > > attempting to write zero bytes to $(mdev_sysfs_path)/remove doesn't
> > > result in an error if the mdev is in use by a VM. It just
> > > "successfully" writes zero bytes. Adding Alex to cc in case he has
> > > an idea for a workaround here.
> >
> > [Cc +Connie]
> >
> > I'm not really sure why mdevs are unique here. When we write to
> > remove, the first step is to release the device from the driver, so
> > it's really the same as an unbind for a vfio-pci device. PCI drivers
> > cannot return an error; an unbind is handled not as a request but as
> > a directive, so when the device is in use we block until the unbind
> > can complete. With vfio-pci (and added upstream to the mdev core,
> > depending on vendor support), the driver remove callback triggers a
> > virtual interrupt to the user asking it to cooperatively return the
> > device (triggering a hot-unplug in QEMU). Has this really worked so
> > well in vfio-pci that we've forgotten that an unbind can block there
> > too, or are we better about tracking something with PCI devices vs
> > mdevs?
>
> Do any of the current vendor guest drivers for mdev support unplug?
> While I'm not trying to argue that unplugging a vfio-pci device cannot
> block, it just works seamlessly in the majority of cases nowadays; but
> I guess we were in the same situation with PCI assignment in the past?

I think this will only work for ccw right now (when using QEMU from
current git). Channel devices always support unplug (it's not a request
on the guest side, it's a notification).
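For concreteness, the zero-byte probe Jonathon tested above amounts to something like the following sketch (the function name and the sysfs layout, e.g. /sys/bus/mdev/devices/<uuid>, are my assumptions). As reported, it does not work: the zero-byte write "succeeds" whether or not a VM holds the device, so it cannot distinguish busy from free.

```shell
# Sketch of the (non-working) zero-byte probe; not actual libvirt or
# mdevctl code.
probe_mdev_removable() {
    # $1 is the mdev's sysfs directory (assumed layout, e.g.
    # /sys/bus/mdev/devices/<uuid>). A zero-byte write to its 'remove'
    # attribute succeeds even while a VM is using the device, so a
    # zero status here tells us nothing about whether a real remove
    # would block.
    printf '' > "$1/remove"
}
```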
>
> The whole point here is IMO the massive inconvenience for a library
> consumer of being blocked on an operation without knowing why, whereas
> returning an instant error saying why the operation cannot be
> completed right now opens the door for the necessary adjustment in
> their usage of the library.
>
> > One idea for a solution would be that vfio only allows a single open
> > of a group at a time, so if libvirt were to open the group it could
> > know that it's unused. If you can manage to close the group once
> > you've already triggered the remove/unbind, then I'd think the
> > completion of the write would be deterministic. If the group is in
> > use elsewhere, the open should get back an -EBUSY and you probably
> > ought to be careful
>
> Honestly, ^this seems like a fairly straightforward workaround to me.
>
> Erik
>
> > about removing/unbinding it anyway. It might be possible to
> > implement this in mdevctl too, i.e. redirect /dev/null to the group
> > file and fork, fork the echo 1 > remove, kill the redirect, and
> > return a device-in-use error if the initial redirect fails. Thanks,
> >
> > Alex

Hacking something like that into mdevctl might be a good idea.
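Alex's suggestion could look roughly like this shell sketch (everything here is my assumption, not mdevctl code: the function name, the probe/hold split, and the arguments, where $1 would be the VFIO group file such as /dev/vfio/<group number> and $2 the mdev's sysfs directory):

```shell
# Rough sketch of the proposed workaround; not actual mdevctl code.
remove_mdev() {
    group_path=$1 mdev_path=$2

    # Step 1: probe. VFIO allows only one open of a group at a time, so
    # this open fails with EBUSY while a VM still holds the device.
    if ! true 2>/dev/null <> "$group_path"; then
        echo "device in use" >&2
        return 1
    fi

    # Step 2: re-open the group and hold it via a background sleep, so
    # that the write below can only be blocked by us.
    sleep 30 2>/dev/null <> "$group_path" &
    holder=$!

    # Step 3: fork the write to 'remove'; it blocks while we hold the
    # group.
    echo 1 > "$mdev_path/remove" &
    writer=$!

    # Step 4: drop our hold so the driver release, and with it the
    # write, can complete promptly; report the writer's status.
    kill "$holder" 2>/dev/null
    wait "$writer"
}
```

Note the small race between the probe in step 1 and the re-open in step 2: a VM could grab the group in between, which is presumably part of what Alex means by "you probably ought to be careful"; a real implementation would keep a single fd open from the start.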