Re: [libvirt PATCH v3 00/21] Add support for persistent mediated devices

Jonathon Jongsma <jjongsma@xxxxxxxxxx> · Tue, 2 Feb 2021 16:35:33 -0600

On Tue, 2 Feb 2021 08:28:52 +0100
Erik Skultety <eskultet@xxxxxxxxxx> wrote:

> On Mon, Feb 01, 2021 at 04:38:56PM -0700, Alex Williamson wrote:
> > On Mon, 1 Feb 2021 16:57:44 -0600
> > Jonathon Jongsma <jjongsma@xxxxxxxxxx> wrote:
> >   
> > > On Mon, 1 Feb 2021 11:33:08 +0100
> > > Erik Skultety <eskultet@xxxxxxxxxx> wrote:
> > >   
> > > > On Mon, Feb 01, 2021 at 09:48:11AM +0000, Daniel P. Berrangé
> > > > wrote:    
> > > > > On Mon, Feb 01, 2021 at 10:45:43AM +0100, Erik Skultety
> > > > > wrote:      
> > > > > > On Mon, Feb 01, 2021 at 09:42:32AM +0000, Daniel P. Berrangé
> > > > > > wrote:      
> > > > > > > On Fri, Jan 29, 2021 at 05:34:36PM -0600, Jonathon Jongsma
> > > > > > > wrote:      
> > > > > > > > On Thu, 7 Jan 2021 17:43:54 +0100
> > > > > > > > Erik Skultety <eskultet@xxxxxxxxxx> wrote:
> > > > > > > >       
> > > > > > > > > > Tested with v6.10.0-283-g1948d4e61e.
> > > > > > > > > > 
> > > > > > > > > > 1.Can define/start/destroy mdev device successfully;
> > > > > > > > > > 
> > > > > > > > > > 2.'virsh nodedev-list' has no '--active' option,
> > > > > > > > > > which is inconsistent with the description in the
> > > > > > > > > > patch： # virsh nodedev-list --active
> > > > > > > > > > error: command 'nodedev-list' doesn't support option
> > > > > > > > > > --active
> > > > > > > > > > 
> > > > > > > > > > 3.virsh client hang when trying to destroy a mdev
> > > > > > > > > > device which is using by a vm, and after that all
> > > > > > > > > > 'virsh nodev*' cmds will hang. If restarting
> > > > > > > > > > llibvirtd after that, libvirtd will hang.        
> > > > > > > > > 
> > > > > > > > > It hangs because underneath a write to the 'remove'
> > > > > > > > > sysfs attribute is now blocking for some reason and
> > > > > > > > > since we're relying on mdevctl to do it for us, hence
> > > > > > > > > "it hangs". I'm not trying to make an excuse, it's
> > > > > > > > > plain wrong. I'd love to rely on such a basic
> > > > > > > > > functionality, but it looks like we'll have to go
> > > > > > > > > with a extremely ugly workaround and try to get the
> > > > > > > > > list of active domains from the nodedev driver and
> > > > > > > > > see whether any of them has the device assigned
> > > > > > > > > before we try to destroy the mdev via the nodedev
> > > > > > > > > driver.      
> > > > > > > > 
> > > > > > > > So, I've been trying to figure out a way to do this,
> > > > > > > > but as far as I know, there's no way to get a list of
> > > > > > > > active domains from within the nodedev driver, and I
> > > > > > > > can't think of any better ways to handle it. Any ideas?
> > > > > > > >      
> > > > > > > 
> > > > > > > Correct, the nodedev driver isn't permitted to talk to
> > > > > > > any of the virt drivers.      
> > > > > > 
> > > > > > Oh, not even via secondary connection? What makes nodedev so
> > > > > > special, since we can open a secondary connection from e.g.
> > > > > > the storage driver?      
> > > > > 
> > > > > It is technically possible, but it should never be done,
> > > > > because it introduces a bi-directional dependancy between the
> > > > > daemons which introduces the danger of deadlocking them. None
> > > > > of the secondary drivers should connect to the hypervisor
> > > > > drivers. 
> > > > > > > Is there anything in sysfs which reports whether the
> > > > > > > device is in use ?      
> > > > > > 
> > > > > > Nothing that I know of, the way it used to work was that you
> > > > > > tried to write to sysfs and kernel returned a write error
> > > > > > with "device in use" or something like that, but that has
> > > > > > changed since :(.      
> > > > 
> > > > Without having tried this and since mdevctl is just a Bash
> > > > script, can we bypass mdevctl on destroys a little bit by
> > > > constructing the path to the sysfs attribute ourselves and
> > > > perform a non-blocking write of zero bytes to see if we get an
> > > > error? If so, we'll skip mdevctl and report an error. If we
> > > > don't, we'll invoke mdevctl to remove the device in order to
> > > > remain consistent. Would that be an acceptable workaround
> > > > (provided it would work)?    
> > > 
> > > As far as I can tell, this doesn't work. According to my
> > > tests, attempting to write zero bytes to $(mdev_sysfs_path)/remove
> > > doesn't result in an error if the mdev is in use by a vm. It just
> > > "successfully" writes zero bytes. Adding Alex to cc in case he
> > > has an idea for a workaround here.  
> > 
> > [Cc +Connie]
> > 
> > I'm not really sure why mdevs are unique here.  When we write to
> > remove, the first step is to release the device from the driver, so
> > it's really the same as an unbind for a vfio-pci device.  PCI
> > drivers cannot return an error, an unbind is handled not as a
> > request, but a directive, so when the device is in use we block
> > until the unbind can complete.  With vfio-pci (and added upstream
> > to the mdev core - depending on vendor support), the driver remove
> > callback triggers a virtual interrupt to the user asking to
> > cooperatively return the device (triggering a hot-unplug in QEMU).
> > Has this really worked so well in vfio-pci that we've forgotten
> > that an unbind can block there too or are we better about tracking
> > something with PCI devices vs mdevs?  
> 
> Does any of the current vendor guest drivers for mdev support unplug?
> While I'm not trying to argue that unpluging a vfio-pci cannot block,
> it just works seamlessly in majority of cases nowadays, but I guess
> we were in the same situation with PCI assignment in the past?
> 
> The whole point here is IMO about a massive inconvenience for a
> library consumer to be blocked on an operation and not knowing why,
> whereas when you return an instant error saying why the operation
> cannot be completed right now that opens the door for a necessary
> adjustment in their usage of the library.
> 
> > 
> > On idea for a solution would be that vfio only allows a single open
> > of a group at a time, so if libvirt were to open the group it could
> > know that it's unused.  If you can manage to close the group once
> > you've already triggered the remove/unbind, then I'd think the
> > completion of the write would be deterministic.  If the group is in
> > use elsewhere, the open should get back an -EBUSY and you probably
> > ought to be careful  
> 
> Honestly, ^this seems like a fairly straightforward workaround to me.
> 
> Erik

Yes, thanks. This seems pretty clean and simple to to implement, and I
can't really think of any significant downsides.

Jonathon

> 
> > about removing/unbinding it anyway.  It might be possible to
> > implement this in mdevctl too, ie. redirect /dev/null to group file
> > and fork, fork the echo 1 > remove, kill the redirect, return a
> > device in use error if the initial redirect fails.  Thanks,
> > 
> > Alex