On Wed, Dec 07, 2022 at 04:08:02PM -0400, Jason Gunthorpe wrote: > However hisilicon managed to do their implementation without this, or > rather you could say their "controlling function" is a single MMIO BAR > page in their PCI VF and their "controlled function" is the rest of > the PCI VF. Eww. So you need to carefully filter the BAR and can't actually do any DMA at all? I'm not sure that is actually practical, especially not for something with a lot of state. > If the kernel knows this information then we can find a way for the > vfio_device to have pointers to both controlling and controlled > objects. I have a suggestion below. So now we need to write a vfio shim for every function even if there is absolutely nothing special about that function? Migrating really is the controlling functions behavior, and writing a new vfio bit for every controlled thing just does not scale. > I see it differently, the VFIO driver *is* the live migration > driver. Look at all the drivers that have come through and they are > 99% live migration code. Yes, and that's the problem, because they are associated with the controlled function, and now we have a communication problem between that vfio driver binding to the controlled function and the drive that actually controlls live migration that is associated with the controlling function. In other words: you've created a giant mess. > Excepting quirks and bugs sounds nice, except we actually can't ignore > them. I'm not proposing to ignore them. But they should not be needed most of the time. > For instance how do I trap FLR like mlx5 must do if the > drivers/live_migration code cannot intercept the FLR VFIO ioctl? > > How do I mangle and share the BAR like hisilicon does? > > Which is really why this is in VFIO in the first place. It actually is > coupled in practice, if not in theory. So you've created a long term userspace API around working around around buggy and/or misdesigned early designs and now want to force it down everyones throat? Can we please take a step back and think about how things should work, and only then think how to work around weirdo devices that do strange things as a second step? > If we accept that drivers/vfio can be the "live migration subsystem" > then lets go all the way and have the controlling driver to call > vfio_device_group_register() to create the VFIO char device for the > controlled function. While creating the VFs from the PF driver makes a lot more sense, remember that vfio is absolutely not the only use case for VFs. There are plenty use cases where you want to use them with the normal kernel driver as well. So the interface to create VFs needs a now to decide if it should be vfio exported, or use the normal kernel binding. > This solves the "sanely discover" problem because of course the > controlling function driver knows what the controlled function is and > it can acquire both functions before it calls > vfio_device_group_register(). Yes. > This is actually what I want to do anyhow for SIOV-like functions and > VFIO. Doing it for PCI VFs (or related PFs) is very nice symmetry. I > really dislike that our current SRIOV model in Linux forces the VF to > instantly exist without a chance for the controlling driver to > provision it. For SIOV you have no other choice anyway. But I agree that it is the right thing to do for VFIO. Now the next step is to control live migration from the controlling function, so that for most sane devices the controlled device does not need all the pointless boilerplate of its own vfio driver. > I'd really like to get away from VFIO having to do all this crazy > sysfs crap to activate its driver. I think there is a lot of appeal to > having, say, a nvmecli command that just commands the controlling > driver to provision a function, enable live migration, configure it > and then make it visible via VFIO. The same API regardless if the > underlying controlled function technology is PF/VF/SIOV. Yes.