On Wed, Dec 07, 2022 at 02:52:03PM +0100, Christoph Hellwig wrote:
> On Wed, Dec 07, 2022 at 09:34:14AM -0400, Jason Gunthorpe wrote:
> > The VFIO design assumes that the "vfio migration driver" will talk to
> > both functions under the hood, and I don't see a fundamental problem
> > with this beyond it being awkward with the driver core.
>
> And while that is a fine concept per se, the current incarnation of
> that is fundamentally broken, as it is centered around the controlled
> VM. Which really can't work.

I don't see why you keep saying this. It is centered around the struct
vfio_device object in the kernel, which is definitely NOT the VM. The
struct vfio_device is the handle for the hypervisor to control the
physical assigned device - and it is the hypervisor that controls the
migration.

We do not need the hypervisor userspace to have a handle to the hidden
controlling function. It provides no additional functionality, security
or insight into what qemu needs to do. Keeping that relationship
abstracted inside the kernel is a reasonable choice and is not
"fundamentally broken".

> > Even the basic assumption that there would be a controlling/controlled
> > relationship is not universally true. The mdev type drivers, and
> > SIOV-like devices are unlikely to have that. Once you can use PASID
> > the reasons to split things at the HW level go away, and a VF could
> > certainly self-migrate.
>
> Even then you need a controlling and a controlled entity. The
> controlling entity even in SIOV remains a PCIe function. The
> controlled entity might just be a bunch of hardware resources and
> a PASID. Making it important again that all migration is driven
> by the controlling entity.

If they are the same driver implementing vfio_device you may be able to
claim they conceptually exist, but it is pretty artificial to draw this
kind of distinction inside a single driver.

> Also the whole concept that only VFIO can do live migration is
> a little bogus. With checkpoint and restart it absolutely
> does make sense to live migrate a container, and with that
> the hardware interface (e.g. nvme controller) assigned to it.

I agree people may want to do this, but it is very unclear how SRIOV
live migration can help here. SRIOV live migration is all about not
disturbing the kernel driver, and it assumes the same kernel driver is
on both sides. If you have two different kernels there is nothing worth
migrating. There isn't even an assurance the DMA API will have
IOMMU-mapped the same objects to the same IOVAs, so you have to
re-establish your admin queue, IO queues, etc. after migration anyhow.

Let alone how to solve the security problem of allowing userspace to
load arbitrary FW blobs into a device with potentially insecure DMA
access. At that point it isn't really the same kind of migration.
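
To be concrete about what "abstracted inside the kernel" means: a
migration driver's per-device state looks roughly like the sketch
below. All the foo_* names are invented for illustration (the mlx5
vfio driver is a real example of the pattern); the point is that the
controlling function is a private detail of the driver, never part of
the uAPI qemu sees:

#include <linux/pci.h>
#include <linux/vfio.h>
#include <linux/vfio_pci_core.h>

struct foo_vf_mig_device {
	struct vfio_pci_core_device core; /* the handle userspace operates on */
	struct pci_dev *controlling_pf;   /* hidden controlling function */
	enum vfio_device_mig_state mig_state;
};

static struct file *
foo_set_device_state(struct vfio_device *vdev,
		     enum vfio_device_mig_state new_state)
{
	struct foo_vf_mig_device *fdev =
		container_of(vdev, struct foo_vf_mig_device, core.vdev);

	/*
	 * The migration command is issued through the controlling PF on
	 * behalf of the controlled VF. Userspace only ever touches the
	 * vfio_device ioctls; foo_pf_issue_mig_cmd() is made up here to
	 * stand in for whatever the device's admin channel is.
	 */
	return foo_pf_issue_mig_cmd(fdev->controlling_pf, new_state);
}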
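
And on the IOVA point, remember the device itself gets programmed with
bus addresses the DMA API picked under the current kernel. Roughly
(again with invented foo_* names, in the style of an nvme admin queue):

#include <linux/dma-mapping.h>
#include <linux/io.h>
#include <linux/sizes.h>

struct foo_dev {
	struct device *dev;
	void __iomem *bar;
	void *asq;
	dma_addr_t asq_iova;
};

static int foo_create_admin_queue(struct foo_dev *fdev)
{
	fdev->asq = dma_alloc_coherent(fdev->dev, SZ_4K, &fdev->asq_iova,
				       GFP_KERNEL);
	if (!fdev->asq)
		return -ENOMEM;

	/* The HW now holds an IOVA that is only meaningful inside this
	 * kernel's IOMMU domain. */
	writeq(fdev->asq_iova, fdev->bar + FOO_REG_ASQ_BASE);
	return 0;
}

A second kernel restoring a checkpoint has no reason to reproduce that
mapping, so the queues have to be torn down and re-created, not
migrated.

Jason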