On Wed, Dec 07, 2022 at 08:54:15AM +0100, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 03:15:41PM -0400, Jason Gunthorpe wrote:
> > What the kernel is doing is providing the abstraction to link the
> > controlling function to the VFIO device in a general way.
> >
> > We don't want to just punt this problem to user space and say 'good
> > luck finding the right cdev for migration control'. If the kernel
> > struggles to link them then userspace will not fare better on its own.
>
> Yes. But the right interface for that is to issue the userspace
> commands for anything that is not normal PCIe function level
> to the controlling function, and to discover the controlled functions
> based on the controlling functions.

I don't think we should mix up how the HW works - which PCI function
the commands are executed at - with how the uAPI is designed.

The VFIO design assumes that the "vfio migration driver" will talk to
both functions under the hood, and I don't see a fundamental problem
with this beyond it being awkward with the driver core. It is done
this way deliberately.

When we did the work on this we found there are problems with some
device models. For instance, if you suspend a device and then wrongly
touch its MMIO, it can trigger AER and maybe even a machine check
crash. One of the roles of the VFIO driver is to fix these HW bugs
and make the uAPI safe, eg by revoking mmaps or whatever.

Even more importantly, we do not want migration to ever be operated
unless VFIO is in control of the device. In general, migration resume
basically allows userspace to command the device to issue effectively
any DMA. This is a kernel security problem if the device is not
constrained by a user iommu_domain - for security we must not allow
userspace to resume a VF that is under kernel control and potentially
attached to a passthrough iommu_domain.

VFIO provides the security model to make all of this safe - the two
concepts cannot become disconnected.
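To make the "one cdev operates migration" point concrete, here is a
hedged sketch of how userspace drives the v2 migration uAPI through
the VFIO device FD, using VFIO_DEVICE_FEATURE and
VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE from <linux/vfio.h> (Linux
5.17+). Error handling is trimmed, and device_fd is assumed to be a
VFIO device FD whose device is already attached to a user-owned
iommu_domain:

```c
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int vfio_set_mig_state(int device_fd, __u32 new_state)
{
	struct {
		struct vfio_device_feature feature;
		struct vfio_device_feature_mig_state state;
	} req = {};

	req.feature.argsz = sizeof(req);
	req.feature.flags = VFIO_DEVICE_FEATURE_SET |
			    VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE;
	req.state.device_state = new_state;	/* eg VFIO_DEVICE_STATE_RESUMING */

	if (ioctl(device_fd, VFIO_DEVICE_FEATURE, &req.feature))
		return -1;

	/*
	 * For transitions with a data phase (STOP_COPY, RESUMING) the
	 * kernel returns a data_fd carrying the migration stream. The
	 * VFIO driver only allows this while it, not a kernel driver,
	 * owns the function, so the resumed device state can only DMA
	 * through the user's iommu_domain.
	 */
	return req.state.data_fd;
}
```

Which function the driver actually sends the underlying commands to
(VF, controlling PF, or both) is invisible at this layer, which is
the point.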
Even if we create a new migration uAPI, it just means that the nvme
driver has to be awkwardly aware of VFIO VF drivers and interlock
with their state, and the uAPI is useless unless you already have a
VFIO FD open.

Even the basic assumption that there would be a controlling/controlled
relationship is not universally true. The mdev type drivers and
SIOV-like devices are unlikely to have one. Once you can use PASID
the reasons to split things at the HW level go away, and a VF could
certainly self-migrate.

VFIO's goal is to abstract all of the above. You get one char device
that inherently provides the security guarantees required to operate
migration. It provides all the necessary hooks to fix up HW issues,
which so far every merged device has needed:

 - mlx5 has a weird issue where FLR on the VF resets the migration
   context; we fix that in the VFIO driver (in mlx5, even though the
   commands are issued via the PF, they are logically executed inside
   the VF context)

 - hisilicon has security problems because it doesn't have a
   controlling/controlled split, so it needs to carve out BAR MMIO
   maps and other stuff

So while I agree that, in principle, migration and the VFIO VF are
separate concerns, our broken reality makes them linked.

Even the idea that started this thread - that PF/PF could be a
problem - seems to have been explored by the Pensando RFC, which is
using aux devices to connect arbitrary PF/PF for their model.

The logical model we have been using on complex multi-function
devices (like every DPU driver) has been to split the driver into
subsystem-local code and thread all the pieces together with aux
devices. The model has a PCI driver that operates the lowest level of
the device, eg the 'admin queue', and then aux devices create
subsystem-local drivers (netdev, rdma, vdpa, iscsi, vfio, etc, etc)
that ride on the common API exported by the PCI driver.
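A hedged sketch of that layering, loosely modeled on the auxiliary
bus API in <linux/auxiliary_bus.h>. Everything named "nvmex_*" here
is hypothetical, not an existing driver; the shape is what matters -
the PF core driver publishes an aux device plus an ops table, and a
VFIO-subsystem-local driver binds to it:

```c
#include <linux/auxiliary_bus.h>
#include <linux/module.h>

/* API the PF 'admin queue' driver exports to subsystem-local users */
struct nvmex_mig_ops {
	int (*suspend_vf)(struct auxiliary_device *adev, u16 vf_id);
	int (*save_state)(struct auxiliary_device *adev, u16 vf_id,
			  void *buf, size_t len);
};

struct nvmex_adev {
	struct auxiliary_device adev;
	const struct nvmex_mig_ops *ops;
};

/* PF core driver side: publish one aux device per controlled function.
 * Caller must also set adev.dev.parent and adev.dev.release. */
static int nvmex_publish_vfio_adev(struct nvmex_adev *na, u32 id)
{
	int ret;

	na->adev.name = "vfio";		/* matched as "<modname>.vfio" */
	na->adev.id = id;
	ret = auxiliary_device_init(&na->adev);
	if (ret)
		return ret;
	return auxiliary_device_add(&na->adev);
}

/* VFIO-subsystem-local side: binds to the aux device and implements
 * the migration uAPI on top of the PF's ops. */
static int nvmex_vfio_probe(struct auxiliary_device *adev,
			    const struct auxiliary_device_id *id)
{
	struct nvmex_adev *na = container_of(adev, struct nvmex_adev, adev);

	/* register a vfio_device here, route migration through na->ops */
	return na->ops ? 0 : -ENODEV;
}

static const struct auxiliary_device_id nvmex_vfio_id_table[] = {
	{ .name = "nvmex_core.vfio" },
	{}
};

static struct auxiliary_driver nvmex_vfio_driver = {
	.probe = nvmex_vfio_probe,
	.id_table = nvmex_vfio_id_table,
};
module_auxiliary_driver(nvmex_vfio_driver);
```

Note this works the same whether the two sides are PF/VF, PF/PF, or a
self-migrating function - the aux device is just the rendezvous point.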
So, when you see both Intel and Pensando proposing this kind of
layered model for NVMe, where migration is subsystem-local to VFIO, I
think this is where the inspiration is coming from. Their native DPU
drivers already work this way.

Jason