On Tue, Oct 10, 2023 at 11:26:42PM -0700, Christoph Hellwig wrote:
> On Tue, Oct 10, 2023 at 10:10:31AM -0300, Jason Gunthorpe wrote:
> > We've talked around ideas like allowing the VF config space to do some
> > of the work. For simple devices we could get away with 1 VF config
> > space register. (VF config space is owned by the hypervisor, not the
> > guest)
>
> Which assumes you're actually using VFs and not multiple PFs, which
> is a very limiting assumption.

? It doesn't matter VF/PF, the same function's config space could do
simple migration.

> It also limits you from actually using DMA during the live migration
> process, which again is a major limitation once you have a non-trivial
> amount of state.

Yes, this is a dealbreaker for big cases. But we do see several
smaller/simpler devices that don't use DMA in their migration.

> > SIOVr2 is discussing a more flexible RID mapping - there is a possible
> > route where a "VF" could actually have two RIDs, a hypervisor RID and a
> > guest RID.
>
> Well, then you go down the SIOV route, which requires a complex driver
> actually presenting the guest visible device anyway.

Yep

> Independent of my above points on the doubts on VF-controlled live
> migration for PCIe devices I absolutely agree with you that the Linux
> abstraction and user interface should be VF based. Which further
> reinforces my point that the VFIO driver for the controlled function
> (PF or VF) and the Linux driver for the controlling function (better
> be a PF in practice) must be very tightly integrated. And the best
> way to do that is to export the vfio nodes from the Linux driver
> that knows the hardware and not split it out into a separate one.

I'm not sure how we get to "very tightly integrated". We have many
examples of live migration vfio drivers now and they do not seem to
require tight integration. The PF driver only has to provide a way to
execute a small number of proxied operations.

Regardless, I'm not too fussed about what directory the implementation
lives in, though I do prefer the current arrangement where VFIO-only
stuff is in drivers/vfio. I like the process we have where subsystems
are responsible for the code that implements the subsystem ops.

> > However - the Intel GPU VFIO driver is such a bad experience I don't
> > want to encourage people to make VFIO drivers, or code that is only
> > used by VFIO drivers, that are not under drivers/vfio review.
>
> We can and should require vfio review for users of the vfio API.
> But to be honest code placement was not the problem with i915. The
> problem was that the mdev APIs (under drivers/vfio) were a complete
> trainwreck when it was written, and that the driver had a horrible
> hypervisor API abstraction.

E800 also made some significant security mistakes that the VFIO side
caught. I think they would have been missed if the driver had gone
through a netdev tree.

Even unrelated to mdev, the Intel GPU driver is still not using the
vfio side properly, and the way it hacked into KVM to try to get page
tracking is totally logically wrong (but Works For Me (tm)).

Aside from technical concerns, I do have a big process worry here.
vfio is responsible for the security side of the review of things
implementing its ops.

> > Be aware, there is a significant performance concern here. If you want
> > to create 1000 VFIO devices (this is a real thing), we *can't* probe a
> > normal driver first, it is too slow. We need a path that goes directly
> > from creating the RIDs to turning those RIDs into VFIO.
> And by calling the vfio functions from mlx5 you get this easily.

"easily" I don't know about that :)

> For mdev/SIOV like flows you must call vfio APIs from the main
> driver anyway, as there is no pci_dev to probe on anyway. That's
> what i915 does btw.

IMHO i915 is not a good example to copy.

mlx5 is already much closer to your ideal, and I would hold it up as
the right general direction for SIOV/mdev/etc, as we basically already
do a lot of the SIOV ideas.

mlx5 is a multi-subsystem device. It has driver components in net,
VDPA and infiniband. It can create non-PCI "functions". It is not
feasible, process wise, for all of this to live under one directory.
We *want* the driver split up by subsystem and subsystem maintainer.
So, we created the auxiliary_device stuff to manage this. It can do
what you are imagining, I think.

The core PF/VF driver is in charge of what to carve off to a subsystem
driver. IIRC mlx5 uses netlink to deliver commands to trigger this (eg
create a VDPA device). An auxiliary_device is created and the target
subsystem driver probes to that and autoloads. eg see
drivers/vdpa/mlx5/net/mlx5_vnet.c

They are not 'tightly coupled', the opposite really. The
auxiliary_device comes along with a mlx5 API that allows all the
subsystems to do what they need on the HW mostly independently. For
mlx5 this is mostly a way to execute FW RPC commands.

So, if you want to turn the VFIO stuff inside out, I'd still suggest
to have the VFIO driver part under drivers/vfio and probe to an
auxiliary_device that represents the aspect of the HW to turn into
VFIO (or VDPA, or whatever). The 'core' driver can provide an
appropriate API between its VFIO part and its core part.

We lack a common uAPI to trigger this creation, but otherwise the
infrastructure exists and works well now. It allows subsystems to
remain together and complex devices to spread their functionality to
multiple subsystems.

The current pci_iov_get_pf_drvdata() hack in VFIO is really a short
cut to doing the auxiliary_device stuff. (actually we tried to build
this with auxiliary_device first, it did not work out, needs more
driver core infrastructure). I can easily imagine all the current
VFIO drivers probing to an auxiliary_device and obtaining the VF
pci_dev and the handle for the core functionality directly, without
the pci_iov_get_pf_drvdata() approach.

> For "classic" vfio that requires a pci_dev (or $otherbus_dev) we need
> to have a similar flow. And I think the best way is to have the
> bus-level attribute on the device and/or a device-specific side band
> protocol to decide how new functions are probed. With that you
> avoid all the duplicate PCI IDs for the binding, and actually allow to
> sanely establish a communication channel between the functions.
> Because without that there is no way to know how any two functions
> are related. The driver might think they know, but there are all kinds
> of wacky PCI passthrough schemes that will break such a logic.

Yes, if things are not simple PF/VF then Linux struggles at the driver
core level. auxiliary_devices are a way out of that since one spot can
figure out how to assemble the multi-component device and then
delegate portions of the HW to other subsystems.

If something wants to probe its own driver to a PF/VF to assemble the
components it can do that and then bundle it up into an aux device and
trigger a VFIO/etc driver to run on that bundle of resources. We don't
*need* to put all the VFIO code someplace else to put the control over
slicing the HW into a shared core driver.
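To make the shape of this concrete, here is a rough sketch of what I
mean (hypothetical "foo" names throughout, not the real mlx5 or vfio
code, just illustrating the auxiliary bus pattern): the core PF driver
decides what to carve off and publishes it as an auxiliary_device
carrying the handles the VFIO part needs, and a driver living under
drivers/vfio binds to that:

/*
 * Rough sketch only - hypothetical "foo" driver, not real mlx5/vfio
 * code. Shows a core PF driver handing a slice of the HW to a VFIO
 * part via an auxiliary_device.
 */
#include <linux/auxiliary_bus.h>
#include <linux/module.h>
#include <linux/pci.h>
#include <linux/slab.h>

struct foo_core;	/* opaque core-driver handle, hypothetical */

struct foo_vfio_adev {
	struct auxiliary_device adev;
	struct foo_core *core;	/* for proxied FW commands */
	struct pci_dev *vf_dev;	/* the function being assigned */
};

static void foo_vfio_adev_release(struct device *dev)
{
	kfree(container_of(dev, struct foo_vfio_adev, adev.dev));
}

/* Core PF driver side: carve off a function and publish it */
static int foo_core_publish_vfio(struct foo_core *core,
				 struct pci_dev *vf_dev, int id)
{
	struct foo_vfio_adev *fadev;
	int ret;

	fadev = kzalloc(sizeof(*fadev), GFP_KERNEL);
	if (!fadev)
		return -ENOMEM;

	fadev->core = core;
	fadev->vf_dev = vf_dev;
	fadev->adev.name = "vfio";
	fadev->adev.id = id;
	fadev->adev.dev.parent = &vf_dev->dev;
	fadev->adev.dev.release = foo_vfio_adev_release;

	ret = auxiliary_device_init(&fadev->adev);
	if (ret) {
		kfree(fadev);
		return ret;
	}

	/* matches a driver bound to "foo.vfio" and autoloads it */
	ret = auxiliary_device_add(&fadev->adev);
	if (ret)
		auxiliary_device_uninit(&fadev->adev);
	return ret;
}

/* drivers/vfio/ side: gets the VF pci_dev and the core handle
 * straight from the aux device, no pci_iov_get_pf_drvdata()
 */
static int foo_vfio_probe(struct auxiliary_device *adev,
			  const struct auxiliary_device_id *id)
{
	struct foo_vfio_adev *fadev =
		container_of(adev, struct foo_vfio_adev, adev);

	/* build/register the vfio_device on fadev->vf_dev, using
	 * fadev->core for the small set of proxied PF operations
	 */
	return 0;
}

static const struct auxiliary_device_id foo_vfio_id_table[] = {
	{ .name = "foo.vfio" },
	{}
};
MODULE_DEVICE_TABLE(auxiliary, foo_vfio_id_table);

static struct auxiliary_driver foo_vfio_driver = {
	.name = "vfio",
	.probe = foo_vfio_probe,
	.id_table = foo_vfio_id_table,
};
module_auxiliary_driver(foo_vfio_driver);

The point being the VFIO part stays under drivers/vfio review while
the core driver keeps full control over how the HW gets sliced; the
missing piece is only the common uAPI to trigger something like
foo_core_publish_vfio().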
mlx5 and several other drivers already demonstrate all of this.

Jason