On Tue, Oct 10, 2023 at 11:26:42PM -0700, Christoph Hellwig wrote:
> On Tue, Oct 10, 2023 at 10:10:31AM -0300, Jason Gunthorpe wrote:
> > We've talked around ideas like allowing the VF config space to do some
> > of the work. For simple devices we could get away with 1 VF config
> > space register. (VF config space is owned by the hypervisor, not the
> > guest)
>
> Which assumes you're actually using VFs and not multiple PFs, which
> is a very limiting assumption.

? It doesn't matter VF/PF, the same function's config space could do
simple migration.

> It also limits you from actually using DMA during the live migration
> process, which again is a major limitation once you have a non-trivial
> amount of state.

Yes, this is a dealbreaker for big cases. But we do see several
smaller/simpler devices that don't use DMA in their migration.

> > SIOVr2 is discussing a more flexible RID mapping - there is a possible
> > route where a "VF" could actually have two RIDs, a hypervisor RID and a
> > guest RID.
>
> Well, then you go down the SIOV route, which requires a complex driver
> actually presenting the guest visible device anyway.

Yep

> Independent of my above points on the doubts on VF-controlled live
> migration for PCIe devices I absolutely agree with you that the Linux
> abstraction and user interface should be VF based. Which further
> reinforces my point that the VFIO driver for the controlled function
> (PF or VF) and the Linux driver for the controlling function (better
> be a PF in practice) must be very tightly integrated. And the best
> way to do that is to export the vfio nodes from the Linux driver
> that knows the hardware and not split it out into a separate one.

I'm not sure how we get to "very tightly integrated". We have many
examples of live migration vfio drivers now and they do not seem to
require tight integration. The PF driver only has to provide a way to
execute a small number of proxied operations.

Regardless, I'm not too fussed about what directory the implementation
lives in, though I do prefer the current arrangement where VFIO-only
stuff is in drivers/vfio. I like the process we have where subsystems
are responsible for the code that implements the subsystem ops.

> > However - the Intel GPU VFIO driver is such a bad experience I don't
> > want to encourage people to make VFIO drivers, or code that is only
> > used by VFIO drivers, that are not under drivers/vfio review.
>
> We can and should require vfio review for users of the vfio API.
> But to be honest code placement was not the problem with i915. The
> problem was that the mdev APIs (under drivers/vfio) were a complete
> trainwreck when it was written, and that the driver had a horrible
> hypervisor API abstraction.

E800 also made some significant security mistakes that the VFIO side
caught. I think they would have been missed if the driver had gone
through a netdev tree.

Even unrelated to mdev, the Intel GPU driver is still not using the
vfio side properly, and the way it hacked into KVM to try to get page
tracking is totally logically wrong (but Works For Me (tm)).

Aside from technical concerns, I do have a big process worry here.
vfio is responsible for the security side of the review of things
implementing its ops.

> > Be aware, there is a significant performance concern here. If you want
> > to create 1000 VFIO devices (this is a real thing), we *can't* probe a
> > normal driver first, it is too slow. We need a path that goes directly
> > from creating the RIDs to turning those RIDs into VFIO.
> And by calling the vfio functions from mlx5 you get this easily.

"easily" I don't know about that :)

> For mdev/SIOV like flows you must call vfio APIs from the main
> driver anyway, as there is no pci_dev to probe on anyway. That's
> what i915 does btw.

IMHO i915 is not a good example to copy.

mlx5 is already much closer to your ideal, and I would hold it up as
the right general direction for SIOV/mdev/etc, as we basically already
do a lot of the SIOV ideas.

mlx5 is a multi-subsystem device. It has driver components in net,
VDPA and infiniband. It can create non-PCI "functions". It is not
feasible, process wise, for all of this to live under one directory.
We *want* the driver split up by subsystem and subsystem maintainer.
So, we created the auxiliary_device stuff to manage this. It can do
what you are imagining, I think.

The core PF/VF driver is in charge of what to carve off to a subsystem
driver. IIRC mlx5 uses netlink to deliver commands to trigger this (eg
create a VDPA device). An auxiliary_device is created and the target
subsystem driver probes to that and autoloads. eg see
drivers/vdpa/mlx5/net/mlx5_vnet.c

They are not 'tightly coupled', the opposite really. The
auxiliary_device comes along with a mlx5 API that allows all the
subsystems to do what they need on the HW mostly independently. For
mlx5 this is mostly a way to execute FW RPC commands.

So, if you want to turn the VFIO stuff inside out, I'd still suggest
to have the VFIO driver part under drivers/vfio and probe to an
auxiliary_device that represents the aspect of the HW to turn into
VFIO (or VDPA, or whatever). The 'core' driver can provide an
appropriate API between its VFIO part and its core part.

We lack a common uAPI to trigger this creation, but otherwise the
infrastructure exists and works well now. It allows subsystems to
remain together and complex devices to spread their functionality to
multiple subsystems.

The current pci_iov_get_pf_drvdata() hack in VFIO is really a short
cut to doing the auxiliary_device stuff. (actually we tried to build
this with auxiliary_device first, it did not work out, needs more
driver core infrastructure). I can easily imagine all the current
VFIO drivers probing to an auxiliary_device and obtaining the VF
pci_dev and the handle for the core functionality directly, without
the pci_iov_get_pf_drvdata() approach.

> For "classic" vfio that requires a pci_dev (or $otherbus_dev) we need
> to have a similar flow. And I think the best way is to have the
> bus-level attribute on the device and/or a device-specific side band
> protocol to decide how new functions are probed. With that you
> avoid all the duplicate PCI IDs for the binding, and actually allow to
> sanely establish a communication channel between the functions.
> Because without that there is no way to know how any two functions
> are related. The driver might think they know, but there are all kinds
> of wacky PCI passthrough schemes that will break such a logic.

Yes, if things are not simple PF/VF then Linux struggles at the driver
core level. auxiliary_devices are a way out of that since one spot can
figure out how to assemble the multi-component device and then
delegate portions of the HW to other subsystems.

If something wants to probe its own driver to a PF/VF to assemble the
components it can do that and then bundle it up into an aux device and
trigger a VFIO/etc driver to run on that bundle of resources. We don't
*need* to put all the VFIO code someplace else to put the control over
slicing the HW into a shared core driver.
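To make the shape of this concrete, here is a rough sketch of what I
mean (hypothetical "foo" names throughout, not the real mlx5 or vfio
code, just illustrating the auxiliary bus pattern): the core PF driver
decides what to carve off and publishes it as an auxiliary_device
carrying the handles the VFIO part needs, and a driver living under
drivers/vfio binds to that:

/*
 * Rough sketch only - hypothetical "foo" driver, not real mlx5/vfio
 * code. Shows a core PF driver handing a slice of the HW to a VFIO
 * part via an auxiliary_device.
 */
#include <linux/auxiliary_bus.h>
#include <linux/module.h>
#include <linux/pci.h>
#include <linux/slab.h>

struct foo_core;	/* opaque core-driver handle, hypothetical */

struct foo_vfio_adev {
	struct auxiliary_device adev;
	struct foo_core *core;	/* for proxied FW commands */
	struct pci_dev *vf_dev;	/* the function being assigned */
};

static void foo_vfio_adev_release(struct device *dev)
{
	kfree(container_of(dev, struct foo_vfio_adev, adev.dev));
}

/* Core PF driver side: carve off a function and publish it */
static int foo_core_publish_vfio(struct foo_core *core,
				 struct pci_dev *vf_dev, int id)
{
	struct foo_vfio_adev *fadev;
	int ret;

	fadev = kzalloc(sizeof(*fadev), GFP_KERNEL);
	if (!fadev)
		return -ENOMEM;

	fadev->core = core;
	fadev->vf_dev = vf_dev;
	fadev->adev.name = "vfio";
	fadev->adev.id = id;
	fadev->adev.dev.parent = &vf_dev->dev;
	fadev->adev.dev.release = foo_vfio_adev_release;

	ret = auxiliary_device_init(&fadev->adev);
	if (ret) {
		kfree(fadev);
		return ret;
	}

	/* matches a driver bound to "foo.vfio" and autoloads it */
	ret = auxiliary_device_add(&fadev->adev);
	if (ret)
		auxiliary_device_uninit(&fadev->adev);
	return ret;
}

/* drivers/vfio/ side: gets the VF pci_dev and the core handle
 * straight from the aux device, no pci_iov_get_pf_drvdata()
 */
static int foo_vfio_probe(struct auxiliary_device *adev,
			  const struct auxiliary_device_id *id)
{
	struct foo_vfio_adev *fadev =
		container_of(adev, struct foo_vfio_adev, adev);

	/* build/register the vfio_device on fadev->vf_dev, using
	 * fadev->core for the small set of proxied PF operations
	 */
	return 0;
}

static const struct auxiliary_device_id foo_vfio_id_table[] = {
	{ .name = "foo.vfio" },
	{}
};
MODULE_DEVICE_TABLE(auxiliary, foo_vfio_id_table);

static struct auxiliary_driver foo_vfio_driver = {
	.name = "vfio",
	.probe = foo_vfio_probe,
	.id_table = foo_vfio_id_table,
};
module_auxiliary_driver(foo_vfio_driver);

The point being the VFIO part stays under drivers/vfio review while
the core driver keeps full control over how the HW gets sliced; the
missing piece is only the common uAPI to trigger something like
foo_core_publish_vfio().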
mlx5 and several other drivers already demonstrate all of this.

Jason