On Fri, Oct 06, 2023 at 06:09:09AM -0700, Christoph Hellwig wrote:
> On Thu, Oct 05, 2023 at 08:10:04AM -0300, Jason Gunthorpe wrote:
> > > But for all the augmented vfio use cases it doesn't, for them the
> > > augmented vfio functionality is an integral part of the core driver.
> > > That is true for nvme, virtio and I'd argue mlx5 as well.
> >
> > I don't agree with this. I see the extra functionality as being an
> > integral part of the VF and VFIO. The PF driver is only providing a
> > proxied communication channel.
> >
> > It is a limitation of PCI that the PF must act as a proxy.
>
> For anything live migration it very fundamentally is not, as a function
> that is visible to a guest by definition can't drive the migration
> itself. That isn't really a limitation in PCI, but follows from the
> fact that something else must control a live migration that is
> transparent to the guest.

We've talked around ideas like allowing the VF config space to do some
of the work. For simple devices we could get away with one VF config
space register. (VF config space is owned by the hypervisor, not the
guest.)

Devices that need DMA as part of their migration could be imagined to
co-opt a VF PASID or something, e.g. using ENQCMD.

SIOVr2 is discussing a more flexible RID mapping - there is a possible
route where a "VF" could actually have two RIDs, a hypervisor RID and a
guest RID.

It really is PCI limitations that force this design of making a PF
driver do dual duty as a fully functional normal device and as a
communication channel proxy providing a back channel into a SRIOV VF.

My view has always been that the VFIO live migration operations are
executed logically within the VF, as they only affect the VF.

So we have a logical design separation where the VFIO world owns the
commands and the PF driver supplies the communication channel. This
works well for devices that already have a robust RPC interface to
their device FW.

> > ?? We must bind something to the VF's pci_driver, what do you imagine
> > that is?
>
> The driver that knows this hardware. In this case the virtio subsystem,
> in case of nvme the nvme driver, and in case of mlx5 the mlx5 driver.

But those are drivers operating the HW to create kernel devices. Here
we need a VFIO device. They can't co-exist; if you switch mlx5 from
normal to vfio you have to tear down the entire normal driver.

> > > E.g. for this case there should be no new vfio-virtio device, but
> > > instead you should be able to switch the virtio device to a
> > > fake-legacy vfio mode.
> >
> > Are you arguing about how we reach vfio_register_XX() and what
> > directory the file lives in?
>
> No. That layout logically follows from what codebase the functionality
> is part of, though.

I don't really understand what we are talking about then. Where do you
imagine the vfio_register_XX() goes?

> > I don't know what "fake-legacy" even means, VFIO is not legacy.
>
> The driver we're talking about in this thread fakes up a virtio_pci
> legacy device to the guest on top of a "modern" virtio_pci device.

I'm not sure I'd use the word fake; inb/outb are always trapped
operations in VMs. If the device provided a real IO BAR then VFIO
common code would trap and relay inb/outb to the device.

All this is doing is changing the inb/outb relay from using a physical
IO BAR to a DMA command ring.

The motivation is simply that normal IO BAR space is incredibly limited
and you can't get enough SRIOV functions when using it.
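
To make that concrete, the shape of the relay inside the VF's VFIO
variant driver is roughly the below. This is only a sketch to
illustrate the trap-and-relay point; virtiovf_legacy_io_rw() and
virtiovf_issue_legacy_rw_cmd() are made-up names standing in for
whatever admin command the PF driver ends up exposing, not the actual
proposed API:

#include <linux/uaccess.h>
#include <linux/vfio_pci_core.h>

/*
 * Service a trapped inb/outb access to the synthetic legacy IO BAR by
 * relaying it to device FW over the PF's command channel instead of a
 * physical IO BAR.  virtiovf_issue_legacy_rw_cmd() is a placeholder.
 */
static ssize_t virtiovf_legacy_io_rw(struct vfio_pci_core_device *vdev,
				     char __user *buf, size_t count,
				     loff_t pos, bool iswrite)
{
	u8 data[8] = {};

	if (count > sizeof(data))
		return -EINVAL;

	if (iswrite && copy_from_user(data, buf, count))
		return -EFAULT;

	/* Relay the trapped access through the PF's admin queue */
	if (virtiovf_issue_legacy_rw_cmd(vdev, pos, data, count, iswrite))
		return -EIO;

	if (!iswrite && copy_to_user(buf, data, count))
		return -EFAULT;

	return count;
}

From VFIO's point of view nothing changes, only the backend of the
relay does.
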
> > There is a lot of code in VFIO and the VMM side to take a VF and turn
> > it into a vPCI function. You can't just trivially duplicate VFIO in a
> > dozen drivers without creating a giant mess.
>
> I do not advocate for duplicating it. But the code that calls this
> functionality belongs into the driver that deals with the compound
> device that we're doing this work for.

On one hand, I don't really care - we can put the code where people
like. However, the Intel GPU VFIO driver is such a bad experience that
I don't want to encourage people to make VFIO drivers, or code that is
only used by VFIO drivers, that are not under drivers/vfio review.

> > Further, userspace wants consistent ways to operate this stuff. If we
> > need a dozen ways to activate VFIO for every kind of driver that is
> > not a positive direction.
>
> We don't need a dozen ways. We just need a single attribute on the
> pci (or $OTHERBUS) device that switches it to vfio mode.

Well, we sort of do have that these days; it is just a convoluted bind
thing (sketched in the P.S. below). Realistically switching modes
requires unprobing the entire normal VF driver. Having this be linked
to the driver core probe/unprobe flows is a good code reuse thing,
IMHO.

We already spent a lot of effort making this quite general from the
userspace perspective. Nobody has yet come up with an idea to avoid
the ugly unbind/bind flow.

Be aware, there is a significant performance concern here. If you want
to create 1000 VFIO devices (this is a real thing), we *can't* probe a
normal driver first; it is too slow. We need a path that goes directly
from creating the RIDs to turning those RIDs into VFIO.

mlx5 takes *seconds* to complete its normal probe. We must avoid this.

Looking a few years into the future, with SIOVr1/2, the flow I want to
target is some uAPI commands:

 'create a PCI RID with params XYZ and attach a normal/VFIO/etc driver'
 'destroy a PCI RID'

We need to get away from this SRIOV scheme where you bulk create a
bunch of empty VFs at one time and then have to somehow provision them.

Jason
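
P.S. For completeness, the "convoluted bind thing" today is just a
couple of sysfs writes. A minimal userspace sketch, assuming an
example VF at 0000:08:00.2 and an example variant driver name; both
are placeholders, adjust for the real device:

#include <stdio.h>

static void sysfs_write(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return;
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	const char *bdf = "0000:08:00.2";	/* example VF */
	const char *drv = "virtio-vfio-pci";	/* example variant driver */
	char path[128];

	/* 1. Tear down the whole normal VF driver */
	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%s/driver/unbind", bdf);
	sysfs_write(path, bdf);

	/* 2. Make the next probe match the VFIO variant driver instead */
	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%s/driver_override", bdf);
	sysfs_write(path, drv);

	/* 3. Ask the driver core to re-probe the device */
	sysfs_write("/sys/bus/pci/drivers_probe", bdf);
	return 0;
}

Step 1 is what hurts at scale - the normal driver has to be fully torn
down (or probed in the first place) before the VFIO driver can bind.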