Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

Jason Gunthorpe <jgg@xxxxxxxx> · Wed, 7 Dec 2022 09:34:14 -0400

On Wed, Dec 07, 2022 at 08:54:15AM +0100, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 03:15:41PM -0400, Jason Gunthorpe wrote:
> > What the kernel is doing is providing the abstraction to link the
> > controlling function to the VFIO device in a general way.
> > 
> > We don't want to just punt this problem to user space and say 'good
> > luck finding the right cdev for migration control'. If the kernel
> > struggles to link them then userspace will not fare better on its own.
> 
> Yes.  But the right interface for that is to issue the userspace
> commands for anything that is not normal PCIe function level
> to the controlling funtion, and to discover the controlled functions
> based on the controlling functions.

I don't think we should mix up how the HW works, what PCI function the
commands are executed at, with how the uAPI is designed.

The VFIO design assumes that the "vfio migration driver" will talk to
both functions under the hood, and I don't see a fundamental problem
with this beyond it being awkward with the driver core.

It is done this way deliberately. When we did the work on it we found
there are problems with some device models, like when you suspend them
and then wrongly MMIO touch them they can trigger AER and maybe even a
machine check crash. One of the roles of the VFIO driver is to fix
these HW bugs and make the uAPI safe. Eg by revoking mmaps or
whatever.

Even more importantly, we do not want migration to ever be operated
unless VFIO is in control of the device. In general, migration resume
basically allows userspace to command the device to do effectively any
DMA. This is a kernel security problem if the device is not
constrained by a user iommu_domain - for security we must not allow
userspace to resume a VF that is under kernel control and potentially
linked to an passthrough iommu_domain.

VFIO provides the security model to make all of this safe - the two
concepts cannot become disconnected. Even if we create a new migration
uAPI it just means that the nvme driver has to be awkwardly aware of
VFIO VF drivers and interlock with their state, and the uAPI is
useless unless you already have a VFIO FD open.

Even the basic assumption that there would be a controlling/controlled
relationship is not universally true. The mdev type drivers, and 
SIOV-like devices are unlikely to have that. Once you can use PASID
the reasons to split things at the HW level go away, and a VF could
certainly self-migrate.

VFIO's goal is to abstract all of the above stuff. You get one char
device that inherently provides the security guarentees required to
operate migration. It provides all the necessary hooks to fix up HW
issues, which so far every merged device has:

 - mlx5 has a weird issue where FLR on the VF resets the migration
   context, we fix that in the VFIO driver (in mlx5 even though the
   commands are issued via the PF they are logically executed inside
   the VF context)

 - hi-silicon has security problems because it doesn't have
   controlling/controlled, so it needs to carve out BAR MMIO maps and
   other stuff

So while I agree that, in principle, migration and the VFIO VF are
seperate concerns our broken reality makes them linked.

Even the idea that started this thread - that PF/PF could be a problem
- seems to have been explored by the Pensando RFC which is using the
aux devices to connect arbitrary PF/PF for their model.

The logical model we have been using on complex multi-function devices
(like every DPU driver) has been to split the driver into subsystem
local code and thread all the pieces together with aux devices.

The model has a PCI driver that operates the lowest level of the
device, eg the 'admin queue' and then aux devices create
subsystem-local drivers (netdev, rdma, vdpa, iscsi, vfio, etc, etc)
that ride on the common API exported by the PCI driver.

So, when you see both Intel and Pensando proposing this kind of
layered model for NVMe where migration is subsystem-local to VFIO, I
think this is where the inspiration is coming from. Their native DPU
drivers already work this way.

Jason