Re: [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2

Jason Gunthorpe <jgg@xxxxxxxxxx> · Mon, 31 Jan 2022 20:31:24 -0400

On Mon, Jan 31, 2022 at 04:43:18PM -0700, Alex Williamson wrote:
> On Sun, 30 Jan 2022 18:08:19 +0200
> Yishai Hadas <yishaih@xxxxxxxxxx> wrote:
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index ef33ea002b0b..d9162702973a 100644
> > +++ b/include/uapi/linux/vfio.h
> > @@ -605,10 +605,10 @@ struct vfio_region_gfx_edid {
> >  
> >  struct vfio_device_migration_info {
> >  	__u32 device_state;         /* VFIO device state */
> > -#define VFIO_DEVICE_STATE_STOP      (0)
> > -#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> > -#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> > -#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> > +#define VFIO_DEVICE_STATE_V1_STOP      (0)
> > +#define VFIO_DEVICE_STATE_V1_RUNNING   (1 << 0)
> > +#define VFIO_DEVICE_STATE_V1_SAVING    (1 << 1)
> > +#define VFIO_DEVICE_STATE_V1_RESUMING  (1 << 2)
> 
> I assume the below is kept until we rip out all the references, but I'm
> not sure why we're bothering to define V1 that's not used anywhere
> versus just deleting the above to avoid collision with the new enum.

I felt adding the deletion made this patch too big so I shoved it into
its own patch after the v2 stuff is described. The rename here is only
because we end up with a naming conflict with the enum below.

> > + * If this function fails and returns -1 then the device_state is updated with
> > + * the current state the device is in. This may be the original operating state
> > + * or some other state along the combination transition path. The user can then
> > + * decide if it should execute a VFIO_DEVICE_RESET, attempt to return to the
> > + * original state, or attempt to return to some other state such as RUNNING or
> > + * STOP. If errno is set to EOPNOTSUPP, EFAULT or ENOTTY then the device_state
> > + * output is not reliable.
> 
> I haven't made it through the full series yet, but it's not clear to me
> why these specific errnos are being masked above.

Basically, we can't return the device_state unless we properly process
the ioctl. E.g. old kernels that do not support this will return ENOTTY
and will not update it. If userspace messed up the pointer, EFAULT will
be returned and it will not be updated. Finally, EOPNOTSUPP is a generic
escape for any future reason the kernel might not want to update it.
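
To make the rule concrete, here is a minimal userspace sketch of the
check described above. The helper names are hypothetical and are not
part of the proposed uAPI; the only thing taken from the patch is the
list of errnos for which device_state is documented as unreliable.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): after a failed set-state
 * ioctl, report whether the device_state output may be trusted.
 * ENOTTY, EFAULT and EOPNOTSUPP mean the kernel did not update it.
 */
static bool mig_state_output_is_reliable(int err)
{
	switch (err) {
	case ENOTTY:      /* old kernel, ioctl not implemented */
	case EFAULT:      /* bad user pointer, nothing copied back */
	case EOPNOTSUPP:  /* generic "kernel chose not to update it" */
		return false;
	default:
		return true;
	}
}

/*
 * Hypothetical helper: choose the state value to log after a failure.
 * Falls back to the caller's last known state when the output is stale.
 */
static uint32_t mig_state_to_log(int err, uint32_t reported,
				 uint32_t last_known)
{
	return mig_state_output_is_reliable(err) ? reported : last_known;
}
```

A caller would pass the errno saved right after the failed ioctl, and
treat the result as debugging information only.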

In practice, I found no use for the device_state in the error path in
qemu, but it seemed useful for debugging.

> > + * If the new_state starts a new data transfer session then the FD associated
> > + * with that session is returned in data_fd. The user is responsible to close
> > + * this FD when it is finished. The user must consider the migration data
> > + * segments carried over the FD to be opaque and non-fungible. During RESUMING,
> > + * the data segments must be written in the same order they came out of the
> > + * saving side FD.
> 
> The lifecycle of this FD is a little sketchy.  The user is responsible
> to close the FD, are they required to?

No. Detecting this in the kernel would be notable added complexity to
the drivers.

Let's clarify it:

 "close this FD when it no longer has data to
 read/write. data_fds are not re-used, every data transfer session gets
 a new FD."

?
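
From the user side, that lifecycle could look roughly like the sketch
below: read the session's FD until EOF, forward the opaque bytes in
order, then close it and never reuse it. The function name and the
out_fd destination are assumptions for illustration; nothing here is
vfio-specific beyond the close-when-done rule.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy everything from one session's data_fd to
 * out_fd in order, then close data_fd. Returns the number of bytes
 * copied, or -1 on error. data_fd is closed in both cases, since a
 * later session gets a fresh FD anyway.
 */
static ssize_t drain_and_close(int data_fd, int out_fd)
{
	char buf[4096];
	ssize_t total = 0;

	for (;;) {
		ssize_t n = read(data_fd, buf, sizeof(buf));

		if (n < 0) {
			if (errno == EINTR)
				continue;
			total = -1;
			break;
		}
		if (n == 0)	/* EOF: no more data in this session */
			break;

		for (ssize_t off = 0; off < n; ) {
			ssize_t w = write(out_fd, buf + off, (size_t)(n - off));

			if (w < 0) {
				if (errno == EINTR)
					continue;
				total = -1;
				goto out;
			}
			off += w;
		}
		total += n;
	}
out:
	close(data_fd);
	return total;
}
```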

> ie. should the migration driver fail transitions if there's an
> outstanding FD?

No, the driver should orphan that FD and use a fresh new one the next
cycle. mlx5 will sanitize the FD, free all the memory, and render it
inoperable which I'd view as best practice.

> Should the core code mangle the f_ops or force an EOF or in some
> other way disconnect the FD to avoid driver bugs/exploits with users
> poking stale FDs?  

We looked at swapping f_ops of a running fd for the iommufd project
and decided it was not allowed/desired. It needs locking.

Here the driver should piggyback the forced EOF on its own existing
locking, the one that already protects concurrent read/write, like
mlx5 did. It is straightforward.
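
As a rough userspace model of that pattern (not the mlx5 code, and all
names here are made up), the same lock that serializes read() also
guards a "disabled" flag. Disabling frees the data and makes every
later read see EOF. Whether a real driver returns 0 or an error code
once disabled is a driver choice this sketch does not try to settle.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical stand-in for a driver's per-session migration file. */
struct mig_file {
	pthread_mutex_t lock;	/* protects every field below */
	bool disabled;
	char *data;
	size_t len;
	size_t pos;
};

static int mig_file_init(struct mig_file *f, const char *src, size_t len)
{
	f->data = malloc(len ? len : 1);
	if (!f->data)
		return -1;
	memcpy(f->data, src, len);
	f->len = len;
	f->pos = 0;
	f->disabled = false;
	pthread_mutex_init(&f->lock, NULL);
	return 0;
}

/* Read path: a disabled file reports EOF (0) without touching data. */
static ssize_t mig_file_read(struct mig_file *f, char *buf, size_t count)
{
	ssize_t done = 0;

	pthread_mutex_lock(&f->lock);
	if (!f->disabled) {
		size_t left = f->len - f->pos;

		if (count > left)
			count = left;
		memcpy(buf, f->data + f->pos, count);
		f->pos += count;
		done = (ssize_t)count;
	}
	pthread_mutex_unlock(&f->lock);
	return done;
}

/* Sanitize: taken under the same lock, so it cannot race with a read. */
static void mig_file_disable(struct mig_file *f)
{
	pthread_mutex_lock(&f->lock);
	f->disabled = true;
	free(f->data);
	f->data = NULL;
	f->len = 0;
	f->pos = 0;
	pthread_mutex_unlock(&f->lock);
}
```

The point is only that no f_ops swap is needed: a stale FD keeps its
ops, but they find nothing left to operate on.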

> Should we be bumping a reference on the device FD such that we can't
> have outstanding migration FDs with the device closed (and
> re-assigned to a new user)?

The driver must ensure any activity triggered by the migration FD
against the vfio_device is halted before close_device() returns, just
like basically everything else connected to open/close_device(). mlx5
does this by using the same EOF sanitizing that the FSM logic uses.

Once sanitized, the f_ops should not be touching the vfio_device, or
even have a pointer to it, so there is no reason to connect the two
FDs together. I'd say it is a red flag if a driver proposes to do
this. Likely it means the driver has a problem with the
open/close_device() lifetime model.

> > + * Setting device_state to VFIO_DEVICE_STATE_ERROR will always fail with EINVAL,
> > + * and take no action. However the device_state will be updated with the current
> > + * value.
> > + *
> > + * Return: 0 on success, -1 and errno set on failure.
> > + */
> > +struct vfio_device_mig_set_state {
> > +	__u32 argsz;
> > +	__u32 device_state;
> > +	__s32 data_fd;
> > +	__u32 flags;
> > +};
> 
> argsz and flags layout is inconsistent with all other vfio ioctls.

OK

> 
> > +
> > +#define VFIO_DEVICE_MIG_SET_STATE _IO(VFIO_TYPE, VFIO_BASE + 21)
> 
> Did you consider whether this could also be implemented as a
> VFIO_DEVICE_FEATURE?  Seems the feature struct would just be
> device_state and data_fd.  Perhaps there's a use case for GET as well.
> Thanks,

Only briefly..

I'm not sure what the overall VFIO vision is here.. Are we abandoning
traditional ioctls in favour of a multiplexer? Calling the multiplexer
ioctl "feature" is a bit odd..

It complicates the user code a bit: invoking VFIO_DEVICE_FEATURE takes
more work than a plain ioctl (check the qemu patch to see the difference).

Either way I don't have a strong opinion, please have a think and let
us know which you'd like to follow.

Thanks,
Jason