Re: [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2

Alex Williamson <alex.williamson@xxxxxxxxxx> · Tue, 1 Feb 2022 10:04:08 -0700

On Mon, 31 Jan 2022 20:31:24 -0400
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Mon, Jan 31, 2022 at 04:43:18PM -0700, Alex Williamson wrote:
> > On Sun, 30 Jan 2022 18:08:19 +0200
> > Yishai Hadas <yishaih@xxxxxxxxxx> wrote:  
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index ef33ea002b0b..d9162702973a 100644
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -605,10 +605,10 @@ struct vfio_region_gfx_edid {
> > >  
> > >  struct vfio_device_migration_info {
> > >  	__u32 device_state;         /* VFIO device state */
> > > -#define VFIO_DEVICE_STATE_STOP      (0)
> > > -#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> > > -#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> > > -#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> > > +#define VFIO_DEVICE_STATE_V1_STOP      (0)
> > > +#define VFIO_DEVICE_STATE_V1_RUNNING   (1 << 0)
> > > +#define VFIO_DEVICE_STATE_V1_SAVING    (1 << 1)
> > > +#define VFIO_DEVICE_STATE_V1_RESUMING  (1 << 2)  
> > 
> > I assume the below is kept until we rip out all the references, but I'm
> > not sure why we're bothering to define V1 that's not used anywhere
> > versus just deleting the above to avoid collision with the new enum.  
> 
> I felt adding the deletion made this patch too big so I shoved it into
> its own patch after the v2 stuff is described. The rename here is only
> because we end up with a naming conflict with the enum below.

Right, but we could just as easily delete the above 4 lines here to
avoid the conflict rather than renaming them to V1.

> > > + * If this function fails and returns -1 then the device_state is updated with
> > > + * the current state the device is in. This may be the original operating state
> > > + * or some other state along the combination transition path. The user can then
> > > + * decide if it should execute a VFIO_DEVICE_RESET, attempt to return to the
> > > + * original state, or attempt to return to some other state such as RUNNING or
> > > + * STOP. If errno is set to EOPNOTSUPP, EFAULT or ENOTTY then the device_state
> > > + * output is not reliable.  
> > 
> > I haven't made it through the full series yet, but it's not clear to me
> > why these specific errnos are being masked above.  
> 
> Basically, we can't return the device_state unless we properly process
> the ioctl. Eg old kernels that do not support this will return ENOTTY
> and will not update it. If userspace messed up the pointer EFAULT will
> be returned and it will not be updated, finally EOPNOTSUPP is a generic
> escape for any future reason the kernel might not want to update it.
> 
> In practice, I found no use for using the device_state in the error
> path in qemu, but it seemed useful for debugging.

Ok, let me parrot back to see if I understand.  -ENOTTY will be
returned if the ioctl doesn't exist, in which case device_state is
untouched and cannot be trusted.  At the same time, we expect the user
to use the feature ioctl to make sure the ioctl exists, so it would
seem that we've reclaimed that errno if we believe the user should
follow the protocol.

-EOPNOTSUPP is returned if the driver doesn't support migration (which
should be invalid based on the protocol), ie. this:

+       if (!device->ops->migration_set_state)
+               return -EOPNOTSUPP;

Should return -ENOTTY, just as the feature does.  -EOPNOTSUPP is also
reserved for future unsupported ops, but couldn't we specify that the
driver must fill final_state with the current device state for any such
case?
We also have this:

+       if (set_state.argsz < minsz || set_state.flags)
+               return -EOPNOTSUPP;

Which I think should be -EINVAL.
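
To make the errno split I'm suggesting concrete, here's a rough
userspace mock of the entry checks.  The struct, the ops table and
check_set_state_args() are made-up names for illustration, not the code
from the series; only the errno choices reflect the suggestion above.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the proposed uapi struct, illustration only. */
struct mig_set_state_sketch {
	uint32_t argsz;
	uint32_t device_state;
	int32_t data_fd;
	uint32_t flags;
};

/* Hypothetical stand-in for the driver ops table. */
struct ops_sketch {
	int (*migration_set_state)(uint32_t new_state);
};

/* Trivial callback so the example has a "migration capable" driver. */
static int dummy_set_state(uint32_t new_state)
{
	(void)new_state;
	return 0;
}

/*
 * Returns 0 if the request may proceed, otherwise a negative errno:
 *   -ENOTTY  driver has no migration support (same as the feature ioctl)
 *   -EINVAL  short argsz or unknown flags
 */
static int check_set_state_args(const struct ops_sketch *ops,
				const struct mig_set_state_sketch *arg)
{
	size_t minsz = offsetof(struct mig_set_state_sketch, flags) +
		       sizeof(arg->flags);

	if (!ops->migration_set_state)
		return -ENOTTY;
	if (arg->argsz < minsz || arg->flags)
		return -EINVAL;
	return 0;
}
```

With that split, -EOPNOTSUPP is left free for cases where the driver
really did look at the request.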

That leaves -EFAULT, for example:

+       if (copy_from_user(&set_state, arg, minsz))
+               return -EFAULT;

Should we be able to know the current device state in core code such
that we can fill in device state here?

I think those changes would go a ways towards fully specified behavior
instead of these wishy washy unreliable return values.  Then we could
also get rid of this paranoia protection of those errnos:

+       if (IS_ERR(filp)) {
+               if (WARN_ON(PTR_ERR(filp) == -EOPNOTSUPP ||
+                           PTR_ERR(filp) == -ENOTTY ||
+                           PTR_ERR(filp) == -EFAULT))
+                       filp = ERR_PTR(-EINVAL);
+               goto out_copy;
+       }
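
From the user side, the wording as proposed implies a check along these
lines before looking at the returned device_state.  This is only a
sketch of my reading of the comment; state_output_reliable() is a
hypothetical helper name, not something in the series or in QEMU.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * After a failed ioctl, the proposed text says device_state is only
 * meaningful if errno is not one of the three "not reliable" values.
 */
static bool state_output_reliable(int err)
{
	switch (err) {
	case EOPNOTSUPP:
	case EFAULT:
	case ENOTTY:
		return false;
	default:
		return true;
	}
}
```

If the behavior were fully specified, most of this special casing would
go away.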

Also, the original text of this uapi paragraph reads:

 "If this function fails and returns -1 then..."

Could we clarify that to s/function/ioctl/?  It caused me a moment of
confusion for the returned -errnos.

> > > + * If the new_state starts a new data transfer session then the FD associated
> > > + * with that session is returned in data_fd. The user is responsible to close
> > > + * this FD when it is finished. The user must consider the migration data
> > > + * segments carried over the FD to be opaque and non-fungible. During RESUMING,
> > > + * the data segments must be written in the same order they came out of the
> > > + * saving side FD.  
> > 
> > The lifecycle of this FD is a little sketchy.  The user is responsible
> > to close the FD, are they required to?  
> 
> No. Detecting this in the kernel would be notable added complexity to
> the drivers.
> 
> Let's clarify it:
> 
>  "close this FD when it no longer has data to
>  read/write. data_fds are not re-used, every data transfer session gets
>  a new FD."
> 
> ?

Better

> > ie. should the migration driver fail transitions if there's an
> > outstanding FD?  
> 
> No, the driver should orphan that FD and use a fresh new one the next
> cycle. mlx5 will sanitize the FD, free all the memory, and render it
> inoperable which I'd view as best practice.

Agreed, can we add a second sentence to the above clarification to
outline those driver responsibilities?

> > Should the core code mangle the f_ops or force an EOF or in some
> > other way disconnect the FD to avoid driver bugs/exploits with users
> > poking stale FDs?    
> 
> We looked at swapping f_ops of a running fd for the iommufd project
> and decided it was not allowed/desired. It needs locking.
> 
> Here the driver should piggy back the force EOF using its own existing
> locking protecting concurrent read/write, like mlx5 did. It is
> straightforward.

Right, sounded ugly but I thought I'd toss it out.  If we define it as
the driver's responsibility, I think I'm ok.
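
For the record, this is roughly how I picture "the driver disconnects
the FD under its own read/write lock".  It's a userspace mock with a
pthread mutex, all names are hypothetical, and returning -ENODEV after
the disconnect is my choice for the sketch (a driver could just as well
report EOF); it is not lifted from the mlx5 patches.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical per-session state behind one migration data FD. */
struct mig_file_sketch {
	pthread_mutex_t lock;	/* the lock that already serializes read/write */
	bool disabled;		/* set once the session has been orphaned */
	const char *data;
	size_t len;
	size_t pos;
};

/* Called when the session ends or the device is being closed. */
static void mig_file_disable(struct mig_file_sketch *mf)
{
	pthread_mutex_lock(&mf->lock);
	mf->disabled = true;
	mf->data = NULL;	/* drop the buffer reference */
	mf->len = 0;
	mf->pos = 0;
	pthread_mutex_unlock(&mf->lock);
}

/* read() path: once disabled it fails without touching device state. */
static ssize_t mig_file_read(struct mig_file_sketch *mf, char *buf, size_t n)
{
	ssize_t ret;

	pthread_mutex_lock(&mf->lock);
	if (mf->disabled) {
		ret = -ENODEV;
	} else {
		size_t left = mf->len - mf->pos;

		if (n > left)
			n = left;
		memcpy(buf, mf->data + mf->pos, n);
		mf->pos += n;
		ret = (ssize_t)n;
	}
	pthread_mutex_unlock(&mf->lock);
	return ret;
}
```

The point is only that a stale FD stays safe to poke at without the
core having to swap f_ops.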

> > Should we be bumping a reference on the device FD such that we can't
> > have outstanding migration FDs with the device closed (and
> > re-assigned to a new user)?  
> 
> The driver must ensure any activity triggered by the migration FD
> against the vfio_device is halted before close_device() returns, just
> like basically everything else connected to open/close_device(). mlx5
> does this by using the same EOF sanitizing the FSM logic uses.
> 
> Once sanitized the f_ops should not be touching the vfio_device, or
> even have a pointer to it, so there is no reason to connect the two
> FDs together. I'd say it is a red flag if a driver proposes to do
> this, likely it means it has a problem with the open/close_device()
> lifetime model.

Maybe we just need a paragraph somewhere to describe the driver
responsibilities and expectations in managing the migration FD,
including disconnecting it after end of stream and access relative to
the open state of the vfio_device.  Seems an expanded description
somewhere near the declaration in vfio_device_ops would be appropriate.

> > > + * Setting device_state to VFIO_DEVICE_STATE_ERROR will always fail with EINVAL,
> > > + * and take no action. However the device_state will be updated with the current
> > > + * value.
> > > + *
> > > + * Return: 0 on success, -1 and errno set on failure.
> > > + */
> > > +struct vfio_device_mig_set_state {
> > > +	__u32 argsz;
> > > +	__u32 device_state;
> > > +	__s32 data_fd;
> > > +	__u32 flags;
> > > +};  
> > 
> > argsz and flags layout is inconsistent with all other vfio ioctls.  
> 
> OK
> 
> >   
> > > +
> > > +#define VFIO_DEVICE_MIG_SET_STATE _IO(VFIO_TYPE, VFIO_BASE + 21)  
> > 
> > Did you consider whether this could also be implemented as a
> > VFIO_DEVICE_FEATURE?  Seems the feature struct would just be
> > device_state and data_fd.  Perhaps there's a use case for GET as well.
> > Thanks,  
> 
> Only briefly..
> 
> I'm not sure what the overall VFIO vision is here.. Are we abandoning
> traditional ioctls in favour of a multiplexer? Calling the multiplexer
> ioctl "feature" is a bit odd..

Is it really?  VF Token support is a feature that a device might have
and we can use the same interface to probe that it exists as well as
set the UUID token.  We're using it to manipulate the state of a device
feature.

If we're only looking for a means to expose that a device has support
for something, our options are a flag bit on the vfio_device_info or a
capability on that ioctl.  It's arguable that the latter might be a
better option for VFIO_DEVICE_FEATURE_MIGRATION since its purpose is
only to return a flags field, ie. we're not interacting with a feature,
we're exposing a capability with fixed properties.

However as we move to MIG_SET_STATE, well now we are interacting with a
feature of the device and there's really nothing unique about the
calling convention that would demand that we define a stand alone ioctl.

> It complicates the user code a bit, it is more complicated to invoke the
> VFIO_DEVICE_FEATURE (check the qemu patch to see the difference).

Is it really any more than some wrapper code?  Are there objections to
this sort of multiplexer?  As I was working on the VF Token support, it
felt like a fairly small device feature and I didn't want to set a
precedent of cluttering our ioctl space with every niche little
feature.  The s390 folks have some proposals on list for using features
and I'm tempted to suggest it to Abhishek as well for their
implementation of D3cold support.
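
By "wrapper code" I mean something on the order of the sketch below.
The header layout follows the argsz/flags/data[] convention of the
feature ioctl, but the types are local copies so it builds without
kernel headers, the payload struct is my guess at "just device_state
and data_fd", and the feature index is a placeholder, not an allocated
value.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-in for the feature ioctl header. */
struct feature_hdr_sketch {
	uint32_t argsz;
	uint32_t flags;		/* low 16 bits: feature index, upper bits: op */
	uint8_t data[];
};

#define FEATURE_SET_SKETCH		(1u << 17)	/* "set" operation bit */
#define FEATURE_MIG_STATE_SKETCH	0x7fu		/* placeholder index */

/* Hypothetical payload for a set-state feature. */
struct mig_state_payload_sketch {
	uint32_t device_state;
	int32_t data_fd;
};

/*
 * Build the buffer a caller would hand to the feature ioctl.  The
 * caller owns the result and must free() it.  Returns NULL on
 * allocation failure.
 */
static struct feature_hdr_sketch *build_set_state(uint32_t new_state)
{
	struct mig_state_payload_sketch p = {
		.device_state = new_state,
		.data_fd = -1,	/* filled in by the kernel on return */
	};
	size_t sz = sizeof(struct feature_hdr_sketch) + sizeof(p);
	struct feature_hdr_sketch *f = calloc(1, sz);

	if (!f)
		return NULL;
	f->argsz = (uint32_t)sz;
	f->flags = FEATURE_SET_SKETCH | FEATURE_MIG_STATE_SKETCH;
	memcpy(f->data, &p, sizeof(p));
	return f;
}
```

That's the whole cost on the user side as far as I can tell, and a GET
variant would reuse the same helper with a different flag.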

> Either way I don't have a strong opinion, please have a think and let
> us know which you'd like to follow.

I'm leaning towards a capability for migration support flags and a
feature for setting the state, but let me know if this looks like a bad
idea for some reason.  Thanks,

Alex