Re: [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2

Jason Gunthorpe <jgg@xxxxxxxxxx> · Tue, 1 Feb 2022 14:36:20 -0400

On Tue, Feb 01, 2022 at 10:04:08AM -0700, Alex Williamson wrote:

> Ok, let me parrot back to see if I understand.  -ENOTTY will be
> returned if the ioctl doesn't exist, in which case device_state is
> untouched and cannot be trusted.  At the same time, we expect the user
> to use the feature ioctl to make sure the ioctl exists, so it would
> seem that we've reclaimed that errno if we believe the user should
> follow the protocol.

I don't follow - the documentation says what the code does, if you get
ENOTTY returned then you don't get the device_state too. Saying the
user shouldn't have called it in the first place is completely
correct, but doesn't change the device_state output.

> +       if (!device->ops->migration_set_state)
> +               return -EOPNOTSUPP;
> 
> Should return -ENOTTY, just as the feature does.  

As far as I know the kernel 'standard' is:
 - ENOTTY if the ioctl cmd # itself is not understood
 - E2BIG if the ioctl arg is longer than the kernel understands
 - EOPNOTSUPP if the ioctl arg contains data the kernel doesn't
   understand (eg flags the kernel doesn't know about), or the
   kernel understands the request but cannot support it for some
   reason.
 - EINVAL if the ioctl arg contains data the kernel knows about but
   rejects (ie invalid combinations of flags)

VFIO kind of has its own thing, but I'm not entirely sure what the
rules are, eg you asked for EOPNOTSUPP in the other patch, and here we
are asking for ENOTTY?

But sure, lets make it ENOTTY.

> But it's also for future unsupported ops, but couldn't we also
> specify that the driver must fill final_state with the current
> device state for any such case.  We also have this:
> 
> +       if (set_state.argsz < minsz || set_state.flags)
> +               return -EOPNOTSUPP;
> 
> Which I think should be -EINVAL.

That would match the majority of other VFIO tests.

> That leaves -EFAULT, for example:
> 
> +       if (copy_from_user(&set_state, arg, minsz))
> +               return -EFAULT;
> 
> Should we be able to know the current device state in core code such
> that we can fill in device state here?

There is no point in doing a copy_to_user() to the same memory if a
copy_from_user() failed, so device_state will still not be returned.

We don't know the device_state in the core code because it can only be
read under locking that is controlled by the driver. I hope when we
get another driver merged that we can hoist the locking, but right now
I'm not really sure - it is a complicated lock.

> I think those changes would go a ways towards fully specified behavior
> instead of these wishy washy unreliable return values.  Then we could

Huh? It is fully specified already. These changes just removed
EOPNOTSUPP from the list where device_state isn't filled in. It is OK,
but it is not really different...

>  "If this function fails and returns -1 then..."
> 
> Could we clarify that to s/function/ioctl/?  It caused me a moment of
> confusion for the returned -errnos.

Sure.

> > > Should we be bumping a reference on the device FD such that we can't
> > > have outstanding migration FDs with the device closed (and
> > > re-assigned to a new user)?  
> > 
> > The driver must ensure any activity triggered by the migration FD
> > against the vfio_device is halted before close_device() returns, just
> > like basically everything else connected to open/close_device(). mlx5
> > does this by using the same EOF sanitizing the FSM logic uses.
> > 
> > Once sanitized the f_ops should not be touching the vfio_device, or
> > even have a pointer to it, so there is no reason to connect the two
> > FDs together. I'd say it is a red flag if a driver proposes to do
> > this, likely it means it has a problem with the open/close_device()
> > lifetime model.
> 
> Maybe we just need a paragraph somewhere to describe the driver
> responsibilities and expectations in managing the migration FD,
> including disconnecting it after end of stream and access relative to
> the open state of the vfio_device.  Seems an expanded descriptions
> somewhere near the declaration in vfio_device_ops would be appropriate.

Yes that is probably better than in the uapi header.

> > I'm not sure what the overall VFIO vision is here.. Are we abandoning
> > traditional ioctls in favour of a multiplexer? Calling the multiplexer
> > ioctl "feature" is a bit odd..
> 
> Is it really?  VF Token support is a feature that a device might have
> and we can use the same interface to probe that it exists as well as
> set the UUID token.  We're using it to manipulate the state of a device
> feature.
> 
> If we're only looking for a means to expose that a device has support
> for something, our options are a flag bit on the vfio_device_info or a
> capability on that ioctl.  It's arguable that the latter might be a
> better option for VFIO_DEVICE_FEATURE_MIGRATION since its purpose is
> only to return a flags field, ie. we're not interacting with a feature,
> we're exposing a capability with fixed properties.

I looked at this, and decided against it on practical reasons.

I've organized this so the core code can do more work for the driver,
which means the core code supplies the support info back to
userspace. VFIO_DEVICE_INFO is currently open coded in every single
driver and lifting that to get the same support looks like a huge
pain. Even if we try to work it backwards somehow, we'd need to
re-organize vfio-pci so other drivers can contribute to the cap chain -
which is another ugly looking thing.

On top of that, qemu becomes much less straightforward as we have to
piggy back on the existing vfio code instead of just doing a simple
ioctl to get back the small support info back. There is even an
unpleasing mandatory user/kernel memory allocation and double ioctl in
the caps path.

The feature approach is much better, it has a much cleaner
implementation in user/kernel. I think we should focus on it going
forward and freeze caps.

> > It complicates the user code a bit, it is more complicated to invoke the
> > VFIO_DEVICE_FEATURE (check the qemu patch to see the difference).
> 
> Is it really any more than some wrapper code?  Are there objections to
> this sort of multiplexer?

There isn't too much reason to do this kind of stuff. Each subsystem
gets something like 4 million ioctl numbers within its type, we will
never run out of unique ioctls.

Normal ioctls have a nice simplicity to them, adding layers creates
complexity, feature is defiantly more complex to implement, and cap
is a whole other level of more complex. None of this is necessary.

I don't know what "cluttering" means here, I'd prefer we focus on
things that give clean code and simple implementations than arbitary
aesthetics.

> > Either way I don't have a strong opinion, please have a think and let
> > us know which you'd like to follow.
> 
> I'm leaning towards a capability for migration support flags and a
> feature for setting the state, but let me know if this looks like a bad
> idea for some reason.  Thanks,

I don't want to touch capabilities, but we can try to use feature for
set state. Please confirm this is what you want.

You'll want the same for the PRE_COPY related information too?

If we are into these very minor nitpicks does this mean you are OK
with all the big topics now?

Jason