Re: [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2

Jason Gunthorpe <jgg@xxxxxxxxxx> · Tue, 1 Feb 2022 20:24:59 -0400

On Tue, Feb 01, 2022 at 02:49:16PM -0700, Alex Williamson wrote:
> On Tue, 1 Feb 2022 14:36:20 -0400
> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> 
> > On Tue, Feb 01, 2022 at 10:04:08AM -0700, Alex Williamson wrote:
> > 
> > > Ok, let me parrot back to see if I understand.  -ENOTTY will be
> > > returned if the ioctl doesn't exist, in which case device_state is
> > > untouched and cannot be trusted.  At the same time, we expect the user
> > > to use the feature ioctl to make sure the ioctl exists, so it would
> > > seem that we've reclaimed that errno if we believe the user should
> > > follow the protocol.  
> > 
> > I don't follow - the documentation says what the code does, if you get
> > ENOTTY returned then you don't get the device_state too. Saying the
> > user shouldn't have called it in the first place is completely
> > correct, but doesn't change the device_state output.
> 
> The documentation says "...the device state output is not reliable", and
> I have to question whether this qualifies as a well specified,
> interoperable spec with such language.  We're essentially asking users
> to keep track that certain errnos result in certain fields of the
> structure _maybe_ being invalid.

So you are asking to remove "is not reliable" and just phrase is as:

"device_state is updated to the current value when -1 is returned,
except when these XXX errnos are returned?

(actually userspace can tell directly without checking the errno - as
if -1 is returned the device_state cannot be the requested target
state anyhow)

> Now you're making me wonder how much I care to invest in semantic
> arguments over extended errnos :-\

Well, I know I don't :) We don't have consistency in the kernel and
userspace is hard pressed to make any sense of it most of the time,
IMHO. It just doesn't practically matter..

> > We don't know the device_state in the core code because it can only be
> > read under locking that is controlled by the driver. I hope when we
> > get another driver merged that we can hoist the locking, but right now
> > I'm not really sure - it is a complicated lock.
> 
> The device cannot self transition to a new state, so if the core were
> to serialize this ioctl then the device_state provided by the driver is
> valid, regardless of its internal locking.

It is allowed to transition to RUNNING due to reset events it captures
and since we capture the reset through the PCI hook, not from VFIO,
the core code doesn't synchronize well. See patch 14

> Whether this ioctl should be serialized anyway is probably another good
> topic to breach.  Should a user be able to have concurrent ioctls
> setting conflicting states?

The driver is required to serialize, the core code doesn't touch any
global state and doesn't need serializing.

> I'd suggest that ioctl return structure is only valid at all on
> success and we add a GET interface to return the current device

We can do this too, but it is a bunch of code to achieve this and I
don't have any use case to read back the device_state beyond debugging
and debugging is fine with this. IMHO

> It's entirely possible that I'm overly averse to ioctl proliferation,
> but for every new ioctl we need to take a critical look at the proposed
> API, use case, applicability, and extensibility.  

This is all basicly the same no matter where it is put, the feature
multiplexer is just an ioctl in some semi-standard format, but the
vfio pattern of argsz/flags is also a standard format that is
basically the same thing.

We still need to think about extensibility, alignment, etc..

The problem I usually see with ioctls is not proliferation, but ending
up with too many choices and a big ?? when it comes to adding
something new.

Clear rules where things should go and why is the best, it matters
less what the rules actually are IMHO.

> > I don't want to touch capabilities, but we can try to use feature for
> > set state. Please confirm this is what you want.
> 
> It's a team sport, but to me it seems like it fits well both in my
> mental model of interacting with a device feature, without
> significantly altering the uAPI you're defining anyway.

Well, my advice is that ioctls are fine, and a bit easier all around.
eg strace and syzkaller are a bit easier if everything neatly maps
into one struct per ioctl - their generator tools are optimized for
this common case.

Simple multiplexors are next-best-fine, but there should be a clear
idea when to use the multiplexer, or not.

Things like the cap chains enter a whole world of adventure for
strace/syzkaller :)

> > You'll want the same for the PRE_COPY related information too?
> 
> I hadn't gotten there yet.  It seems like a discontinuity to me that
> we're handing out new FDs for data transfer sessions, but then we
> require the user to come back to the device to query about the data its
> reading through that other FD.  

An earlier draft of this put it on the data FD, but v6 made it fully
optional with no functional impact on the data FD. The values decrease
as the data FD progresses and increases as the VM dirties data - ie it
is 50/50 data_fd/device behavior.

It doesn't matter which way, but it feels quite weird to have the main
state function is a FEATURE and the precopy query is an ioctl.

> Should that be an ioctl on the data stream FD itself?  

I can be. Implementation wise it is about a wash.

> Is there a use case for also having it on the STOP_COPY FD?

I didn't think of one worthwhile enough to mandate implementing it in
every driver.

> > If we are into these very minor nitpicks does this mean you are OK
> > with all the big topics now?
> 
> I'm not hating it, but I'd like to see buy-in from others who have a
> vested interest in supporting migration.  I don't see Intel or Huawei
> on the Cc list and the original collaborators of the v1 interface
> from

That is an oversight, I'll ping them. I think people have been staying
away until the dust settles.

> NVIDIA have been silent through this redesign.

We've reviewed this internally with them. They reserve judgement on
the data transfer performance until they work on it, but functionally
it has all the necessary semantics.

They have the same P2P issue mlx5 does, and are happy with the
solution under the same general provisions as already discussed for
the Huawei device - RUNNING_P2P is sustainable only while the device
is not touched - ie the VCPU is halted.

The f_ops implemenation we used for mlx5 is basic, the full
performance version would want to use the read/write_iter() fop with
async completions to support the modern zero-copy iouring based data
motion in userspace. This is all part of the standard FD abstraction
and why it is appealing to use it.

Thanks,
Jason