Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state

Jason Gunthorpe <jgg@xxxxxxxxxx> · Tue, 4 Jan 2022 16:28:34 -0400

On Mon, Dec 20, 2021 at 03:26:23PM -0700, Alex Williamson wrote:

> > It also raises the question that it seems not well defined what the
> > sequence:
> > 
> > SAVING -> SAVING | RUNNING
> > 
> > Even does to the migration window?
> > 
> > Nor can I imagine what mlx5 could do beyond fail this or corrupt the
> > migration..
> 
> I think this comes down to the robustness of the migration driver.  The
> migration data window and control of how userspace is to interact with
> it is essentially meant to allow the migration driver to implement its
> own transport protocol.  In the case of mlx5 where it expects only to
> apply the received migration data on the RESUMING -> RUNNING
> transition, a "short" data segment might be reserved for providing
> sequencing information.  Each time the device enters SAVING | !RUNNING
> the driver might begin by presenting a new sequence header.  On the
> target, a new sequence header would cause any previously received
> migration data to be discarded.  A similar header would also be
> suggested to validate the migration data stream is appropriate for the
> target device.

Honestly, I have no interest in implementing something so complicated
for what should be a simple operation. We have no use case for this,
no desire to test it, it is just pure kernel cruft and complexity to
do this kind of extra work.

I think it is very telling that we are, what, 10 weeks into this now,
we have seen two HW drivers posted, and NOTHING implements like you
are imagining here. I doubt the closed drivers do any better.

Let's not make up busy work that can't be strongly justified! 

That is substantially what I see as wrong with this whole line of
thinking that the device_state must be independent bits, not a
constrained FSM.

We are actively failing, again and again, to tell if drivers are
implementing this mulit-bit vision correctly, or even specify properly
how it should work in an implementable way.

> > IMHO, we should be simplifing this before it becomes permanent API,
> > not trying to force it to work.
> 
> I agree, this is our opportunity to simplify before we're committed,
> but I don't see how we can throw out perfectly valid transitions like
> SAVING -> SAVING | RUNNING just because the driver hasn't accounted for
> managing data in the data stream.

Don't see? What do you mean? We showed how to do this exactly in the
v1.

We define the meaning of device_state to be actual FSM and write out
the allowed FSM arcs and then properly describe them.

This is what everyone else though this was already.

AFAICT you are the only person to view this as a bunch of bits. It is
why the original header file comment gave names to each of the states
and roughtly sketches out legal state transition arcs with the odd
ascii art.

So I want to stop trying to make the bunch of bits idea work. Let's
make it a FSM, let's define exactly the legal arcs, and then define
the behaviors in each FSM state. It is easy to understand and we have
a hope to get inter-operable implementations.

All the driver postings so far are demonstrating they don't get
oddball transition arcs correct, and we are clearly not able to find
these things during review.

And now you are asking for alot of extra work and complications in the
driver to support arcs that will never be used - that is really too
far, sorry.

> > Most of things I brought up in this post are resolved by the forced
> > FSM.
> 
> Until userspace tries to do something different than exactly what it
> does today, and then what?

It can't. We define the API to be exactly the permited arcs and no
others. That is what simplify means.

If we need to add new FSM arcs and new states later then their support
can be exposed via a cap flag.

This is much better than trying to define all arcs as legal, only
testing/using a small subset and hoping the somehow in the future this
results in extensible interoperability.

> > Another view is that staying in a useless state is also pointless and
> > we may as well return ERROR anyhow. Eg if exiting RESUMING failed
> > there is no other action to take besides RESET, so why didn't we
> > return ERROR to tell this directly to userspace?
> 
> And then the last packet arrives, gets written to the device that's
> still in RESUMING, and now can transition to RUNNING.

Huh? If the device indicated error during RESUMING userspace should
probably stop shoving packets into it or it will possibly corrupt the
migration stream.

> But I think we're really only after that p2p behavior and we've
> discussed that disabling p2p mappings in the VM would be a sufficient
> condition to support multiple devices without NDMA support.  I think
> that means DMA to memory is fine and DMA related to MSI is fine... but
> how does a device know which DMA is memory and which DMA is another
> device?

The device doesn't know if a particular DMA is P2P or not. This is why
the device action is called 'NO DMA'.

MSI is fine to be left unspecified because we currently virtualize all
the MSI register writes and it is impossible for a hostile guest to
corrupt them to point the address to anything but the interrupt
controller. If a MSI triggers or not in NDMA doesn't practically
matter.

It only starts to matter someday if we get the world Thomas is
thinking about where the guest can directly program the MSI registers.

> > Even if it did work reliably, the requirement is userspace must issue
> > RESET to exit ERROR and if we say the driver has to issue reset to
> > enter ERROR we are just doing a pointless double RESET.
> 
> Please read what I wrote:
> 
>     This variant is specifically chosen due to the !RUNNING state of
>     the device as the migration driver should do everything possible,
>     including an internal reset of the device, to ensure that the
>     device is fully stopped in this state.
> 
> That does not say that a driver must issue a reset to enter the ERROR
> state.  

Huh? "everything possible including an internal reset" sure sounds
like a device must issue a reset in some cases. If we keep with your
idea to rarely use ERROR then I think all the mlx5 cases that would
hit it are already so messed up that RESET will be mandatory.

> I don't have a problem if the reset is redundant to one the user needs
> to do anyway, I'd rather see any externally visible operation of the
> device stopped ASAP.  

Why? It was doing all those things before it had an error, why
should it suddenly stop now? What is this extra work helping?

Remember if we choose to return an error code instead of ERROR the
device is still running as it was, I don't see an benifit to making
ERROR different here.

ERROR just means the device has to be reset, we don't need the device
to stop doing what it was doing.

> The new and old state NDMA-like properties is also irrelevant, if a
> device enters an ERROR state moving from RUNNING -> SAVING | RUNNING
> it shouldn't continue manipulating memory and generating interrupts
> in the background.

I prefer a model where the device is allowed to keep doing whatever it
was doing before it hit the error. You are pushing for a model where
upon error we must force the device to stop.

For this view it is why the old state matters, if it was previously
allowed to DMA then it continues to be allowed to do DMA, etc.

> > The mlx5 v1 with the FSM didn't have alot of these problems we are
> > discussing. It didn't have precedence issues, it didn't have problems
> > executing odd combinations it can't support, it worked and was simple
> > to understand.
> 
> And an audit of that driver during review found that it grossly failed
> to meet the spirit of a "should" requirement.

That isn't how I see things.. The v1 driver implemented the uAPI we
all thought existed, was the FSM based uAPI the original patch authors
intended, and implemented only the FSM arcs discussed in the header
file comment.

Your idea that this is not a FSM seems to be unique here. I think
we've explored it to a reasonable conclusion to find it isn't working
out. Let's stop please.

Yishai can prepare a version with the FSM design back in including
NDMA and we can look at it.

> > In which case, why not formally specify the FSM now and avoid a driver
> > pushing a defacto spec?
> 
> It really only takes one driver implementing something like SAVING ->
> SAVING | RUNNING and QEMU taking advantage of it as a supported
> transition per the uAPI for mlx5 to be left out of the feature that
> might provide.

The uAPI spec is irrelavent, qemu can't just suddenly start doing
things that don't work on supported drivers.

What we've seen many times in the kernel is that uapis that don't have
driver interoperability don't succeed.

Jason