On Wed, 20 Oct 2021 15:59:19 -0300
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Wed, Oct 20, 2021 at 10:52:30AM -0600, Alex Williamson wrote:
>
> > I'm wondering if we're imposing extra requirements on the !_RUNNING
> > state that don't need to be there.  For example, if we can assume
> > that all devices within a userspace context are !_RUNNING before any
> > of the devices begin to retrieve final state, then clearing of the
> > _RUNNING bit becomes the device quiesce point and the beginning of
> > reading device data is the point at which the device state is frozen
> > and serialized.  No new states required and essentially works with a
> > slight rearrangement of the callbacks in this series.  Why can't we
> > do that?
>
> It sounds worth checking carefully.  I didn't come up with a major
> counter scenario.
>
> We would need to specifically define which user action triggers the
> device to freeze and serialize.  Reading pending_bytes I suppose?

The first read of pending_bytes after clearing the _RUNNING bit would be
the logical place to do this since that's what we define as the start of
the cycle for reading the device state.

"Freezing" the device is a valid implementation, but I don't think it's
strictly required per the uAPI.  For instance, there's no requirement
that pending_bytes is reduced by data_size on each iteration; we only
define that the state is complete when the user reads a pending_bytes
value of zero.  So a driver could restart the device state if the device
continues to change (though it's debatable whether triggering an -errno
on the next migration region access might be a more supportable approach
to enforce that userspace has quiesced external access).

> Since freeze is a device command we need to be able to report failure
> and to move the state to error, that feels bad hidden inside reading
> pending_bytes.

That seems like the wrong model.  Reading pending_bytes can return
-errno should an error occur during freeze/serialize, but device_state
is never implicitly modified.  Upon encountering an error reading
pending_bytes, userspace would abort the migration and move the
device_state to a new value.  This is the point at which the driver
would return another -errno to indicate an unrecoverable internal
error, or honor the requested new state.

> > Maybe a clarification of the uAPI spec is sufficient to achieve
> > this, ex. !_RUNNING devices may still update their internal state
> > machine based on external access.  Userspace is expected to quiesce
> > all external access prior to initiating the retrieval of the final
> > device state from the data section of the migration region.  Failure
> > to do so may result in inconsistent device state or optionally the
> > device driver may induce a fault if a quiescent state is not
> > maintained.
>
> But on the other hand this seems such a subtle and tricky way to
> design a uAPI that devices and users are unlikely to implement it
> completely correctly.

I think it's only tricky if device_state is implicitly modified by
other actions; stopping all devices before collecting the state of any
of them seems like a reasonable requirement.  It's the same requirement
we'd add with an NDMA bit.

> IMHO the point of the state is to control the current behavior of the
> device - yes we can control behavior on other operations, but it
> feels obfuscated.

It is; don't do that.  Fail the migration region operation, and fail
the next device state transition if the internal error is irrecoverable.
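To make that ordering concrete, the stop-and-copy side I'm describing
would look something like the sketch below from userspace (untested,
against the current v1 uAPI in linux/vfio.h; save_device_state, mig_off,
and the buffer handling are made up for illustration, and locating the
migration region via the region info capability chain is elided):

    #include <errno.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <linux/vfio.h>

    /*
     * Rough sketch only: assumes the caller has already located the
     * migration region and passes its file offset as mig_off.
     */
    static int save_device_state(int device_fd, off_t mig_off,
                                 void *buf, size_t bufsz)
    {
    #define MIG_FIELD(f) \
            (mig_off + offsetof(struct vfio_device_migration_info, f))
            uint32_t state = VFIO_DEVICE_STATE_SAVING; /* !_RUNNING */
            uint64_t pending, data_offset, data_size;
            size_t used = 0;

            /* Quiesce point: clearing _RUNNING stops the device. */
            if (pwrite(device_fd, &state, sizeof(state),
                       MIG_FIELD(device_state)) != sizeof(state))
                    return -errno;

            for (;;) {
                    /*
                     * The first read of pending_bytes after clearing
                     * _RUNNING is where the driver freezes and
                     * serializes its state.  A failure here is just
                     * -errno from the read; device_state is never
                     * implicitly changed.
                     */
                    if (pread(device_fd, &pending, sizeof(pending),
                              MIG_FIELD(pending_bytes)) != sizeof(pending))
                            goto abort;

                    if (!pending)
                            break;  /* state is complete per the uAPI */

                    if (pread(device_fd, &data_offset, sizeof(data_offset),
                              MIG_FIELD(data_offset)) != sizeof(data_offset) ||
                        pread(device_fd, &data_size, sizeof(data_size),
                              MIG_FIELD(data_size)) != sizeof(data_size))
                            goto abort;

                    if (data_size > bufsz - used ||
                        pread(device_fd, (char *)buf + used, data_size,
                              mig_off + data_offset) != (ssize_t)data_size)
                            goto abort;
                    used += data_size;
            }
            return 0;

    abort:
            /*
             * Abort is explicit: userspace writes a new device_state.
             * The driver honors it, or returns -errno again if the
             * internal error is unrecoverable.
             */
            state = VFIO_DEVICE_STATE_RUNNING;
            pwrite(device_fd, &state, sizeof(state),
                   MIG_FIELD(device_state));
            return -EIO;
    #undef MIG_FIELD
    }

The only writes to device_state here are explicit ones from userspace;
a freeze/serialize failure surfaces as -errno from the pending_bytes
read, and the explicit state write that follows is where the driver
would report an unrecoverable internal error.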
> Especially when the userspace that needs to know about this isn't
> even deployed yet, let's just do it cleanly?

That's an option, but again I'm wondering if we're imposing a
requirement on the !_RUNNING state that doesn't really exist and
whether a clean solution already exists.  Thanks,

Alex