Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Dec 09, 2021 at 04:34:29PM -0700, Alex Williamson wrote:
> A new NDMA state is being proposed to support a quiescent state for
> contexts containing multiple devices with peer-to-peer DMA support.
> Formally define it.
> 
> Clarify various aspects of the migration region data fields and
> protocol.  Remove QEMU related terminology and flows from the uAPI;
> these will be provided in Documentation/ so as not to confuse the
> device_state bitfield with a finite state machine with restricted
> state transitions.
> 
> Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
>  include/uapi/linux/vfio.h |  405 ++++++++++++++++++++++++---------------------
>  1 file changed, 214 insertions(+), 191 deletions(-)

I need other peope to look this over, so these are just some quick
remarks. Thanks for doing it, it is very good.

Given I'm almost on vacation till Jan I think we will shortly have to
table this discussion to January.

But, if you are happy with this as all that is needed to do mlx5 we
can possibly have the v6 updated in early January, after the next
merge window.

Though lets try to quickly decide on what to do about the "change
multiple bits" below, please.

> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ef33ea002b0b..1fdbc928f886 100644
> +++ b/include/uapi/linux/vfio.h
> @@ -408,199 +408,211 @@ struct vfio_region_gfx_edid {
>  #define VFIO_REGION_SUBTYPE_MIGRATION           (1)
>  
>  /*
> + * The structure vfio_device_migration_info is placed at the immediate start of
> + * the per-device VFIO_REGION_SUBTYPE_MIGRATION region to manage the device
> + * state and migration information for the device.  Field accesses for this
> + * structure are only supported using their native width and alignment,
> + * accesses otherwise are undefined and the kernel migration driver should
> + * return an error.
>   *
>   * device_state: (read/write)
> + *   The device_state field is a bitfield written by the user to transition
> + *   the associated device between valid migration states using these rules:
> + *     - The user may read or write the device state register at any time.
> + *     - The kernel migration driver must fully transition the device to the
> + *       new state value before the write(2) operation returns to the user.
> + *     - The user may change multiple bits of the bitfield in the same write
> + *       operation, so long as the resulting state is valid.

I would like to forbid this. It is really too complicated, and if
there is not a strongly defined device behavior when this is done it
is not inter-operable for userspace to do it.

> + *     - The kernel migration driver must not generate asynchronous device
> + *       state transitions outside of manipulation by the user or the
> + *       VFIO_DEVICE_RESET ioctl as described below.
> + *     - In the event of a device state transition failure, the kernel
> + *       migration driver must return a write(2) error with appropriate errno
> + *       to the user.
> + *     - Upon such an error, re-reading the device_state field must indicate
> + *       the device migration state as either the same state as prior to the
> + *       failing write or, at the migration driver discretion, indicate the
> + *       device state as VFIO_DEVICE_STATE_ERROR.

It is because this is complete nightmare. Let's say the user goes from
0 -> SAVING | RUNNING and SAVING fails after we succeed to do
RUNNING. We have to also backtrack and undo RUNNING, but what if that
fails too? Oh and we have to keep track of all this backtracking while
executing the new state and write a bunch of complicated never tested
error handling code.

Assuming we can even figure out what the precedence of multiple bits
even means for interoperability.

Backed into this is an assumption that any device transition is fully
atomic - that just isn't what I see any of the HW has done.

I thought we could tackled this when you first suggested it (eg copy
the mlx5 logic and be OK), but now I'm very skeptical. The idea that
every driver can do this right in all the corner cases doesn't seem
reasonable given we've made so many errors here already just in mlx5.

> + *     - Bit 1 (SAVING) [REQUIRED]:
> + *        - Setting this bit enables and initializes the migration region data

I would use the word clear instead of initialize - the point of this
is to throw away any data that may be left over in the window from any
prior actions.

> + *          window and associated fields within vfio_device_migration_info for
> + *          capturing the migration data stream for the device.  The migration
> + *          driver may perform actions such as enabling dirty logging of device
> + *          state with this bit.  The SAVING bit is mutually exclusive with the
> + *          RESUMING bit defined below.
> + *        - Clearing this bit (ie. !SAVING) de-initializes the migration region
> + *          data window and indicates the completion or termination of the
> + *          migration data stream for the device.

I don't know what "de-initialized" means as something a device should
do? IMHO there is no need to talk about the migration window here,
SAVING says initialize/clear - and data_offset/etc say their values
are undefined outside SAVING/RUNNING. That is enough.

> + *     - Bit 2 (RESUMING) [REQUIRED]:
> + *        - Setting this bit enables and initializes the migration region data
> + *          window and associated fields within vfio_device_migration_info for
> + *          restoring the device from a migration data stream captured from a
> + *          SAVING session with a compatible device.  The migration driver may
> + *          perform internal device resets as necessary to reinitialize the
> + *          internal device state for the incoming migration data.
> + *        - Clearing this bit (ie. !RESUMING) de-initializes the migration
> + *          region data window and indicates the end of a resuming session for
> + *          the device.  The kernel migration driver should complete the
> + *          incorporation of data written to the migration data window into the
> + *          device internal state and perform final validity and consistency
> + *          checking of the new device state.  If the user provided data is
> + *          found to be incomplete, inconsistent, or otherwise invalid, the
> + *          migration driver must indicate a write(2) error and follow the
> + *          previously described protocol to return either the previous state
> + *          or an error state.

Prefer this is just 'go to an error state' to avoid unnecessary
implementation differences.

> + *     - Bit 3 (NDMA) [OPTIONAL]:
> + *        The NDMA or "No DMA" state is intended to be a quiescent state for
> + *        the device for the purposes of managing multiple devices within a
> + *        user context where peer-to-peer DMA between devices may be active.
> + *        Support for the NDMA bit is indicated through the presence of the
> + *        VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by
> + *        VFIO_DEVICE_GET_REGION_INFO for the associated device migration
> + *        region.
> + *        - Setting this bit must prevent the device from initiating any
> + *          new DMA or interrupt transactions.  The migration driver must

I'm not sure about interrupts.

> + *          complete any such outstanding operations prior to completing
> + *          the transition to the NDMA state.  The NDMA device_state

Reading this as you wrote it and I suddenly have a doubt about the PRI
use case. Is it reasonable that the kernel driver will block on NDMA
waiting for another userspace thread to resolve any outstanding PRIs?

Can that allow userspace to deadlock the kernel or device? Is there an
alterative?

> + *   All combinations for the above defined device_state bits are considered
> + *   valid with the following exceptions:
> + *     - RESUMING and SAVING are mutually exclusive, all combinations of
> + *       (RESUMING | SAVING) are invalid.  Furthermore the specific combination
> + *       (!NDMA | RESUMING | SAVING | !RUNNING) is reserved to indicate the
> + *       device error state VFIO_DEVICE_STATE_ERROR.  This variant is
> + *       specifically chosen due to the !RUNNING state of the device as the
> + *       migration driver should do everything possible, including an internal
> + *       reset of the device, to ensure that the device is fully stopped in
> + *       this state.  

Prefer we don't specify this. ERROR is undefined behavior and
userspace should reset. Any path that leads along to ERROR already
includes possiblities for wild DMAs and what not, so there is nothing
to be gained by this other than causing a lot of driver complexity,
IMHO.

> + *   Migration drivers should attempt to support any transition between valid

should? must, I think.

The whole migration window definition seems quite straightforward now!

> + * a) The user reads pending_bytes.  If the read value is zero, no data is
> + *    currently available for the device.  If the device is !RUNNING and a
> + *    zero value is read, this indicates the end of the device migration
> + *    stream and the device must not generate any new migration data.  If
> + *    the read value is non-zero, the user may proceed to collect device
> + *    migration data in step b).  Repeated reads of pending_bytes is allowed
> + *    and must not compromise the migration data stream provided the user does
> + *    not proceed to the following step.

Add what to do in SAVING|RUNNING if pending bytes is 0?

>  #define VFIO_DEVICE_STATE_SET_ERROR(state) \
> -	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \
> -					     VFIO_DEVICE_STATE_RESUMING)
> +	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_STATE_ERROR)

We should delete this macro. It only makes sense used in a driver does
not belong in the uapi header.

Thanks,
Jason



[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux