Hi Kirti,

On 3/19/20 9:16 PM, Kirti Wankhede wrote:
> - Defined MIGRATION region type and sub-type.
>
> - Defined vfio_device_migration_info structure which will be placed at the
>   0th offset of migration region to get/set VFIO device related
>   information. Defined members of structure and usage on read/write access.
>
> - Defined device states and state transition details.
>
> - Defined sequence to be followed while saving and resuming VFIO device.
>
> Signed-off-by: Kirti Wankhede <kwankhede@xxxxxxxxxx>
> Reviewed-by: Neo Jia <cjia@xxxxxxxxxx>

Please forgive me, I have just discovered that v15 was available. Below you
will find the 2 main points I find difficult to understand when reading the
documentation.

> ---
>  include/uapi/linux/vfio.h | 227 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 227 insertions(+)
>
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9e843a147ead..d0021467af53 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
>  #define VFIO_REGION_TYPE_GFX                    (1)
>  #define VFIO_REGION_TYPE_CCW                    (2)
> +#define VFIO_REGION_TYPE_MIGRATION              (3)
>
>  /* sub-types for VFIO_REGION_TYPE_PCI_* */
>
> @@ -379,6 +380,232 @@ struct vfio_region_gfx_edid {
>  /* sub-types for VFIO_REGION_TYPE_CCW */
>  #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD       (1)
>
> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
> +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
> +
> +/*
> + * The structure vfio_device_migration_info is placed at the 0th offset of
> + * the VFIO_REGION_SUBTYPE_MIGRATION region to get and set VFIO device related
> + * migration information. Field accesses from this structure are only supported
> + * at their native width and alignment. Otherwise, the result is undefined and
> + * vendor drivers should return an error.
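
For reference, this is how the "native width and alignment" rule above reads
from the user side. It is only a sketch under assumptions: the helper names
(mig_read_field, mig_read_device_state, mig_read_pending_bytes) are
hypothetical, region_off stands for the region offset reported by
VFIO_DEVICE_GET_REGION_INFO, and the struct layout is copied from the
definition quoted at the end of this patch.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Layout copied from the quoted patch (__u32/__u64 spelled with stdint types). */
struct vfio_device_migration_info {
	uint32_t device_state;
	uint32_t reserved;
	uint64_t pending_bytes;
	uint64_t data_offset;
	uint64_t data_size;
} __attribute__((packed));

/*
 * Hypothetical helper: read exactly one field, at its own offset and with
 * its own width, never as part of a larger or unaligned access.
 * Returns 0 on success or a negative errno value.
 */
static int mig_read_field(int fd, off_t region_off, size_t field_off,
			  void *buf, size_t len)
{
	ssize_t n = pread(fd, buf, len, region_off + (off_t)field_off);

	if (n < 0)
		return -errno;
	return (size_t)n == len ? 0 : -EIO;
}

/* 4-byte access for the __u32 device_state field. */
static int mig_read_device_state(int fd, off_t region_off, uint32_t *state)
{
	return mig_read_field(fd, region_off,
			      offsetof(struct vfio_device_migration_info,
				       device_state),
			      state, sizeof(*state));
}

/* 8-byte access for the __u64 pending_bytes field. */
static int mig_read_pending_bytes(int fd, off_t region_off, uint64_t *pending)
{
	return mig_read_field(fd, region_off,
			      offsetof(struct vfio_device_migration_info,
				       pending_bytes),
			      pending, sizeof(*pending));
}
```

Reading the whole structure with a single pread() would not be a native-width
field access, so per the text above its result would be undefined.
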
> + *
> + * device_state: (read/write)
> + *      - The user application writes to this field to inform the vendor driver
> + *        about the device state to be transitioned to.
> + *      - The vendor driver should take the necessary actions to change the
> + *        device state. After successful transition to a given state, the
> + *        vendor driver should return success on write(device_state, state)
> + *        system call. If the device state transition fails, the vendor driver
> + *        should return an appropriate -errno for the fault condition.
> + *      - On the user application side, if the device state transition fails,
> + *        that is, if write(device_state, state) returns an error, read
> + *        device_state again to determine the current state of the device from
> + *        the vendor driver.
> + *      - The vendor driver should return previous state of the device unless
> + *        the vendor driver has encountered an internal error, in which case
> + *        the vendor driver may report the device_state VFIO_DEVICE_STATE_ERROR.
> + *      - The user application must use the device reset ioctl to recover the
> + *        device from VFIO_DEVICE_STATE_ERROR state. If the device is
> + *        indicated to be in a valid device state by reading device_state, the
> + *        user application may attempt to transition the device to any valid
> + *        state reachable from the current state or terminate itself.
> + *
> + *      device_state consists of 3 bits:
> + *      - If bit 0 is set, it indicates the _RUNNING state. If bit 0 is clear,
> + *        it indicates the _STOP state. When the device state is changed to
> + *        _STOP, driver should stop the device before write() returns.
> + *      - If bit 1 is set, it indicates the _SAVING state, which means that the
> + *        driver should start gathering device state information that will be
> + *        provided to the VFIO user application to save the device's state.
> + *      - If bit 2 is set, it indicates the _RESUMING state, which means that
> + *        the driver should prepare to resume the device.
> + *        Data provided through the migration region should be used to resume
> + *        the device.
> + *      Bits 3 - 31 are reserved for future use. To preserve them, the user
> + *      application should perform a read-modify-write operation on this
> + *      field when modifying the specified bits.
> + *
> + *  +------- _RESUMING
> + *  |+------ _SAVING
> + *  ||+----- _RUNNING
> + *  |||
> + *  000b => Device Stopped, not saving or resuming
> + *  001b => Device running, which is the default state
> + *  010b => Stop the device & save the device state, stop-and-copy state
> + *  011b => Device running and save the device state, pre-copy state
> + *  100b => Device stopped and the device state is resuming
> + *  101b => Invalid state
> + *  110b => Error state
> + *  111b => Invalid state
> + *
> + * State transitions:
> + *
> + *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
> + *                (100b)     (001b)     (011b)        (010b)       (000b)
> + * 0. Running or default state
> + *                             |
> + *
> + * 1. Normal Shutdown (optional)
> + *                             |------------------------------------->|
> + *
> + * 2. Save the state or suspend
> + *                             |------------------------->|---------->|
> + *
> + * 3. Save the state during live migration
> + *                             |----------->|------------>|---------->|
> + *
> + * 4. Resuming
> + *                  |<---------|
> + *
> + * 5. Resumed
> + *                  |--------->|
> + *
> + * 0. Default state of VFIO device is _RUNNNG when the user application starts.
> + * 1. During normal shutdown of the user application, the user application may
> + *    optionally change the VFIO device state from _RUNNING to _STOP. This
> + *    transition is optional. The vendor driver must support this transition but
> + *    must not require it.
> + * 2. When the user application saves state or suspends the application, the
> + *    device state transitions from _RUNNING to stop-and-copy and then to _STOP.
> + *    On state transition from _RUNNING to stop-and-copy, driver must stop the
> + *    device, save the device state and send it to the application through the
> + *    migration region.
> + *    The sequence to be followed for such transition is given below.
> + * 3. In live migration of user application, the state transitions from _RUNNING
> + *    to pre-copy, to stop-and-copy, and to _STOP.
> + *    On state transition from _RUNNING to pre-copy, the driver should start
> + *    gathering the device state while the application is still running and send
> + *    the device state data to application through the migration region.
> + *    On state transition from pre-copy to stop-and-copy, the driver must stop
> + *    the device, save the device state and send it to the user application
> + *    through the migration region.
> + *    Vendor drivers must support the pre-copy state even for implementations
> + *    where no data is provided to the user before the stop-and-copy state. The
> + *    user must not be required to consume all migration data before the device
> + *    transitions to a new state, including the stop-and-copy state.
> + *    The sequence to be followed for above two transitions is given below.
> + * 4. To start the resuming phase, the device state should be transitioned from
> + *    the _RUNNING to the _RESUMING state.
> + *    In the _RESUMING state, the driver should use the device state data
> + *    received through the migration region to resume the device.
> + * 5. After providing saved device data to the driver, the application should
> + *    change the state from _RESUMING to _RUNNING.
> + *
> + * reserved:
> + *      Reads on this field return zero and writes are ignored.
> + *
> + * pending_bytes: (read only)
> + *      The number of pending bytes still to be migrated from the vendor driver.
> + *
> + * data_offset: (read only)
> + *      The user application should read data_offset in the migration region
> + *      from where the user application should read the device data during the
> + *      _SAVING state or write the device data during the _RESUMING state. See
> + *      below for details of sequence to be followed.
> + *
> + * data_size: (read/write)
> + *      The user application should read data_size to get the size in bytes of
> + *      the data copied in the migration region during the _SAVING state and
> + *      write the size in bytes of the data copied in the migration region
> + *      during the _RESUMING state.
> + *
> + * The format of the migration region is as follows:
> + *  ------------------------------------------------------------------
> + * |vfio_device_migration_info|    data section                      |
> + * |                          |     ///////////////////////////////  |
> + * ------------------------------------------------------------------
> + *   ^                              ^
> + *  offset 0-trapped part        data_offset
> + *
> + * The structure vfio_device_migration_info is always followed by the data
> + * section in the region, so data_offset will always be nonzero. The offset
> + * from where the data is copied is decided by the kernel driver. The data
> + * section can be trapped, mapped, or partitioned, depending on how the kernel
> + * driver defines the data section. The data section partition can be defined
> + * as mapped by the sparse mmap capability. If mmapped, data_offset should be
> + * page aligned, whereas initial section which contains the
> + * vfio_device_migration_info structure, might not end at the offset, which is
> + * page aligned. The user is not required to access through mmap regardless
> + * of the capabilities of the region mmap.
> + * The vendor driver should determine whether and how to partition the data
> + * section. The vendor driver should return data_offset accordingly.
> + *
> + * The sequence to be followed for the _SAVING|_RUNNING device state or
> + * pre-copy phase and for the _SAVING device state or stop-and-copy phase is as
> + * follows:
> + * a. Read pending_bytes, indicating the start of a new iteration to get device
> + *    data. Repeated read on pending_bytes at this stage should have no side
> + *    effects.
> + *    If pending_bytes == 0, the user application should not iterate to get data
> + *    for that device.

I do not feel comfortable with the above sentence. In the pre-copy state the
device is still running and, as I understand it, nothing prevents new state
data from appearing even after pending_bytes has reached 0.

> + *    If pending_bytes > 0, perform the following steps.
> + * b. Read data_offset, indicating that the vendor driver should make data
> + *    available through the data section. The vendor driver should return this
> + *    read operation only after data is available from (region + data_offset)
> + *    to (region + data_offset + data_size).
> + * c. Read data_size, which is the amount of data in bytes available through
> + *    the migration region.
> + *    Read on data_offset and data_size should return the offset and size of
> + *    the current buffer if the user application reads data_offset and
> + *    data_size more than once here.
> + * d. Read data_size bytes of data from (region + data_offset) from the
> + *    migration region.
> + * e. Process the data.
> + * f. Read pending_bytes, which indicates that the data from the previous
> + *    iteration has been read. If pending_bytes > 0, go to step b.
> + *
> + * If an error occurs during the above sequence, the vendor driver can return
> + * an error code for next read() or write() operation, which will terminate the
> + * loop. The user application should then take the next necessary action, for
> + * example, failing migration or terminating the user application.
> + *
> + * The user application can transition from the _SAVING|_RUNNING
> + * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of the
> + * number of pending bytes. The user application should iterate in _SAVING
> + * (stop-and-copy) until pending_bytes is 0.
> + *
> + * The sequence to be followed while _RESUMING device state is as follows:
> + * While data for this device is available, repeat the following steps:
> + * a. Read data_offset from where the user application should write data.
> + * b. Write migration data starting at the migration region + data_offset for
> + *    the length determined by data_size from the migration source.
> + * c. Write data_size, which indicates to the vendor driver that data is
> + *    written in the migration region. Vendor driver should apply the
> + *    user-provided migration region data to the device resume state.

It is not clear to me when the data gets consumed by the device. Is the write
to data_size blocking? Does data_offset move, so that the user data cannot be
overwritten? Can the user application refill the data section immediately? At
least some hints about a possible implementation would ease the understanding.

Thanks

Eric

> + *
> + * For the user application, data is opaque. The user application should write
> + * data in the same order as the data is received and the data should be of
> + * same transaction size at the source.
> + */
> +
> +struct vfio_device_migration_info {
> +        __u32 device_state;         /* VFIO device state */
> +#define VFIO_DEVICE_STATE_STOP      (0)
> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> +                                     VFIO_DEVICE_STATE_SAVING |  \
> +                                     VFIO_DEVICE_STATE_RESUMING)
> +
> +#define VFIO_DEVICE_STATE_VALID(state) \
> +        (state & VFIO_DEVICE_STATE_RESUMING ? \
> +        (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
> +
> +#define VFIO_DEVICE_STATE_IS_ERROR(state) \
> +        ((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
> +                                              VFIO_DEVICE_STATE_RESUMING))
> +
> +#define VFIO_DEVICE_STATE_SET_ERROR(state) \
> +        ((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \
> +                                             VFIO_DEVICE_STATE_RESUMING)
> +
> +        __u32 reserved;
> +        __u64 pending_bytes;
> +        __u64 data_offset;
> +        __u64 data_size;
> +} __attribute__((packed));
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
>
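
One quick way to check the three state macros quoted above against the 3-bit
table is to transcribe them into small functions. The sketch below does only
that: the function names (state_valid, state_is_error, state_set_error) are
hypothetical and the bit values are copied from the patch. Two things show up
when doing so. VFIO_DEVICE_STATE_SET_ERROR refers to VFIO_DEVICE_SATE_SAVING
(missing "T"), which is not defined anywhere in the patch, so any use of the
macro would fail to build; the sketch uses VFIO_DEVICE_STATE_SAVING instead.
Also, the macro argument is not parenthesized, so an argument such as "a | b"
would be evaluated as "a | (b & ...)"; plain functions avoid that.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values copied from the quoted patch. */
#define VFIO_DEVICE_STATE_STOP      (0)
#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
				     VFIO_DEVICE_STATE_SAVING |  \
				     VFIO_DEVICE_STATE_RESUMING)

/*
 * Same logic as VFIO_DEVICE_STATE_VALID: once _RESUMING is set, no other
 * state bit may be set. Note that 110b (the error state) is therefore
 * reported as not valid, like 101b and 111b.
 */
static inline bool state_valid(uint32_t state)
{
	if (state & VFIO_DEVICE_STATE_RESUMING)
		return (state & VFIO_DEVICE_STATE_MASK) ==
		       VFIO_DEVICE_STATE_RESUMING;
	return true;
}

/* Same logic as VFIO_DEVICE_STATE_IS_ERROR: the three state bits read 110b. */
static inline bool state_is_error(uint32_t state)
{
	return (state & VFIO_DEVICE_STATE_MASK) ==
	       (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING);
}

/*
 * Intended logic of VFIO_DEVICE_STATE_SET_ERROR, with the identifier spelled
 * VFIO_DEVICE_STATE_SAVING: keep the reserved bits 3 - 31 and force the
 * three state bits to 110b.
 */
static inline uint32_t state_set_error(uint32_t state)
{
	return (state & ~(uint32_t)VFIO_DEVICE_STATE_MASK) |
	       VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING;
}
```

With these, 000b through 100b are valid, 101b, 110b and 111b are not, and only
110b is reported as the error state, which matches the table in the comment.
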