Provide some more complete documentation for the migration region's
behavior, specifically focusing on the device_state bits and the whole
system view from a VMM.

Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxx>
---
 Documentation/driver-api/vfio.rst | 208 +++++++++++++++++++++++++++++-
 1 file changed, 207 insertions(+), 1 deletion(-)

Alex/Cornelia, here is the first draft of the requested documentation I
promised. We think it includes all the feedback from hns, Intel and NVIDIA
on this mechanism.

Our thinking is that NDMA would be implemented like this:

+#define VFIO_DEVICE_STATE_NDMA (1 << 3)

And a .add_capability op will be used to signal driver support to
userspace:

+#define VFIO_REGION_INFO_CAP_MIGRATION_NDMA 6

I've described DIRTY TRACKING as a separate concept here. With the current
uAPI this would be controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START; with
our change in direction it would be a per-tracker control, but with no
semantic change.

Once there is some agreement we'll include this patch in the next
iteration of the mlx5 driver along with the NDMA bits.

Thanks,
Jason

diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst
index c663b6f978255b..b28c6fb89ee92f 100644
--- a/Documentation/driver-api/vfio.rst
+++ b/Documentation/driver-api/vfio.rst
@@ -242,7 +242,213 @@ group and can access them as follows::
 VFIO User API
 -------------------------------------------------------------------------------
 
-Please see include/linux/vfio.h for complete API documentation.
+Please see include/uapi/linux/vfio.h for complete API documentation.
+
+-------------------------------------------------------------------------------
+
+VFIO migration driver API
+-------------------------------------------------------------------------------
+
+VFIO drivers that support migration implement a migration control register
+called device_state in struct vfio_device_migration_info, which lives in
+the device's VFIO_REGION_TYPE_MIGRATION region.
+
+The device_state register triggers device actions when bits are set or
+cleared, and also imposes continuous behavior on the device for as long as
+each bit remains set. The VMM additionally controls whether the VCPUs of
+the VM are executing (VCPU RUNNING) and whether the IOMMU is logging DMAs
+(DIRTY TRACKING). These two controls are not part of the device_state
+register; KVM is used to control the VCPUs, and
+VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the container controls dirty
+tracking.
+
+Along with the device_state the migration driver provides a data window
+which allows streaming migration data into or out of the device.
+
+A lot of flexibility is provided to userspace in how it operates these
+bits.
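+
+Purely as an illustration (a sketch, not a normative part of this uAPI), a
+VMM could step device_state between these states with a small helper such
+as the one below. It assumes the file offset of the
+VFIO_REGION_TYPE_MIGRATION region within the device FD has already been
+located via VFIO_DEVICE_GET_REGION_INFO (not shown), uses the region
+layout described above, and omits error handling::
+
+  #include <stddef.h>
+  #include <sys/types.h>
+  #include <unistd.h>
+  #include <linux/vfio.h>
+
+  /* Write a new device_state value into the migration region header */
+  int set_device_state(int device_fd, off_t mig_region_offset,
+                       __u32 new_state)
+  {
+          off_t off = mig_region_offset +
+                  offsetof(struct vfio_device_migration_info, device_state);
+
+          if (pwrite(device_fd, &new_state, sizeof(new_state), off) !=
+              sizeof(new_state))
+                  return -1;
+          return 0;
+  }
+
+For example, pre-copy is entered by writing VFIO_DEVICE_STATE_SAVING |
+VFIO_DEVICE_STATE_RUNNING, and the final stop-and-copy state by writing
+VFIO_DEVICE_STATE_SAVING alone.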
+
+The reference flow for saving device state in a live migration, with all
+features:
+
+  RUNNING, VCPU RUNNING
+     Normal operating state
+  RUNNING, DIRTY TRACKING, VCPU RUNNING
+     Log DMAs
+     Stream all memory
+  SAVING | RUNNING, DIRTY TRACKING, VCPU RUNNING
+     Log internal device changes (pre-copy)
+     Stream device state through the migration window
+
+     While in this state repeat as desired:
+        Atomic Read and Clear DMA Dirty log
+        Stream dirty memory
+  SAVING | NDMA | RUNNING, VCPU RUNNING
+     vIOMMU grace state
+     Complete all in progress IO page faults, idle the vIOMMU
+  SAVING | NDMA | RUNNING
+     Peer to Peer DMA grace state
+     Final snapshot of DMA dirty log (atomic not required)
+  SAVING
+     Stream final device state through the migration window
+     Copy final dirty data
+  0
+     Device is halted
+
+and the reference flow for resuming:
+
+  RUNNING
+     Issue VFIO_DEVICE_RESET to clear the internal device state
+  0
+     Device is halted
+  RESUMING
+     Push in migration data. Data captured during pre-copy should be
+     prepended to data captured during SAVING.
+  NDMA | RUNNING
+     Peer to Peer DMA grace state
+  RUNNING, VCPU RUNNING
+     Normal operating state
+
+If the VMM has multiple VFIO devices undergoing migration then the grace
+states act as cross device synchronization points. The VMM must bring all
+devices to the grace state before advancing past it.
+
+To support these operations the migration driver is required to implement
+specific behaviors around the device_state.
+
+Actions on Set/Clear:
+ - SAVING | RUNNING
+   The device clears the data window and begins streaming 'pre copy'
+   migration data through the window. Devices that cannot log internal
+   state changes return a 0 length migration stream.
+
+ - SAVING | !RUNNING
+   The device captures its internal state and begins streaming migration
+   data through the migration window.
+
+ - RESUMING
+   The data window is opened and can receive the migration data.
+
+ - !RESUMING
+   All the data transferred into the data window is loaded into the
+   device's internal state. The migration driver can rely on userspace
+   issuing a VFIO_DEVICE_RESET prior to starting RESUMING.
+
+ - DIRTY TRACKING
+   On set, clear the DMA log and start logging.
+
+   On clear, freeze the DMA log and allow userspace to read it. Userspace
+   must take care to ensure that DMA is suspended before clearing DIRTY
+   TRACKING, for instance by using NDMA.
+
+   DMA logs should be readable with an "atomic test and clear" to allow
+   continuous non-disruptive sampling of the log.
+
+Continuous Actions:
+ - NDMA
+   The device is not allowed to issue new DMA operations. Before the write
+   setting NDMA returns, all in progress DMAs must be completed.
+
+ - !RUNNING
+   The device should not change its internal state. Implies NDMA. Any
+   internal state logging can stop.
+
+ - SAVING | !RUNNING
+   RESUMING | !RUNNING
+   The device may assume there are no incoming MMIO operations.
+
+ - RUNNING
+   The device can alter its internal state and must respond to incoming
+   MMIO operations.
+
+ - SAVING | RUNNING
+   The device is logging changes to its internal state.
+
+ - !VCPU RUNNING
+   The VCPUs must not generate dirty pages or issue MMIO operations to
+   devices.
+
+ - DIRTY TRACKING
+   DMAs are logged.
+
+ - ERROR
+   The behavior of the device is undefined. The device must be recovered
+   by issuing VFIO_DEVICE_RESET.
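+
+To make the data window concrete, below is a rough sketch (again not
+normative) of how userspace could drain it while SAVING is set, using the
+pending_bytes, data_offset and data_size fields of struct
+vfio_device_migration_info. save_buffer() is a stand-in for however the
+VMM stores the stream, mig_region_offset is the migration region offset as
+in the earlier sketch, and error handling is omitted::
+
+  #include <stddef.h>
+  #include <stdlib.h>
+  #include <sys/types.h>
+  #include <unistd.h>
+  #include <linux/vfio.h>
+
+  /* Read one __u64 field of struct vfio_device_migration_info */
+  __u64 read_field(int device_fd, off_t mig_region_offset,
+                   size_t field_offset)
+  {
+          __u64 val = 0;
+
+          pread(device_fd, &val, sizeof(val),
+                mig_region_offset + field_offset);
+          return val;
+  }
+
+  /* Pull out whatever migration data the device has made available */
+  void drain_data_window(int device_fd, off_t mig_region_offset,
+                         void (*save_buffer)(void *data, __u64 len))
+  {
+          __u64 pending, data_offset, data_size;
+          void *buf;
+
+          for (;;) {
+                  pending = read_field(device_fd, mig_region_offset,
+                          offsetof(struct vfio_device_migration_info,
+                                   pending_bytes));
+                  if (!pending)
+                          break;
+
+                  data_offset = read_field(device_fd, mig_region_offset,
+                          offsetof(struct vfio_device_migration_info,
+                                   data_offset));
+                  data_size = read_field(device_fd, mig_region_offset,
+                          offsetof(struct vfio_device_migration_info,
+                                   data_size));
+
+                  buf = malloc(data_size);
+                  pread(device_fd, buf, data_size,
+                        mig_region_offset + data_offset);
+                  save_buffer(buf, data_size);
+                  free(buf);
+          }
+  }
+
+A VMM in pre-copy would typically bound this loop rather than wait for
+pending_bytes to reach zero; only in SAVING | !RUNNING does it need to
+drain until pending_bytes reads zero.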
+
+In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the
+device back to device_state RUNNING. When a migration driver executes this
+ioctl it should discard the data window and set device_state to RUNNING.
+This must happen even if the device_state has errored. A freshly opened
+device FD should always be in the RUNNING state.
+
+The migration driver has limitations on what device state it can affect.
+Any device state controlled by general kernel subsystems must not be
+changed during RESUMING, and SAVING must tolerate mutation of this state.
+Changes to externally controlled device state can happen at any time,
+asynchronously to the migration (e.g. interrupt rebalancing).
+
+Some examples of externally controlled state:
+ - MSI-X interrupt page
+ - MSI/legacy interrupt configuration
+ - Large parts of the PCI configuration space, e.g. the common control
+   bits
+ - PCI power management
+ - Changes via VFIO_DEVICE_SET_IRQS
+
+During !RUNNING, especially during SAVING and RESUMING, the device may
+have limitations on what it can tolerate. An ideal device will
+discard/return all ones to all incoming MMIO/PIO operations (exclusive of
+the external state above) while !RUNNING. However, devices are free to
+have undefined behavior if they receive MMIOs. This includes
+corrupting/aborting the migration, dirtying pages, and segfaulting
+userspace.
+
+However, a device may not compromise system integrity if it is subjected
+to an MMIO. It cannot trigger an error TLP, it cannot trigger a Machine
+Check, and it cannot compromise device isolation.
+
+There are several edge cases that userspace should keep in mind when
+implementing migration:
+
+- Device Peer to Peer DMA. In this case devices are able to issue DMAs to
+  each other's MMIO regions. The VMM can permit this if it maps the MMIO
+  memory into the IOMMU.
+
+  As Peer to Peer DMA is an MMIO touch like any other, it is important
+  that userspace suspend these accesses before entering any device_state
+  where MMIO is not permitted, such as !RUNNING. This can be accomplished
+  with the NDMA state (an illustrative sketch follows this list).
+  Userspace may also choose to remove MMIO mappings from the IOMMU if the
+  device does not support NDMA, and rely on that to guarantee quiet MMIO.
+
+  The P2P Grace States exist so that all devices may reach RUNNING before
+  any device is subjected to an MMIO access.
+
+  Failure to guarantee quiet MMIO may allow a hostile VM to use P2P to
+  violate the no-MMIO restriction during SAVING and corrupt the migration
+  on devices that cannot protect themselves.
+
+- IOMMU Page faults handled in userspace can occur at any time. A
+  migration driver is not required to serialize in-progress page faults.
+  It can assume that all page faults are completed before entering
+  SAVING | !RUNNING. Since the guest VCPUs are required to complete page
+  faults the VMM can accomplish this by asserting NDMA | VCPU RUNNING and
+  clearing all pending page faults before clearing VCPU RUNNING.
+
+  Devices that do not support NDMA cannot be configured to generate page
+  faults that require the VCPUs to complete.
+
+- pre-copy allows the device to implement a dirty log for its internal
+  state. During the SAVING | RUNNING state the data window should present
+  the device state being logged, and during SAVING | !RUNNING the data
+  window should present the unlogged device state as well as the changes
+  from the internal dirty log.
+
+  On RESUMING these two data streams are concatenated together.
+
+  pre-copy is only concerned with internal device state. External DMAs are
+  covered by the DIRTY TRACKING function.
+
+- Atomic Read and Clear of the DMA log is a hardware feature. If the
+  tracker cannot support this, then NDMA could be used to synthesize it
+  less efficiently.
+
+- NDMA is optional. If the device does not support it then the NDMA states
+  are pushed down to the next step in the sequence and the various
+  behaviors that rely on NDMA cannot be used.
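+
+As an illustration only, the Peer to Peer grace state on the saving side
+could be sequenced across devices as below, reusing the hypothetical
+set_device_state() helper sketched earlier. VFIO_DEVICE_STATE_NDMA is the
+bit value proposed in this cover letter and is not yet part of
+include/uapi/linux/vfio.h::
+
+  #include <sys/types.h>
+  #include <linux/vfio.h>
+
+  /* Proposed in this series, not yet in the uAPI header */
+  #ifndef VFIO_DEVICE_STATE_NDMA
+  #define VFIO_DEVICE_STATE_NDMA (1 << 3)
+  #endif
+
+  /* The helper from the earlier sketch */
+  int set_device_state(int device_fd, off_t mig_region_offset,
+                       __u32 new_state);
+
+  /*
+   * Stop a set of devices that may be doing P2P DMA to each other. No
+   * device may leave RUNNING until every device has stopped issuing DMA,
+   * otherwise a still running peer could touch the MMIO of a !RUNNING
+   * device.
+   */
+  void stop_all_devices(int *device_fds, off_t *mig_offsets, int ndevices)
+  {
+          int i;
+
+          /* P2P grace state: DMA quiesced, every device still RUNNING */
+          for (i = 0; i < ndevices; i++)
+                  set_device_state(device_fds[i], mig_offsets[i],
+                                   VFIO_DEVICE_STATE_SAVING |
+                                   VFIO_DEVICE_STATE_NDMA |
+                                   VFIO_DEVICE_STATE_RUNNING);
+
+          /* Only now is it safe to take each device out of RUNNING */
+          for (i = 0; i < ndevices; i++)
+                  set_device_state(device_fds[i], mig_offsets[i],
+                                   VFIO_DEVICE_STATE_SAVING);
+  }
+
+The resume side mirrors this: every device is brought to NDMA | RUNNING so
+that none of them issues P2P DMA before all of them can accept MMIO, and
+only then is NDMA cleared on all of them.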
+
+TBD - discoverable feature flag for NDMA
+TBD - IMS xlation
+TBD - PASID xlation
 
 VFIO bus driver API
 -------------------------------------------------------------------------------

base-commit: ae0351a976d1880cf152de2bc680f1dff14d9049
-- 
2.33.1