[PATCH v3] vfio: Documentation for the migration region

Provide more complete documentation for the migration region's behavior,
specifically focusing on the device_state bits and the whole system view
from a VMM.

To: Alex Williamson <alex.williamson@xxxxxxxxxx>
Cc: Cornelia Huck <cohuck@xxxxxxxxxx>
Cc: kvm@xxxxxxxxxxxxxxx
Cc: Max Gurtovoy <mgurtovoy@xxxxxxxxxx>
Cc: Kirti Wankhede <kwankhede@xxxxxxxxxx>
Cc: Yishai Hadas <yishaih@xxxxxxxxxx>
Cc: Shameer Kolothum <shameerali.kolothum.thodi@xxxxxxxxxx>
Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxx>
---
 Documentation/driver-api/vfio.rst | 301 +++++++++++++++++++++++++++++-
 1 file changed, 300 insertions(+), 1 deletion(-)

v3:
 - s/migration_state/device_state
 - Redo how the migration data moves to better capture how pre-copy works
 - Require entry to RESUMING to always succeed, without prior reset
 - Move SAVING | RUNNING to after RUNNING in the precedence list
 - Reword the discussion of devices that have migration control registers
   in the same function
v2: https://lore.kernel.org/r/0-v2-45a95932a4c6+37-vfio_mig_doc_jgg@xxxxxxxxxx
 - RST fixups for sphinx rendering
 - Include the priority order for multi-bit changes
 - Add a small discussion on devices like hns with migration control inside
   the same function as is being migrated.
 - Language cleanups from v1, the diff says almost every line was touched in some way
v1: https://lore.kernel.org/r/0-v1-0ec87874bede+123-vfio_mig_doc_jgg@xxxxxxxxxx

diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst
index c663b6f978255b..2ff47823a889b4 100644
--- a/Documentation/driver-api/vfio.rst
+++ b/Documentation/driver-api/vfio.rst
@@ -242,7 +242,306 @@ group and can access them as follows::
 VFIO User API
 -------------------------------------------------------------------------------
 
-Please see include/linux/vfio.h for complete API documentation.
+Please see include/uapi/linux/vfio.h for complete API documentation.
+
+-------------------------------------------------------------------------------
+
+VFIO migration driver API
+-------------------------------------------------------------------------------
+
+VFIO drivers that support migration implement a migration control register
+called device_state in struct vfio_device_migration_info, which is located in
+the device's VFIO_REGION_TYPE_MIGRATION region.
+
+The device_state controls both device action and continuous behavior.
+Setting/clearing bit groups triggers device action, and each bit controls a
+continuous device behavior.
+
+Along with the device_state, the migration driver provides a data window
+which allows streaming migration data into or out of the device. The entire
+migration data, up to the end of stream, must be transported from the saving
+side to the resuming side.
+
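+For illustration, a minimal user-space helper for writing device_state might
+look like the following sketch. It assumes the offset of the migration region
+within the device FD has already been located with VFIO_DEVICE_GET_REGION_INFO;
+error handling is omitted::
+
+  #include <linux/vfio.h>
+  #include <stddef.h>
+  #include <unistd.h>
+
+  /* device_state sits at the start of struct vfio_device_migration_info,
+   * which is at the start of the migration region */
+  static int set_device_state(int device_fd, off_t region_offset, __u32 state)
+  {
+          off_t pos = region_offset +
+                      offsetof(struct vfio_device_migration_info, device_state);
+
+          if (pwrite(device_fd, &state, sizeof(state), pos) != sizeof(state))
+                  return -1;
+          return 0;
+  }
+
+A failed write leaves the device either in its prior state or in ERROR, as
+described below; user-space can re-read device_state to tell the two apart.
+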
+A lot of flexibility is provided to user-space in how it operates these
+bits. What follows is a reference flow for saving device state in a live
+migration, with all features, and an illustration of how the other external,
+non-VFIO entities the VMM controls (VCPU_RUNNING and DIRTY_TRACKING) fit in.
+
+  RUNNING, VCPU_RUNNING
+     Normal operating state
+  RUNNING, DIRTY_TRACKING, VCPU_RUNNING
+     Log DMAs
+
+     Stream all memory
+  SAVING | RUNNING, DIRTY_TRACKING, VCPU_RUNNING
+     Log internal device changes (pre-copy)
+
+     Stream device state through the migration window
+
+     While in this state repeat as desired:
+
+	Atomic Read and Clear DMA Dirty log
+
+	Stream dirty memory
+  SAVING | NDMA | RUNNING, VCPU_RUNNING
+     vIOMMU grace state
+
+     Complete all in-progress IO page faults, idle the vIOMMU
+  SAVING | NDMA | RUNNING
+     Peer to Peer DMA grace state
+
+     Final snapshot of DMA dirty log (atomic not required)
+  SAVING
+     Stream final device state through the migration window
+
+     Copy final dirty data
+  0
+     Device is halted
+
+and the reference flow for resuming:
+
+  RUNNING
+     Use ioctl(VFIO_GROUP_GET_DEVICE_FD) to obtain a fresh device
+  RESUMING
+     Push in migration data.
+  NDMA | RUNNING
+     Peer to Peer DMA grace state
+  RUNNING, VCPU_RUNNING
+     Normal operating state
+
+If the VMM has multiple VFIO devices undergoing migration then the grace
+states act as cross device synchronization points. The VMM must bring all
+devices to the grace state before advancing past it.
+
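+As a sketch, and using a hypothetical VFIO_DEVICE_STATE_NDMA bit alongside
+the existing VFIO_DEVICE_STATE_RUNNING/SAVING/RESUMING values (NDMA is not
+yet part of the uAPI), the saving side of the reference flow reduces to a
+series of device_state writes interleaved with the VMM's own VCPU and dirty
+tracking control::
+
+  /* Illustrative only; set_device_state() as sketched above, fd and off are
+   * the device FD and the migration region offset */
+  set_device_state(fd, off, VFIO_DEVICE_STATE_SAVING |
+                            VFIO_DEVICE_STATE_RUNNING);  /* start pre-copy */
+  /* ... repeatedly stream device pre-copy data and dirty memory ... */
+  set_device_state(fd, off, VFIO_DEVICE_STATE_SAVING |
+                            VFIO_DEVICE_STATE_NDMA |
+                            VFIO_DEVICE_STATE_RUNNING);  /* grace states */
+  /* ... idle the vIOMMU, stop VCPUs, take the final dirty log snapshot ... */
+  set_device_state(fd, off, VFIO_DEVICE_STATE_SAVING);   /* stop and copy */
+  /* ... stream the final device state through the migration window ... */
+  set_device_state(fd, off, 0);                          /* device halted */
+
+Note that the two grace states share the same device_state value; only the
+external VCPU_RUNNING entity changes between them.
+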
+The above reference flows are built around specific requirements on the
+migration driver for its implementation of the device_state input.
+
+The device_state cannot change asynchronously. Upon writing the
+device_state, the driver will either keep the current state and return
+failure, return failure and go to ERROR, or succeed and move to the new state.
+
+Event-triggered actions happen when user-space requests a new device_state
+that differs from the current device_state. Actions happen on a bit group
+basis:
+
+ SAVING
+   The device clears the data window and prepares to stream migration data.
+   All of the data from the start of SAVING to the end of stream is
+   transferred to the other side to execute a resume (a read loop over the
+   data window is sketched after this list).
+
+ SAVING | RUNNING
+   The device begins streaming 'pre-copy' migration data through the window.
+
+   A device that does not support internal state logging should return a 0
+   length stream.
+
+   The migration window may reach an end of stream; this can be a permanent
+   or temporary condition.
+
+   User space can move to SAVING | !RUNNING at any time; any in-progress
+   transfer through the migration window is carried forward.
+
+   This allows the device to implement a dirty log for its internal state.
+   During this state the data window should present the device state being
+   logged, and during SAVING | !RUNNING the data window should transfer the
+   dirtied state and conclude the migration data.
+
+   This state is only concerned with internal device state. External DMAs
+   are covered by the separate DIRTY_TRACKING function.
+
+ SAVING | !RUNNING
+   The device captures its internal state and streams it through the
+   migration window.
+
+   When the migration window reaches an end of stream the saving is concluded
+   and there is no further data. All of the migration data streamed from the
+   time SAVING starts to this final end of stream is concatenated together
+   and provided to RESUMING.
+
+   Devices that cannot log internal state changes stream all their device
+   state here.
+
+ RESUMING
+   The data window is cleared, opened, and can receive the migration data
+   stream. Entry to RESUMING must always succeed; the driver may internally
+   reset the device to achieve this.
+
+ !RESUMING
+   All the data transferred into the data window is loaded into the device's
+   internal state.
+
+   The internal state of a device is undefined while in RESUMING. To abort
+   RESUMING and return to a known state, issue a VFIO_DEVICE_RESET.
+
+   If the migration data is invalid then the ERROR state must be set.
+
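+The shape of the data window traffic while SAVING, driven by the
+pending_bytes, data_offset and data_size fields of struct
+vfio_device_migration_info, is sketched below. This is an illustration only;
+error handling is omitted, fd and off are the device FD and migration region
+offset as in the earlier sketches, and save_chunk() stands in for however the
+VMM forwards bytes into its migration stream::
+
+  #define FIELD(f) (off + offsetof(struct vfio_device_migration_info, f))
+
+  __u64 pending, data_offset, data_size;
+
+  pread(fd, &pending, sizeof(pending), FIELD(pending_bytes));
+  while (pending) {
+          pread(fd, &data_offset, sizeof(data_offset), FIELD(data_offset));
+          pread(fd, &data_size, sizeof(data_size), FIELD(data_size));
+
+          /* Copy data_size bytes of migration data out of the region,
+           * starting at data_offset */
+          save_chunk(fd, off + data_offset, data_size);
+
+          /* Re-reading pending_bytes tells the driver the previous chunk
+           * was consumed and asks for the next one.  During SAVING | RUNNING
+           * a result of 0 may be temporary; during SAVING | !RUNNING it is
+           * the end of stream. */
+          pread(fd, &pending, sizeof(pending), FIELD(pending_bytes));
+  }
+
+The resume side is the mirror image: while in RESUMING the received stream is
+written back through the data window in the same chunked fashion.
+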
+Continuous actions are in effect when device_state bit groups are active:
+
+ RUNNING | NDMA
+   The device is not allowed to issue new DMA operations.
+
+   Whenever the kernel returns with NDMA set in the device_state there can
+   be no in-progress DMAs.
+
+ !RUNNING
+   The device should not change its internal state. This further implies the
+   NDMA behavior above.
+
+ SAVING | !RUNNING or RESUMING | !RUNNING
+   The device may assume there are no incoming MMIO operations.
+
+   Internal state logging can stop.
+
+ RUNNING
+   The device can alter its internal state and must respond to incoming MMIO.
+
+ SAVING | RUNNING
+   The device is logging changes to the internal state.
+
+ ERROR
+   The behavior of the device is largely undefined. The device must be
+   recovered by issuing VFIO_DEVICE_RESET or closing the device file
+   descriptor.
+
+   However, devices supporting NDMA must behave as though NDMA is asserted
+   during ERROR to avoid corrupting other devices or a VM during a failed
+   migration.
+
+When multiple bits change in the device_state they may describe multiple
+event-triggered actions, and multiple changes to continuous actions. The
+migration driver must process the new device_state bits in the following
+priority order (a driver side sketch follows the list):
+
+ - NDMA
+ - !RUNNING
+ - SAVING | !RUNNING
+ - RESUMING
+ - !RESUMING
+ - RUNNING
+ - SAVING | RUNNING
+ - !NDMA
+
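+For a migration driver, honoring this ordering when several bits flip in a
+single write might look like the sketch below. The mydev_*() callbacks and
+struct mydev are hypothetical stand-ins for device specific code,
+SAVING/RUNNING/RESUMING abbreviate the VFIO_DEVICE_STATE_* values, and NDMA
+stands for a proposed bit that is not yet part of the uAPI::
+
+  static int mydev_change_state(struct mydev *mydev, u32 old, u32 new)
+  {
+          u32 set = new & ~old;
+          u32 cleared = old & ~new;
+          bool enter_stop_copy =
+                  (new & (SAVING | RUNNING)) == SAVING &&
+                  (old & (SAVING | RUNNING)) != SAVING;
+          bool enter_pre_copy =
+                  (new & (SAVING | RUNNING)) == (SAVING | RUNNING) &&
+                  (old & (SAVING | RUNNING)) != (SAVING | RUNNING);
+
+          if (set & NDMA)
+                  mydev_quiesce_dma(mydev);        /* NDMA */
+          if (cleared & RUNNING)
+                  mydev_freeze(mydev);             /* !RUNNING */
+          if (enter_stop_copy)
+                  mydev_capture_state(mydev);      /* SAVING | !RUNNING */
+          if (set & RESUMING)
+                  mydev_open_resume_window(mydev); /* RESUMING */
+          if (cleared & RESUMING)
+                  mydev_load_state(mydev);         /* !RESUMING */
+          if (set & RUNNING)
+                  mydev_unfreeze(mydev);           /* RUNNING */
+          if (enter_pre_copy)
+                  mydev_start_precopy(mydev);      /* SAVING | RUNNING */
+          if (cleared & NDMA)
+                  mydev_resume_dma(mydev);         /* !NDMA */
+          return 0;
+  }
+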
+In general, userspace can issue a VFIO_DEVICE_RESET ioctl to recover the
+device back to device_state RUNNING. When a migration driver executes this
+ioctl it should discard the data window and set device_state to RUNNING as
+part of resetting the device to a clean state. This must happen even if the
+device_state is ERROR. A freshly opened device FD should always be in the
+RUNNING state.
+
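+For example, aborting an in-progress RESUMING, or recovering from ERROR, is
+simply the following sketch, with error handling reduced to a comment::
+
+  if (ioctl(device_fd, VFIO_DEVICE_RESET) != 0) {
+          /* Last resort: close the device FD and obtain a new one */
+          close(device_fd);
+  }
+  /* On success device_state is RUNNING and the data window is discarded */
+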
+The migration driver has limitations on what device state it can affect. Any
+device state controlled by general kernel subsystems must not be changed
+during RESUMING, and SAVING must tolerate mutation of this state. Changes to
+externally controlled device state can happen at any time, asynchronously to
+the migration (e.g. interrupt rebalancing).
+
+Some examples of externally controlled state:
+
+ - MSI-X interrupt page
+ - MSI/legacy interrupt configuration
+ - Large parts of the PCI configuration space, e.g. common control bits
+ - PCI power management
+ - Changes via VFIO_DEVICE_SET_IRQS
+
+During !RUNNING, especially during SAVING and RESUMING, the device may have
+limitations on what it can tolerate. An ideal device will discard/return all
+ones to all incoming MMIO, PIO, or equivalent operations (exclusive of the
+external state above) in !RUNNING. However, devices are free to have undefined
+behavior if they receive incoming operations. This includes
+corrupting/aborting the migration, dirtying pages, and segfaulting user-space.
+
+However, a device may not compromise system integrity if it is subjected to
+an MMIO access. It cannot trigger an error TLP, it cannot trigger an x86
+Machine Check or similar, and it cannot compromise device isolation.
+
+There are several edge cases that user-space should keep in mind when
+implementing migration:
+
+- Device Peer to Peer DMA. In this case devices are able to issue DMAs to
+  each other's MMIO regions. The VMM can permit this if it maps the MMIO
+  memory into the IOMMU.
+
+  As Peer to Peer DMA is an MMIO touch like any other, it is important that
+  userspace suspend these accesses before entering any device_state where
+  MMIO is not permitted, such as !RUNNING. This can be accomplished with the
+  NDMA state. Userspace may also choose to never install MMIO mappings into
+  the IOMMU if devices do not support NDMA and rely on that to guarantee
+  quiet MMIO.
+
+  The Peer to Peer Grace States exist so that all devices may reach RUNNING
+  before any device is subjected to an MMIO access.
+
+  Failure to guarantee quiet MMIO may allow a hostile VM to use P2P to violate
+  the no-MMIO restriction during SAVING or RESUMING and corrupt the migration
+  on devices that cannot protect themselves.
+
+- IOMMU Page faults handled in user-space can occur at any time. A migration
+  driver is not required to serialize in-progress page faults. It can assume
+  that all page faults are completed before entering SAVING | !RUNNING. Since
+  the guest VCPU is required to complete page faults the VMM can accomplish
+  this by asserting NDMA | VCPU_RUNNING and clearing all pending page faults
+  before clearing VCPU_RUNNING.
+
+  Devices that do not support NDMA cannot be configured to generate page
+  faults that require the VCPU to complete.
+
+- Atomic Read and Clear of the DMA log is a HW feature. If the tracker
+  cannot support this, then NDMA could be used to synthesize it less
+  efficiently.
+
+- NDMA is optional. If the device does not support this then the NDMA States
+  are pushed down to the next step in the sequence and various behaviors that
+  rely on NDMA cannot be used.
+
+  NDMA is made optional to support simple HW implementations that either just
+  cannot do NDMA, or cannot do NDMA without a performance cost. NDMA is only
+  necessary for special features like P2P and PRI, so devices can omit it in
+  exchange for limitations on the guest.
+
+- Devices that have their HW migration control MMIO registers inside the same
+  iommu_group as the VFIO device have some special considerations. In this
+  case a driver will be operating HW registers from kernel space that are
+  also subject to userspace-controlled DMA due to the iommu_group.
+
+  This immediately raises a security concern as user-space can use Peer to
+  Peer DMA to manipulate these migration control registers concurrently with
+  any kernel actions.
+
+  A device driver operating such a device must ensure that kernel integrity
+  cannot be broken by hostile user space operating the migration MMIO
+  registers via peer to peer DMA at any point in the sequence. Further, the
+  kernel cannot use DMA to transfer any migration data.
+
+  However, as discussed above in the "Device Peer to Peer DMA" section, it can
+  assume quiet MMIO as a condition to have a successful and uncorrupted
+  migration.
+
+To elaborate on the reference flows, they assume the following behaviors
+from the external, non-VFIO entities:
+
+ !VCPU_RUNNING
+   User-space must not generate dirty pages or issue MMIO, PIO or equivalent
+   operations to devices.  For a VMM this would typically be controlled by
+   KVM.
+
+ DIRTY_TRACKING
+   Clear the DMA log and start DMA logging
+
+   DMA logs should be readable with an "atomic test and clear" to allow
+   continuous non-disruptive sampling of the log.
+
+   This is controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the container
+   fd; a start/sample/stop sketch follows this list.
+
+ !DIRTY_TRACKING
+   Freeze the DMA log, stop tracking and allow user-space to read it.
+
+   If user-space is going to have any use of the dirty log it must ensure that
+   all DMA is suspended before clearing DIRTY_TRACKING, for instance by using
+   NDMA or !RUNNING on all VFIO devices.
+
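+A sketch of driving DIRTY_TRACKING through the existing type1 dirty page
+tracking uAPI on the container fd is below. This is an illustration only;
+iova/size describe the IOVA range being queried, out_bitmap is a __u64 array
+the VMM provides with bitmap_bytes as its size, and error handling is
+omitted::
+
+  #include <linux/vfio.h>
+  #include <sys/ioctl.h>
+
+  struct vfio_iommu_type1_dirty_bitmap track = {
+          .argsz = sizeof(track),
+          .flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START,
+  };
+
+  /* DIRTY_TRACKING */
+  ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &track);
+
+  /* Sample the log for a range while tracking is on */
+  __u8 buf[sizeof(struct vfio_iommu_type1_dirty_bitmap) +
+           sizeof(struct vfio_iommu_type1_dirty_bitmap_get)] = {};
+  struct vfio_iommu_type1_dirty_bitmap *get = (void *)buf;
+  struct vfio_iommu_type1_dirty_bitmap_get *range = (void *)get->data;
+
+  get->argsz = sizeof(buf);
+  get->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
+  range->iova = iova;
+  range->size = size;
+  range->bitmap.pgsize = 4096;          /* page granule of the bitmap */
+  range->bitmap.size = bitmap_bytes;    /* bytes available in out_bitmap */
+  range->bitmap.data = out_bitmap;
+  ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, get);
+
+  /* !DIRTY_TRACKING */
+  track.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
+  ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &track);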
+
+TBD - discoverable feature flag for NDMA
+TBD - IMS translation
+TBD - PASID translation
 
 VFIO bus driver API
 -------------------------------------------------------------------------------

base-commit: ae0351a976d1880cf152de2bc680f1dff14d9049
-- 
2.34.1



