On Wed, Nov 24, 2021 at 05:55:49PM +0100, Cornelia Huck wrote:

> Yes, defining what we mean by "VCPU RUNNING" and "DIRTY TRACKING" first
> makes the most sense.
>
> (It also imposes some rules on userspace, doesn't it? Whatever it does,
> the interaction with vfio needs to be at least somewhat similar to what
> QEMU or another VMM would do. I wonder if we need to be more concrete
> here; but let's talk about the basic interface first.)

I don't think we need to have excessive precision here. The main
thrust of this as a spec is to define behaviors, which starts at the
'Actions on Set/Clear' section. This part is informative so everyone
has the same picture in their mind about what it is we are trying to
accomplish. This can be a bit imprecise.

> > I don't think I like this statement - why/where would the overall flow
> > differ?
>
> What I meant to say: If we give userspace the flexibility to operate
> this, we also must give different device types some flexibility. While
> subchannels will follow the general flow, they'll probably condense/omit
> some steps, as I/O is quite different to PCI there.

I would say no - migration is general, no device type should get to
violate this spec. Did you have something specific in mind? There is
very little PCI specific here already.

> >> > +  Normal operating state
> >> > + RUNNING, DIRTY TRACKING, VCPU RUNNING
> >> > +  Log DMAs
> >> > +  Stream all memory
> >>
> >> all memory accessed by the device?
> >
> > In this reference flow this is all VCPU memory. Ie you start global
> > dirty tracking both in VFIO and in the VCPU and start copying all VM
> > memory.
>
> So, general migration, not just the vfio specific parts?

Sure, as above precision isn't important here, the userspace doing
migration should start streaming whatever state it has covered by
dirty logging here.

> "subtly complicated" captures this well :(

Indeed.
Frankly, my observation is the team here has invested a lot of person
hours trying to make sense of this and our well-researched take 'this
is a FSM' was substantially different from Alex's version 'this is
control bits'. For the 'control bit' model few seem to understand it
at all, and the driver code is short but deceptively complicated.

> For example, if I interpret your list correctly, the driver should
> prioritize clearing RUNNING over setting SAVING | !RUNNING. What does
> that mean? If RUNNING is cleared, first deal with whatever action that
> triggers, then later check if it is actually a case of setting SAVING |
> !RUNNING, and perform the required actions for that?

Yes.

Since this is not a FSM a change between any two valid device_state
values is completely legal. Many of these involve multiple driver
steps. So all drivers must do the actions in the same order to have a
real ABI.

> Also, does e.g. SAVING | RUNNING mean that both SAVING and RUNNING are
> getting set, or only one of them, if the other was already set?

It always refers to the requested migration_state

> > SAVING|0 -> SAVING|RUNNING
> > 0|RUNNING -> SAVING|RUNNING
> > 0 -> SAVING|RUNNING

Are all described as userspace requesting a migration_state
of SAVING | RUNNING

> > For clarity I didn't split things like that. All the continuous
> > behaviors start when the given bits begins and stop when the bits
> > end.
> >
> > Most of the actions talk about changes in the data window
>
> This might need some better terminology, I did not understand the split
> like that...
>
> "action trigger" is basically that the driver sets certain bits and a
> certain device action happens. "continuous" means that a certain device
> action is done as long as certain bits are set. Sounds a bit like edge
> triggered vs level triggered to me.
> What about:

Yes

> - event-triggered actions: bits get set/unset, an action needs to be
>   done

"""Event-triggered actions happen when userspace requests a new
migration_state that differs from the current migration_state. Actions
happen on a bit group basis:"""

> - condition-triggered actions: as long as bits are set/unset, an action
>   needs to be done

"""Continuous actions are in effect so long as the below
migration_state bit group is active:"""

> >> What does that mean? That the operation setting NDMA in device_state
> >> returns?
> >
> > Yes, it must be a synchronous behavior.
>
> To be extra clear: the _setting_ action (e.g. a write), not the
> condition (NDMA set)? Sorry if that sounds nitpicky, but I think we
> should eliminate possible points of confusion early on.

"""Whenever the kernel returns with a migration_state of NDMA there can
be no in progress DMAs."""

> I'm trying to understand this document without looking at the mlx5
> implementation: Somebody using it as a guide needs to be able to
> implement a driver without looking at another driver (unless they prefer
> to work with examples.) Using the mlx5 driver as the basis for
> _writing_ this document makes sense, but it needs to stand on its own.

That may be an ideal that is too hard to reach :(

Thanks,
Jason

Below is where I have things now:

VFIO migration driver API
-------------------------------------------------------------------------------

VFIO drivers that support migration implement a migration control register
called device_state in the struct vfio_device_migration_info which is in its
VFIO_REGION_TYPE_MIGRATION region.

The device_state controls both device action and continuous behaviour.
Setting/clearing bit groups triggers device action, and each bit controls a
continuous device behaviour.

Along with the device_state the migration driver provides a data window which
allows streaming migration data into or out of the device.

A lot of flexibility is provided to userspace in how it operates these bits.
What follows is a reference flow for saving device state in a live migration,
with all features, and an illustration of how other external non-VFIO
entities (VCPU_RUNNING and DIRTY_TRACKING) the VMM controls fit in.

  RUNNING, VCPU_RUNNING
     Normal operating state
  RUNNING, DIRTY_TRACKING, VCPU_RUNNING
     Log DMAs
     Stream all memory
  SAVING | RUNNING, DIRTY_TRACKING, VCPU_RUNNING
     Log internal device changes (pre-copy)
     Stream device state through the migration window

     While in this state repeat as desired:
       Atomic Read and Clear DMA Dirty log
       Stream dirty memory
  SAVING | NDMA | RUNNING, VCPU_RUNNING
     vIOMMU grace state
     Complete all in progress IO page faults, idle the vIOMMU
  SAVING | NDMA | RUNNING
     Peer to Peer DMA grace state
     Final snapshot of DMA dirty log (atomic not required)
  SAVING
     Stream final device state through the migration window
     Copy final dirty data
  0
     Device is halted

and the reference flow for resuming:

  RUNNING
     Issue VFIO_DEVICE_RESET to clear the internal device state
  0
     Device is halted
  RESUMING
     Push in migration data. Data captured during pre-copy should be
     prepended to data captured during SAVING.
  NDMA | RUNNING
     Peer to Peer DMA grace state
  RUNNING, VCPU_RUNNING
     Normal operating state

If the VMM has multiple VFIO devices undergoing migration then the grace
states act as cross device synchronization points. The VMM must bring all
devices to the grace state before advancing past it.

The above reference flows are built around specific requirements on the
migration driver for its implementation of the migration_state input.

Event triggered actions happen when userspace requests a new migration_state
that differs from the current migration_state. Actions happen on a bit group
basis:

 - SAVING | RUNNING
   The device clears the data window and begins streaming 'pre copy' migration
   data through the window. Devices that cannot log internal state changes
   return a 0 length migration stream.

 - SAVING | !RUNNING
   The device captures its internal state that is not covered by internal
   logging, as well as any logged changes.

   The device clears the data window and begins streaming the captured
   migration data through the window. Devices that cannot log internal state
   changes stream all of their device state here.

 - RESUMING
   The data window is cleared, opened and can receive the migration data
   stream.

 - !RESUMING
   All the data transferred into the data window is loaded into the device's
   internal state. The migration driver can rely on userspace issuing a
   VFIO_DEVICE_RESET prior to starting RESUMING.

   To abort a RESUMING issue a VFIO_DEVICE_RESET.

   If the migration data is invalid then the ERROR state must be set.

Continuous actions are in effect when migration_state bit groups are active:

 - RUNNING | NDMA
   The device is not allowed to issue new DMA operations.

   Whenever the kernel returns with a migration_state of NDMA there can be no
   in progress DMAs.

 - !RUNNING
   The device should not change its internal state. Further implies the NDMA
   behavior above.

 - SAVING | !RUNNING
   RESUMING | !RUNNING
   The device may assume there are no incoming MMIO operations.

   Internal state logging can stop.

 - RUNNING
   The device can alter its internal state and must respond to incoming MMIO.

 - SAVING | RUNNING
   The device is logging changes to the internal state.

 - ERROR
   The behavior of the device is largely undefined. The device must be
   recovered by issuing VFIO_DEVICE_RESET or closing the device file
   descriptor.

   However, devices supporting NDMA must behave as though NDMA is asserted
   during ERROR to avoid corrupting other devices or a VM during a failed
   migration.

When multiple bits change in the migration_state they may describe multiple
event triggered actions, and multiple changes to continuous actions.
The migration driver must process them in a priority order:

 - SAVING | RUNNING
 - NDMA
 - !RUNNING
 - SAVING | !RUNNING
 - RESUMING
 - !RESUMING
 - RUNNING
 - !NDMA

In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the
device back to device_state RUNNING. When a migration driver executes this
ioctl it should discard the data window and set migration_state to RUNNING as
part of resetting the device to a clean state. This must happen even if the
migration_state has errored. A freshly opened device FD should always be in
the RUNNING state.

The migration driver has limitations on what device state it can affect. Any
device state controlled by general kernel subsystems must not be changed
during RESUMING, and SAVING must tolerate mutation of this state. Change to
externally controlled device state can happen at any time, asynchronously to
the migration (ie interrupt rebalancing).

Some examples of externally controlled state:
 - MSI-X interrupt page
 - MSI/legacy interrupt configuration
 - Large parts of the PCI configuration space, ie common control bits
 - PCI power management
 - Changes via VFIO_DEVICE_SET_IRQS

During !RUNNING, especially during SAVING and RESUMING, the device may have
limitations on what it can tolerate. An ideal device will discard/return all
ones to all incoming MMIO/PIO operations (exclusive of the external state
above) in !RUNNING. However, devices are free to have undefined behavior if
they receive MMIOs. This includes corrupting/aborting the migration, dirtying
pages, and segfaulting userspace.

However, a device may not compromise system integrity if it is subjected to a
MMIO. It can not trigger an error TLP, it can not trigger a Machine Check, and
it can not compromise device isolation.

There are several edge cases that userspace should keep in mind when
implementing migration:

 - Device Peer to Peer DMA. In this case devices are able to issue DMAs to
   each other's MMIO regions.
   The VMM can permit this if it maps the MMIO memory into the IOMMU.

   As Peer to Peer DMA is a MMIO touch like any other, it is important that
   userspace suspend these accesses before entering any device_state where
   MMIO is not permitted, such as !RUNNING. This can be accomplished with the
   NDMA state. Userspace may also choose to remove MMIO mappings from the
   IOMMU if the device does not support NDMA, and rely on that to guarantee
   quiet MMIO.

   The Peer to Peer Grace States exist so that all devices may reach RUNNING
   before any device is subjected to a MMIO access.

   Failure to guarantee quiet MMIO may allow a hostile VM to use P2P to
   violate the no-MMIO restriction during SAVING or RESUMING and corrupt the
   migration on devices that cannot protect themselves.

 - IOMMU Page faults handled in userspace can occur at any time. A migration
   driver is not required to serialize in-progress page faults. It can assume
   that all page faults are completed before entering SAVING | !RUNNING. Since
   the guest VCPU is required to complete page faults the VMM can accomplish
   this by asserting NDMA | VCPU_RUNNING and clearing all pending page faults
   before clearing VCPU_RUNNING.

   Devices that do not support NDMA cannot be configured to generate page
   faults that require the VCPU to complete.

 - pre-copy allows the device to implement a dirty log for its internal state.
   During the SAVING | RUNNING state the data window should present the device
   state being logged and during SAVING | !RUNNING the data window should
   present the unlogged device state as well as the changes from the internal
   dirty log.

   On RESUME these two data streams are concatenated together.

   pre-copy is only concerned with internal device state. External DMAs are
   covered by the separate DIRTY_TRACKING function.

 - Atomic Read and Clear of the DMA log is a HW feature. If the tracker
   cannot support this, then NDMA could be used to synthesize it less
   efficiently.

 - NDMA is optional. If the device does not support this then the NDMA States
   are pushed down to the next step in the sequence and various behaviors that
   rely on NDMA cannot be used.

 - Migration control registers inside the same iommu_group as the VFIO device.
   This immediately raises a security concern as userspace can use Peer to
   Peer DMA to manipulate these migration control registers concurrently with
   any kernel actions.

   A device driver operating such a device must ensure that kernel integrity
   can not be broken by hostile user space operating the migration MMIO
   registers via peer to peer, at any point in the sequence. Notably the
   kernel cannot use DMA to transfer any migration data.

   However, as discussed above in the "Device Peer to Peer DMA" section, it
   can assume quiet MMIO as a condition to have a successful and uncorrupted
   migration.

To elaborate details on the reference flows, they assume the following details
about the external behaviors:

 - !VCPU_RUNNING
   Userspace must not generate dirty pages or issue MMIO operations to
   devices. For a VMM this would typically be a control toward KVM.

 - DIRTY_TRACKING
   Clear the DMA log and start DMA logging.

   DMA logs should be readable with an "atomic test and clear" to allow
   continuous non-disruptive sampling of the log.

   This is controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the container
   fd.

 - !DIRTY_TRACKING
   Freeze the DMA log, stop tracking and allow userspace to read it.

   If userspace is going to have any use of the dirty log it must ensure that
   all DMA is suspended before clearing DIRTY_TRACKING, for instance by using
   NDMA or !RUNNING on all VFIO devices.
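
For illustration only, here is a minimal C sketch of how a migration driver
might turn a single migration_state change into an ordered list of bit group
steps, following the priority order listed earlier in this document. The bit
values and all names (MIG_*, enum mig_step, mig_plan_steps) are made up for
this example and are not the uAPI definitions. It also assumes that a bit
group step applies when the group is active in the requested state but was
not active in the current one, which is one possible reading of the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values for illustration; not the uAPI definitions. */
enum {
	MIG_RUNNING  = 1u << 0,
	MIG_SAVING   = 1u << 1,
	MIG_RESUMING = 1u << 2,
	MIG_NDMA     = 1u << 3,
};

/* One entry per bit group, declared in the documented priority order. */
enum mig_step {
	STEP_SAVING_RUNNING,
	STEP_NDMA,
	STEP_NOT_RUNNING,
	STEP_SAVING_NOT_RUNNING,
	STEP_RESUMING,
	STEP_NOT_RESUMING,
	STEP_RUNNING,
	STEP_NOT_NDMA,
	STEP_COUNT,
};

struct bit_group {
	uint32_t mask;	/* bits that take part in the group */
	uint32_t value;	/* required value of those bits */
};

static const struct bit_group groups[STEP_COUNT] = {
	[STEP_SAVING_RUNNING]     = { MIG_SAVING | MIG_RUNNING,
				      MIG_SAVING | MIG_RUNNING },
	[STEP_NDMA]               = { MIG_NDMA, MIG_NDMA },
	[STEP_NOT_RUNNING]        = { MIG_RUNNING, 0 },
	[STEP_SAVING_NOT_RUNNING] = { MIG_SAVING | MIG_RUNNING, MIG_SAVING },
	[STEP_RESUMING]           = { MIG_RESUMING, MIG_RESUMING },
	[STEP_NOT_RESUMING]       = { MIG_RESUMING, 0 },
	[STEP_RUNNING]            = { MIG_RUNNING, MIG_RUNNING },
	[STEP_NOT_NDMA]           = { MIG_NDMA, 0 },
};

/* A group is active when its masked bits have exactly the required value. */
static int group_active(const struct bit_group *g, uint32_t state)
{
	return (state & g->mask) == g->value;
}

/*
 * Fill 'out' with the bit groups that become active when moving from
 * 'old_state' to 'new_state', in priority order. Returns the step count.
 */
static size_t mig_plan_steps(uint32_t old_state, uint32_t new_state,
			     enum mig_step out[STEP_COUNT])
{
	size_t n = 0;
	int i;

	assert(out != NULL);
	for (i = 0; i < STEP_COUNT; i++) {
		if (group_active(&groups[i], new_state) &&
		    !group_active(&groups[i], old_state))
			out[n++] = (enum mig_step)i;
	}
	return n;
}
```

Under these assumptions, a request that moves the device from RUNNING to
SAVING yields the !RUNNING step followed by the SAVING | !RUNNING step, and a
request from RESUMING to NDMA | RUNNING yields NDMA, !RESUMING, then RUNNING.
A real driver would perform its device-specific action for each step instead
of only recording it.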