Re: [PATCH RFC] vfio: Documentation for the migration region

Jason Gunthorpe <jgg@xxxxxxxxxx> · Wed, 24 Nov 2021 14:40:20 -0400

On Wed, Nov 24, 2021 at 05:55:49PM +0100, Cornelia Huck wrote:

> Yes, defining what we mean by "VCPU RUNNING" and "DIRTY TRACKING" first
> makes the most sense.
> 
> (It also imposes some rules on userspace, doesn't it? Whatever it does,
> the interaction with vfio needs to be at least somewhat similar to what
> QEMU or another VMM would do. I wonder if we need to be more concrete
> here; but let's talk about the basic interface first.)

I don't think we need to have excessive precision here. The main
thrust of this as a spec is to define behaviors, and that starts at the
'Actions on Set/Clear' section.

This part is informative so everyone has the same picture in their
mind about what it is we are trying to accomplish. This can be a bit
imprecise.

> > I don't think I like this statement - why/where would the overall flow
> > differ?
> 
> What I meant to say: If we give userspace the flexibility to operate
> this, we also must give different device types some flexibility. While
> subchannels will follow the general flow, they'll probably condense/omit
> some steps, as I/O is quite different to PCI there.

I would say no - migration is general, no device type should get to
violate this spec.  Did you have something specific in mind? There is
very little that is PCI specific here already.

> >> > +     Normal operating state
> >> > +  RUNNING, DIRTY TRACKING, VCPU RUNNING
> >> > +     Log DMAs
> >> > +     Stream all memory
> >> 
> >> all memory accessed by the device?
> >
> > In this reference flow this is all VCPU memory. Ie you start global
> > dirty tracking both in VFIO and in the VCPU and start copying all VM
> > memory.
> 
> So, general migration, not just the vfio specific parts?

Sure. As above, precision isn't important here; the userspace doing
migration should start streaming whatever state it has covered by
dirty logging here.

> "subtly complicated" captures this well :(

Indeed. Frankly, my observation is the team here has invested a lot of
person hours trying to make sense of this and our well-researched take
'this is a FSM' was substantially different from Alex's version 'this
is control bits'. For the 'control bit' model few seem to understand
it at all, and the driver code is short but deceptively complicated.

> For example, if I interpret your list correctly, the driver should
> prioritize clearing RUNNING over setting SAVING | !RUNNING. What does
> that mean? If RUNNING is cleared, first deal with whatever action that
> triggers, then later check if it is actually a case of setting SAVING |
> !RUNNING, and perform the required actions for that?

Yes.

Since this is not a FSM a change between any two valid device_state
values is completely legal. Many of these involve multiple driver
steps. So all drivers must do the actions in the same order to have a
real ABI.

> Also, does e.g. SAVING | RUNNING mean that both SAVING and RUNNING are
> getting set, or only one of them, if the other was already set?

It always refers to the requested migration_state.

> >   SAVING|0 -> SAVING|RUNNING
> >   0|RUNNING -> SAVING|RUNNING
> >   0 -> SAVING|RUNNING

Are all described as userspace requesting a migration_state 
of SAVING | RUNNING

> > For clarity I didn't split things like that. All the continuous
> > behaviors start when the given bits begins and stop when the bits
> > end.
> >
> > Most of the actions talk about changes in the data window
> 
> This might need some better terminology, I did not understand the split
> like that...
> 
> "action trigger" is basically that the driver sets certain bits and a
> certain device action happens. "continuous" means that a certain device
> action is done as long as certain bits are set. Sounds a bit like edge
> triggered vs level triggered to me. What about:

Yes

> - event-triggered actions: bits get set/unset, an action needs to be
>   done

"""Event-triggered actions happen when userspace requests a new
migration_state that differs from the current migration_state. Actions
happen on a bit group basis:"""

> - condition-triggered actions: as long as bits are set/unset, an action
>   needs to be done

"""Continuous actions are in effect so long as the below migration_state bit
   group is active:"""

> >> What does that mean? That the operation setting NDMA in device_state
> >> returns? 
> >
> > Yes, it must be a synchronous behavior.
> 
> To be extra clear: the _setting_ action (e.g. a write), not the
> condition (NDMA set)? Sorry if that sounds nitpicky, but I think we
> should eliminate possible points of confusion early on.

"""Whenever the kernel returns with a migration_state of NDMA there can be no
   in progress DMAs."""

> I'm trying to understand this document without looking at the mlx5
> implementation: Somebody using it as a guide needs to be able to
> implement a driver without looking at another driver (unless they prefer
> to work with examples.) Using the mlx5 driver as the basis for
> _writing_ this document makes sense, but it needs to stand on its own.

That may be an ideal that is too hard to reach :(

Thanks,
Jason

Below is where I have things now:

VFIO migration driver API
-------------------------------------------------------------------------------

VFIO drivers that support migration implement a migration control register
called device_state in the struct vfio_device_migration_info which is in its
VFIO_REGION_TYPE_MIGRATION region.

The device_state controls both device action and continuous behaviour.
Setting/clearing bit groups triggers device action, and each bit controls a
continuous device behaviour.

Along with the device_state the migration driver provides a data window which
allows streaming migration data into or out of the device.
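
A minimal userspace sketch of operating the control register, assuming
device_state is a 32-bit field at the start of the migration info header and
that the region is accessed with pread()/pwrite() on the device fd at the
region's offset. The struct layout is an assumption that should be checked
against linux/vfio.h, and the helper names mig_set_state()/mig_get_state() are
hypothetical:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed layout of the migration info header; verify against linux/vfio.h. */
struct mig_info_sketch {
	uint32_t device_state;
	uint32_t reserved;
	uint64_t pending_bytes;
	uint64_t data_offset;
	uint64_t data_size;
};

/* Request a new device_state. region_off is the offset of the migration
 * region inside the device fd (hypothetical parameter). */
static int mig_set_state(int fd, off_t region_off, uint32_t state)
{
	off_t off = region_off +
		    (off_t)offsetof(struct mig_info_sketch, device_state);

	if (pwrite(fd, &state, sizeof(state), off) != (ssize_t)sizeof(state)) {
		perror("pwrite device_state");
		return -1;
	}
	return 0;
}

/* Read back the current device_state. */
static int mig_get_state(int fd, off_t region_off, uint32_t *state)
{
	off_t off = region_off +
		    (off_t)offsetof(struct mig_info_sketch, device_state);

	assert(state != NULL);
	if (pread(fd, state, sizeof(*state), off) != (ssize_t)sizeof(*state)) {
		perror("pread device_state");
		return -1;
	}
	return 0;
}
```

The helpers only move bytes; which transitions are legal and what they
trigger is decided by the migration driver.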

A lot of flexibility is provided to userspace in how it operates these
bits. What follows is a reference flow for saving device state in a live
migration, with all features, and an illustration of how the external non-VFIO
entities the VMM controls (VCPU_RUNNING and DIRTY_TRACKING) fit in.

  RUNNING, VCPU_RUNNING
     Normal operating state
  RUNNING, DIRTY_TRACKING, VCPU_RUNNING
     Log DMAs

     Stream all memory
  SAVING | RUNNING, DIRTY_TRACKING, VCPU_RUNNING
     Log internal device changes (pre-copy)

     Stream device state through the migration window

     While in this state repeat as desired:

	Atomic Read and Clear DMA Dirty log

	Stream dirty memory
  SAVING | NDMA | RUNNING, VCPU_RUNNING
     vIOMMU grace state

     Complete all in progress IO page faults, idle the vIOMMU
  SAVING | NDMA | RUNNING
     Peer to Peer DMA grace state

     Final snapshot of DMA dirty log (atomic not required)
  SAVING
     Stream final device state through the migration window

     Copy final dirty data
  0
     Device is halted

and the reference flow for resuming:

  RUNNING
     Issue VFIO_DEVICE_RESET to clear the internal device state
  0
     Device is halted
  RESUMING
     Push in migration data. Data captured during pre-copy should be
     prepended to data captured during SAVING.
  NDMA | RUNNING
     Peer to Peer DMA grace state
  RUNNING, VCPU_RUNNING
     Normal operating state

If the VMM has multiple VFIO devices undergoing migration then the grace
states act as cross device synchronization points. The VMM must bring all
devices to the grace state before advancing past it.
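
A rough sketch of the cross device synchronization, using simulated devices
instead of real VFIO fds. The bit positions, struct sim_dev and the function
names are assumptions made for illustration. The VMM moves every device to a
step of the save flow before any device moves to the next one:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit positions (assumption, not taken from the text). */
enum {
	ST_RUNNING  = 1u << 0,
	ST_SAVING   = 1u << 1,
	ST_RESUMING = 1u << 2,
	ST_NDMA     = 1u << 3,
};

/* Distinct device_state values of the save reference flow, in order. */
static const uint32_t save_flow[] = {
	ST_RUNNING,
	ST_SAVING | ST_RUNNING,
	ST_SAVING | ST_NDMA | ST_RUNNING,	/* grace states */
	ST_SAVING,
	0,
};
#define SAVE_FLOW_LEN (sizeof(save_flow) / sizeof(save_flow[0]))

struct sim_dev {
	uint32_t state;
};

static bool all_at(const struct sim_dev *devs, size_t n, uint32_t state)
{
	for (size_t i = 0; i < n; i++)
		if (devs[i].state != state)
			return false;
	return true;
}

/* Walk all devices through the flow in lock step; returns the step count. */
static size_t run_save_flow(struct sim_dev *devs, size_t n)
{
	size_t steps = 0;

	for (size_t s = 1; s < SAVE_FLOW_LEN; s++) {
		/* Stands in for a device_state write on each device. */
		for (size_t i = 0; i < n; i++)
			devs[i].state = save_flow[s];
		/* Synchronization point: nobody advances until all arrive. */
		assert(all_at(devs, n, save_flow[s]));
		steps++;
	}
	return steps;
}
```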

The above reference flows are built around specific requirements on the
migration driver for its implementation of the migration_state input.

Event triggered actions happen when userspace requests a new migration_state
that differs from the current migration_state. Actions happen on a bit group
basis:

 - SAVING | RUNNING
   The device clears the data window and begins streaming 'pre copy' migration
   data through the window. Devices that cannot log internal state changes
   return a 0 length migration stream.

 - SAVING | !RUNNING
   The device captures its internal state that is not covered by internal
   logging, as well as any logged changes.

   The device clears the data window and begins streaming the captured
   migration data through the window. Devices that cannot log internal state
   changes stream all of their device state here.

 - RESUMING
   The data window is cleared, opened and can receive the migration data
   stream.

 - !RESUMING
   All the data transferred into the data window is loaded into the device's
   internal state. The migration driver can rely on userspace issuing a
   VFIO_DEVICE_RESET prior to starting RESUMING.

   To abort a RESUMING issue a VFIO_DEVICE_RESET.

   If the migration data is invalid then the ERROR state must be set.
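
One possible reading of the list above, as a sketch: a bit group's action
fires when the group is inactive in the current state and active in the
requested one, and the !RESUMING action fires when RESUMING goes from set to
clear. The bit positions and the EV_* names are assumptions for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions (assumption, not taken from the text). */
enum {
	ST_RUNNING  = 1u << 0,
	ST_SAVING   = 1u << 1,
	ST_RESUMING = 1u << 2,
	ST_NDMA     = 1u << 3,
	ST_ALL      = ST_RUNNING | ST_SAVING | ST_RESUMING | ST_NDMA,
};

/* Hypothetical names for the four event triggered actions. */
enum {
	EV_PRECOPY_START  = 1u << 0,	/* SAVING | RUNNING  */
	EV_STOPCOPY_START = 1u << 1,	/* SAVING | !RUNNING */
	EV_RESUME_OPEN    = 1u << 2,	/* RESUMING          */
	EV_RESUME_LOAD    = 1u << 3,	/* !RESUMING         */
};

static bool grp_save_running(uint32_t s)
{
	return (s & ST_SAVING) && (s & ST_RUNNING);
}

static bool grp_save_stopped(uint32_t s)
{
	return (s & ST_SAVING) && !(s & ST_RUNNING);
}

static bool grp_resuming(uint32_t s)
{
	return (s & ST_RESUMING) != 0;
}

/* Mask of event triggered actions for a cur -> req request. */
static unsigned int event_actions(uint32_t cur, uint32_t req)
{
	unsigned int ev = 0;

	assert(((cur | req) & ~(uint32_t)ST_ALL) == 0);
	if (!grp_save_running(cur) && grp_save_running(req))
		ev |= EV_PRECOPY_START;
	if (!grp_save_stopped(cur) && grp_save_stopped(req))
		ev |= EV_STOPCOPY_START;
	if (!grp_resuming(cur) && grp_resuming(req))
		ev |= EV_RESUME_OPEN;
	if (grp_resuming(cur) && !grp_resuming(req))
		ev |= EV_RESUME_LOAD;
	return ev;
}
```

Under this reading SAVING|0, 0|RUNNING and 0 all trigger the same pre-copy
action when SAVING | RUNNING is requested, while adding NDMA to
SAVING | RUNNING triggers none.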

Continuous actions are in effect when migration_state bit groups are active:

 - RUNNING | NDMA
   The device is not allowed to issue new DMA operations.

   Whenever the kernel returns with a migration_state of NDMA there can be no
   in progress DMAs.

 - !RUNNING
   The device should not change its internal state. Further implies the NDMA
   behavior above.

 - SAVING | !RUNNING
   RESUMING | !RUNNING
   The device may assume there are no incoming MMIO operations.

   Internal state logging can stop.

 - RUNNING
   The device can alter its internal state and must respond to incoming MMIO.

 - SAVING | RUNNING
   The device is logging changes to the internal state.

 - ERROR
   The behavior of the device is largely undefined. The device must be
   recovered by issuing VFIO_DEVICE_RESET or closing the device file
   descriptor.

   However, devices supporting NDMA must behave as though NDMA is asserted
   during ERROR to avoid corrupting other devices or a VM during a failed
   migration.
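
The continuous behaviors can be sketched as a pure function of the current
state. The bit positions, the separate ST_ERROR bit and struct behavior are
assumptions for illustration; the text does not say how ERROR is encoded, and
everything except the DMA rule is undefined in ERROR, so the sketch picks the
most conservative answer there:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions (assumption, not taken from the text). */
enum {
	ST_RUNNING  = 1u << 0,
	ST_SAVING   = 1u << 1,
	ST_RESUMING = 1u << 2,
	ST_NDMA     = 1u << 3,
	ST_ERROR    = 1u << 4,	/* hypothetical encoding of ERROR */
	ST_ALL      = ST_RUNNING | ST_SAVING | ST_RESUMING | ST_NDMA | ST_ERROR,
};

struct behavior {
	bool may_issue_dma;	/* new DMA operations allowed */
	bool may_change_state;	/* internal state may change */
	bool must_answer_mmio;	/* incoming MMIO must be handled */
	bool logging_state;	/* internal state changes are logged */
};

static struct behavior continuous_behavior(uint32_t s)
{
	struct behavior b = { false, false, false, false };
	bool running = (s & ST_RUNNING) != 0;

	assert((s & ~(uint32_t)ST_ALL) == 0);

	/* ERROR: behave as though NDMA is asserted; the rest is undefined,
	 * so report nothing as permitted. */
	if (s & ST_ERROR)
		return b;

	/* !RUNNING implies the NDMA behavior. */
	b.may_issue_dma = running && !(s & ST_NDMA);
	b.may_change_state = running;
	b.must_answer_mmio = running;
	b.logging_state = running && (s & ST_SAVING);
	return b;
}
```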

When multiple bits change in the migration_state they may describe multiple
event triggered actions, and multiple changes to continuous actions.  The
migration driver must process them in a priority order:

 - SAVING | RUNNING
 - NDMA
 - !RUNNING
 - SAVING | !RUNNING
 - RESUMING
 - !RESUMING
 - RUNNING
 - !NDMA
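
A sketch of how a driver might turn one requested change into an ordered list
of steps, assuming a step applies when its bit group is inactive in the
current state and active in the requested state. The bit positions, enum step
and plan_transition() are hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit positions (assumption, not taken from the text). */
enum {
	ST_RUNNING  = 1u << 0,
	ST_SAVING   = 1u << 1,
	ST_RESUMING = 1u << 2,
	ST_NDMA     = 1u << 3,
};

/* Steps listed in the priority order given above. */
enum step {
	STEP_SAVING_RUNNING,
	STEP_NDMA,
	STEP_NOT_RUNNING,
	STEP_SAVING_NOT_RUNNING,
	STEP_RESUMING,
	STEP_NOT_RESUMING,
	STEP_RUNNING,
	STEP_NOT_NDMA,
	STEP_COUNT,
};

/* Is the bit group behind this step active in state s? */
static bool group_active(enum step st, uint32_t s)
{
	switch (st) {
	case STEP_SAVING_RUNNING:
		return (s & ST_SAVING) && (s & ST_RUNNING);
	case STEP_NDMA:
		return (s & ST_NDMA) != 0;
	case STEP_NOT_RUNNING:
		return !(s & ST_RUNNING);
	case STEP_SAVING_NOT_RUNNING:
		return (s & ST_SAVING) && !(s & ST_RUNNING);
	case STEP_RESUMING:
		return (s & ST_RESUMING) != 0;
	case STEP_NOT_RESUMING:
		return !(s & ST_RESUMING);
	case STEP_RUNNING:
		return (s & ST_RUNNING) != 0;
	case STEP_NOT_NDMA:
		return !(s & ST_NDMA);
	default:
		return false;
	}
}

/* Fill out[] with the steps for cur -> req in priority order. */
static size_t plan_transition(uint32_t cur, uint32_t req,
			      enum step out[STEP_COUNT])
{
	size_t n = 0;

	assert(out != NULL);
	for (int st = 0; st < STEP_COUNT; st++)
		if (!group_active((enum step)st, cur) &&
		    group_active((enum step)st, req))
			out[n++] = (enum step)st;
	return n;
}
```

For example, going from RUNNING to SAVING yields !RUNNING followed by
SAVING | !RUNNING, which matches the ordering discussed earlier in the thread.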

In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the
device back to device_state RUNNING. When a migration driver executes this
ioctl it should discard the data window and set migration_state to RUNNING as
part of resetting the device to a clean state. This must happen even if the
migration_state has errored. A freshly opened device FD should always be in
the RUNNING state.
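
A small sketch of the reset rule on a simulated device. struct sim_dev, its
fields and the function names are assumptions for illustration, not driver
code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative bit positions (assumption, not taken from the text). */
enum {
	ST_RUNNING = 1u << 0,
	ST_SAVING  = 1u << 1,
};

struct sim_dev {
	uint32_t state;
	int errored;			/* hypothetical ERROR indication */
	size_t window_len;		/* bytes pending in the data window */
	unsigned char window[64];
};

/* A freshly opened device starts in RUNNING with an empty data window. */
static void sim_open(struct sim_dev *d)
{
	assert(d != NULL);
	memset(d, 0, sizeof(*d));
	d->state = ST_RUNNING;
}

/* Reset: discard the data window and return to RUNNING, even after an
 * error. */
static void sim_reset(struct sim_dev *d)
{
	assert(d != NULL);
	memset(d->window, 0, sizeof(d->window));
	d->window_len = 0;
	d->errored = 0;
	d->state = ST_RUNNING;
}
```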

The migration driver has limitations on what device state it can affect. Any
device state controlled by general kernel subsystems must not be changed
during RESUMING, and SAVING must tolerate mutation of this state. Changes to
externally controlled device state can happen at any time, asynchronously to
the migration (e.g. interrupt rebalancing).

Some examples of externally controlled state:
 - MSI-X interrupt page
 - MSI/legacy interrupt configuration
 - Large parts of the PCI configuration space, e.g. common control bits
 - PCI power management
 - Changes via VFIO_DEVICE_SET_IRQS

During !RUNNING, especially during SAVING and RESUMING, the device may have
limitations on what it can tolerate. An ideal device will discard/return all
ones to all incoming MMIO/PIO operations (exclusive of the external state
above) in !RUNNING. However, devices are free to have undefined behavior if
they receive MMIOs. This includes corrupting/aborting the migration, dirtying
pages, and segfaulting userspace.

However, a device may not compromise system integrity if it is subjected to a
MMIO. It can not trigger an error TLP, it can not trigger a Machine Check, and
it can not compromise device isolation.
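
A sketch of the "ideal device" behavior on a simulated register file, assuming
a hypothetical struct sim_dev with four 32-bit registers. Externally
controlled state is left out of the model:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit positions (assumption, not taken from the text). */
enum {
	ST_RUNNING = 1u << 0,
	ST_SAVING  = 1u << 1,
};

#define SIM_NREGS 4

struct sim_dev {
	uint32_t state;
	uint32_t regs[SIM_NREGS];
};

/* Reads return all ones while the device is !RUNNING. */
static uint32_t sim_mmio_read(const struct sim_dev *d, size_t reg)
{
	assert(d != NULL && reg < SIM_NREGS);
	if (!(d->state & ST_RUNNING))
		return UINT32_MAX;
	return d->regs[reg];
}

/* Writes are discarded while the device is !RUNNING. */
static void sim_mmio_write(struct sim_dev *d, size_t reg, uint32_t val)
{
	assert(d != NULL && reg < SIM_NREGS);
	if (!(d->state & ST_RUNNING))
		return;
	d->regs[reg] = val;
}
```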

There are several edge cases that userspace should keep in mind when
implementing migration:

- Device Peer to Peer DMA. In this case devices are able to issue DMAs to each
  other's MMIO regions. The VMM can permit this if it maps the MMIO memory into
  the IOMMU.

  As Peer to Peer DMA is a MMIO touch like any other, it is important that
  userspace suspend these accesses before entering any device_state where MMIO
  is not permitted, such as !RUNNING. This can be accomplished with the NDMA
  state. Userspace may also choose to remove MMIO mappings from the IOMMU if the
  device does not support NDMA, and rely on that to guarantee quiet MMIO.

  The Peer to Peer Grace States exist so that all devices may reach RUNNING
  before any device is subjected to a MMIO access.

  Failure to guarantee quiet MMIO may allow a hostile VM to use P2P to violate
  the no-MMIO restriction during SAVING or RESUMING and corrupt the migration on
  devices that cannot protect themselves.

- IOMMU Page faults handled in userspace can occur at any time. A migration
  driver is not required to serialize in-progress page faults. It can assume
  that all page faults are completed before entering SAVING | !RUNNING. Since
  the guest VCPU is required to complete page faults the VMM can accomplish this
  by asserting NDMA | VCPU_RUNNING and clearing all pending page faults before
  clearing VCPU_RUNNING.

  Devices that do not support NDMA cannot be configured to generate page faults
  that require the VCPU to complete.

- pre-copy allows the device to implement a dirty log for its internal state.
  During the SAVING | RUNNING state the data window should present the device
  state being logged and during SAVING | !RUNNING the data window should present
  the unlogged device state as well as the changes from the internal dirty log.

  On RESUME these two data streams are concatenated together.

  pre-copy is only concerned with internal device state. External DMAs are
  covered by the separate DIRTY_TRACKING function.

- Atomic Read and Clear of the DMA log is a HW feature. If the tracker
  cannot support this, then NDMA could be used to synthesize it less
  efficiently.

- NDMA is optional. If the device does not support it then the NDMA states
  are pushed down to the next step in the sequence and various behaviors that
  rely on NDMA cannot be used.

- Migration control registers inside the same iommu_group as the VFIO device.
  This immediately raises a security concern as userspace can use Peer to Peer
  DMA to manipulate these migration control registers concurrently with
  any kernel actions.

  A device driver operating such a device must ensure that kernel integrity
  can not be broken by hostile user space operating the migration MMIO
  registers via peer to peer, at any point in the sequence. Notably the kernel
  cannot use DMA to transfer any migration data.

  However, as discussed above in the "Device Peer to Peer DMA" section, it can
  assume quiet MMIO as a condition to have a successful and uncorrupted
  migration.
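
For the pre-copy case above, a sketch of how userspace on the destination
might assemble the stream it pushes in during RESUMING: the data captured in
SAVING | RUNNING first, then the data captured in SAVING | !RUNNING. struct
mig_stream and build_resume_stream() are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct mig_stream {
	unsigned char *data;
	size_t len;
};

/* Concatenate pre-copy data followed by stop-copy data into out.
 * Returns 0 on success; the caller frees out->data. */
static int build_resume_stream(const struct mig_stream *precopy,
			       const struct mig_stream *stopcopy,
			       struct mig_stream *out)
{
	assert(precopy != NULL && stopcopy != NULL && out != NULL);

	out->len = precopy->len + stopcopy->len;
	out->data = malloc(out->len ? out->len : 1);
	if (!out->data)
		return -1;

	/* A device without internal logging has a 0 length pre-copy stream. */
	if (precopy->len)
		memcpy(out->data, precopy->data, precopy->len);
	if (stopcopy->len)
		memcpy(out->data + precopy->len, stopcopy->data, stopcopy->len);
	return 0;
}
```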

To elaborate details on the reference flows, they assume the following details
about the external behaviors:

 - !VCPU_RUNNING
   Userspace must not generate dirty pages or issue MMIO operations to devices.
   For a VMM this would typically be a control toward KVM.

 - DIRTY_TRACKING
   Clear the DMA log and start DMA logging

   DMA logs should be readable with an "atomic test and clear" to allow
   continuous non-disruptive sampling of the log.

   This is controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the container
   fd.

 - !DIRTY_TRACKING
   Freeze the DMA log, stop tracking and allow userspace to read it.

   If userspace is going to have any use of the dirty log it must ensure
   that all DMA is suspended before clearing DIRTY_TRACKING, for instance by
   using NDMA or !RUNNING on all VFIO devices.
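
To illustrate what "atomic test and clear" buys, here is a software model of a
DMA dirty log using C11 atomics. It is only a sketch: a real tracker is a HW or
IOMMU feature, and struct dma_log and both function names are hypothetical.
Each word is read and zeroed in a single atomic exchange, so a page dirtied
concurrently with the sampling shows up either in this sample or in the next
one, never in neither:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_WORDS 4	/* tracks LOG_WORDS * 64 pages */

struct dma_log {
	_Atomic uint64_t words[LOG_WORDS];
};

/* Called on the DMA side when a page is written. */
static void log_mark_dirty(struct dma_log *log, size_t pfn)
{
	assert(log != NULL && pfn < (size_t)LOG_WORDS * 64);
	atomic_fetch_or(&log->words[pfn / 64], UINT64_C(1) << (pfn % 64));
}

/* Sample the log: copy each word to out[] and clear it atomically.
 * Returns the number of dirty pages found in this sample. */
static size_t log_read_and_clear(struct dma_log *log, uint64_t out[LOG_WORDS])
{
	size_t dirty = 0;

	assert(log != NULL && out != NULL);
	for (size_t i = 0; i < LOG_WORDS; i++) {
		out[i] = atomic_exchange(&log->words[i], (uint64_t)0);
		for (uint64_t w = out[i]; w; w &= w - 1)
			dirty++;
	}
	return dirty;
}
```

Without such a primitive, the read and the clear are separate operations and
DMA has to be quiesced between them, which is where NDMA would come in.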