RE: [PATCH RFC] vfio: Revise and update the migration uAPI description

"Tian, Kevin" <kevin.tian@xxxxxxxxx> · Wed, 26 Jan 2022 01:49:09 +0000

> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Wednesday, January 26, 2022 9:33 AM
> 
> On Wed, Jan 26, 2022 at 01:17:26AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > > Sent: Tuesday, January 25, 2022 9:12 PM
> > >
> > > On Tue, Jan 25, 2022 at 03:55:31AM +0000, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > > > > Sent: Saturday, January 15, 2022 3:35 AM
> > > > > + *
> > > > > + *   The peer to peer (P2P) quiescent state is intended to be a quiescent
> > > > > + *   state for the device for the purposes of managing multiple devices within
> > > > > + *   a user context where peer-to-peer DMA between devices may be active. The
> > > > > + *   PRE_COPY_P2P and RUNNING_P2P states must prevent the device from
> > > > > + *   initiating any new P2P DMA transactions. If the device can identify P2P
> > > > > + *   transactions then it can stop only P2P DMA, otherwise it must stop all
> > > > > + *   DMA.  The migration driver must complete any such outstanding operations
> > > > > + *   prior to completing the FSM arc into either P2P state.
> > > > > + *
> > > >
> > > > Now NDMA is renamed to P2P... but we did discuss potentially
> > > > using this state on devices which cannot stop DMA quickly and
> > > > thus need to drain pending page requests, which further requires
> > > > running vCPUs if the fault is on the guest I/O page table.
> > >
> > > I think this needs to be fleshed out more before we can add it,
> > > ideally along with a driver and some qemu implementation
> >
> > Yes. We have internal implementation but it has to be cleaned up
> > based on this new proposal.
> >
> > >
> > > It looks like the qemu part for this will not be so easy..
> > >
> >
> > My point is that since we know that usage is on the radar (though it
> > needs more discussion with a real example), does it make sense to make
> > the current name more general? I'm not sure how many devices can tell
> > P2P apart from normal DMAs. If most devices have to stop all DMAs to
> > meet the requirement, giving it a name about stopping all DMAs doesn't
> > hurt the current P2P requirement and is more extensible to cover other
> > stop-dma requirements.
> 
> Except you are not talking about stopping all DMAs, you are talking
> about a state that might hang indefinitely waiting for a vPRI to
> complete
> 
> In my mind this is completely different, and may motivate another
> state in the graph
> 
>   PRE_COPY -> PRE_COPY_STOP_PRI -> PRE_COPY_STOP_P2P -> STOP_COPY
> 
> As STOP_PRI can be defined as halting any new PRIs and always return
> immediately.

The problem is that on such devices PRIs are continuously triggered
while the driver tries to drain the in-flight requests to enter STOP_P2P
or STOP_COPY. If we simply halt any new PRIs in STOP_PRI, it
essentially implies no migration support for such devices.

> 
> STOP_P2P can hang if PRIs are open

In earlier discussions we agreed on a timeout mechanism to avoid such
a hang.

> 
> This affords a pretty clean approach for userspace to conclude the
> open PRIs or decide it has to give up the migration.
> 
> Theoretical future devices that can support aborting PRI would not use
> this state and would have STOP_P2P as also being NO_PRI. On this
> device userspace would somehow abort the PRIs when it reaches
> STOP_COPY.
> 
> Or at least that is one possibility.
> 
> In any event, the v2 is built as Alex and Cornelia were suggesting
> with a minimal base feature set and two optional extensions for P2P
> and PRE_COPY. Adding a 3rd extension for vPRI is completely
> reasonable.

I'm fine if adding a 3rd extension works. But here imho the requirement
can be restated as: the user expects to stop all DMAs while the vCPU
is still running. If PRIs are triggered during that operation, they will
be handled by the running vCPU. If any corner case blocks it, the timeout
mechanism allows aborting the migration process.
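
To illustrate the flow described above, here is a rough sketch as a plain C
model with no real VFIO calls. Everything in it (struct sim_dev,
drain_step(), enter_stop_dma(), and counting the timeout in drain steps
rather than wall-clock time) is hypothetical and exists only for
illustration; it is not the proposed uAPI or our internal implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a device that can only quiesce by draining. */
struct sim_dev {
	unsigned int inflight;		/* DMA requests still to be drained */
	unsigned int faults_left;	/* page requests (PRIs) the drain will raise */
	bool dma_stopped;		/* set once the stop-DMA state is reached */
};

/* One unit of drain work. Makes no progress if a PRI cannot be served. */
static void drain_step(struct sim_dev *dev, bool vcpu_running)
{
	if (dev->faults_left) {
		/* A page request is pending; only a running vCPU can serve it. */
		if (vcpu_running)
			dev->faults_left--;
		return;
	}
	if (dev->inflight)
		dev->inflight--;
}

/*
 * Try to reach the "all DMA stopped" state within max_ticks drain steps.
 * Returns 0 on success, or -1 if the timeout expired, in which case the
 * caller is expected to abort the migration.
 */
static int enter_stop_dma(struct sim_dev *dev, bool vcpu_running,
			  unsigned int max_ticks)
{
	unsigned int tick = 0;

	assert(dev);

	while (dev->inflight || dev->faults_left) {
		if (tick++ >= max_ticks)
			return -1;	/* timeout: give up instead of hanging */
		drain_step(dev, vcpu_running);
	}

	dev->dma_stopped = true;
	return 0;
}
```

With the vCPU running the drain completes once every fault has been served;
with the vCPU stopped the same device never makes progress and the timeout
is the only way out.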

> 
> Further, from what I can understand devices doing PRI are incompatible
> with the base feature set anyhow, as they can not support a RUNNING ->
> STOP_COPY transition without, minimally, completing all the open
> vPRIs. As VMMs implementing the base protocol should stop the vCPU and
> then move the device to STOP_COPY, it is inherently incompatible with
> what you are proposing.

My understanding is that STOP_P2P is entered before stopping vCPU.
If that state can be extended for STOP_DMA, then it's compatible.
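
The ordering argument can be sketched as a small check, again as a toy
model only. The step names (ENTER_STOP_DMA, STOP_VCPU, ENTER_STOP_COPY) and
save_sequence_ok() are made up for this mail and are not uAPI state names;
ENTER_STOP_DMA stands for STOP_P2P extended to stop all DMA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical VMM actions on the save path; not uAPI names. */
enum vmm_step {
	ENTER_STOP_DMA,		/* STOP_P2P extended to stop all DMA */
	STOP_VCPU,
	ENTER_STOP_COPY,
};

/*
 * Returns true if a device can follow the given sequence of steps.
 * needs_vcpu_to_drain marks a device whose drain may raise page requests
 * that only a running vCPU can serve.
 */
static bool save_sequence_ok(const enum vmm_step *seq, size_t n,
			     bool needs_vcpu_to_drain)
{
	bool vcpu_running = true;
	bool dma_stopped = false;
	size_t i;

	assert(seq != NULL || n == 0);

	for (i = 0; i < n; i++) {
		switch (seq[i]) {
		case ENTER_STOP_DMA:
			/* Draining may fault, so the vCPU must still run. */
			if (needs_vcpu_to_drain && !vcpu_running)
				return false;
			dma_stopped = true;
			break;
		case STOP_VCPU:
			vcpu_running = false;
			break;
		case ENTER_STOP_COPY:
			/* If DMA was not stopped earlier, the drain happens here. */
			if (!dma_stopped && needs_vcpu_to_drain && !vcpu_running)
				return false;
			dma_stopped = true;
			break;
		}
	}
	return true;
}
```

In this model the base sequence (stop the vCPU, then STOP_COPY) fails for a
PRI-dependent device, while entering the stop-DMA state before stopping the
vCPU succeeds.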

> 
> The new vPRI enabled protocol would have to supersede the base
> protocol and eliminate implicit transitions through the VPRI
> maintenance states as these are non-transparent.
> 
> It is all stuff we can do in the FSM model, but it all needs a careful
> think and a FSM design.
> 
> (there is also the interesting question how to even detect this as
> vPRI special cases should only even exist if the device was bound to a
> PRI capable io page table, so a single device may or may not use this
> depending, and at least right now things are assuming these flags are
> static at device setup time, so hurm)
> 

Need more thinking on this part.

Thanks
Kevin