Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state

Alex Williamson <alex.williamson@xxxxxxxxxx> · Mon, 10 Jan 2022 10:34:04 -0700

On Mon, 10 Jan 2022 07:55:16 +0000
"Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:

> > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > Sent: Friday, January 7, 2022 5:21 AM
> > 
> > We were also thinking to retain STOP. SAVING -> STOP could possibly
> > serve as the abort path to avoid a double action, and some of the use
> > cases you ID'd below are achievable if STOP remains.  
> 
> what is the exact difference between a null state {} (implying !RUNNING)
> and STOP in this context?
> 
> If they are different, who (user or driver) should conduct and when do 
> we expect transitioning the device into a null state?

Sorry if I added confusion here, the null, ie. {}, state fit my
notation better.  The null state is simply no bits set in device_state,
it's equivalent to "STOP" without coming up with a new name for every
set of bit combinations.

> > > We have 20 possible transitions.  I've marked those available via the
> > > "odd ascii art" diagram as (a), that's 7 transitions.  We could
> > > essentially remove the NULL state as unreachable as there seems little
> > > value in the 2 transitions marked (a)* if we look only at migration,
> > > that would bring us down to 5 of 12 possible transitions.  We need to
> > > give userspace an abort path though, so we minimally need the 2
> > > transitions marked (b) (7/12).  
> >   
> > > So now we can discuss the remaining 5 transitions:
> > >
> > > {SAVING} -> {RESUMING}
> > > 	If not supported, user can achieve this via:
> > > 		{SAVING}->{RUNNING}->{RESUMING}
> > > 		{SAVING}-RESET->{RUNNING}->{RESUMING}  
> > 
> > This can be:
> > 
> > SAVING -> STOP -> RESUMING  
> 
> From Alex's original description the default device state is RUNNING.
> This supposed to be the initial state on the dest machine for the
> device assigned to Qemu before Qemu resumes the device state.
> Then how do we eliminate the RUNNING state in above flow? Who
> makes STOP as the initial state on the dest node?

The device must be RUNNING by default.  This is a requirement that
introduction of migration support for a device cannot break
compatibility with existing userspace that may no support migration
features.  It would be QEMU's responsibility to transition a migration
target device from the default state to a state to accept a migration.
There's no discussion here of eliminating the {RUNNING}->{RESUMING}
transition.

> > > drivers follow the previously provided pseudo algorithm with the
> > > requirement that they cannot pass through an invalid state, we need to
> > > formally address whether the NULL state is invalid or just not
> > > reachable by the user.  
> > 
> > What is a NULL state?  
> 
> Hah, seems I'm not the only one having this confusion. 😊

Sorry again, I thought it could easily be deduced by including it in
the state transitions.

> > We have defined (from memory, forgive me I don't have access to
> > Yishai's latest code at the moment) 8 formal FSM states:
> > 
> >  RUNNING
> >  PRECOPY
> >  PRECOPY_NDMA
> >  STOP_COPY
> >  STOP
> >  RESUMING
> >  RESUMING_NDMA
> >  ERROR (perhaps MUST_RESET)  
> 
> ERROR->SHUTDOWN? Usually a shutdown state implies reset required...

The userspace process can go away at any time, what exactly is a
"SHUTDOWN" state representing?

> > > But I think you've identified two classes of DMA, MSI and everything
> > > else.  The device can assume that an MSI is special and not included in
> > > NDMA, but it can't know whether other arbitrary DMAs are p2p or memory.
> > > If we define that the minimum requirement for multi-device migration is
> > > to quiesce p2p DMA, ex. by not allowing it at all, then NDMA is
> > > actually significantly more restrictive while it's enabled.  
> > 
> > You are right, but in any practical physical device NDMA will be
> > implemented by halting all DMAs, not just p2p ones.
> > 
> > I don't mind what we label this, so long as we understand that halting
> > all DMA is a valid device implementation.
> > 
> > Actually, having reflected on this now, one of the things on my list
> > to fix in iommufd, is that mdevs can get access to P2P pages at all.
> > 
> > This is currently buggy as-is because they cannot DMA map these
> > things, touch them with the CPU and kmap, or do, really, anything with
> > them.  
> 
> Can you elaborate why mdev cannot access p2p pages?
> 
> > 
> > So we should be blocking mdev's from accessing P2P, and in that case a
> > mdev driver can quite rightly say it doesn't support P2P at all and
> > safely NOP NDMA if NDMA is defined to only impact P2P transactions.
> > 
> > Perhaps that answers the question for the S390 drivers as well.
> >   
> > > Should a device in the ERROR state continue operation or be in a
> > > quiesced state?  It seems obvious to me that since the ERROR state is
> > > essentially undefined, the device should cease operations and be
> > > quiesced by the driver.  If the device is continuing to operate in the
> > > previous state, why would the driver place the device in the ERROR
> > > state?  It should have returned an errno and left the device in the
> > > previous state.  
> > 
> > What we found while implementing is the use of ERROR arises when the
> > driver has been forced to do multiple actions and is unable to fully
> > unwind the state. So the device_state is not fully the original state
> > or fully the target state. Thus it is ERROR.
> > 
> > The additional requirement that the driver do another action to
> > quiesce the device only introduces the possiblity for triple failure.
> > 
> > Since it doesn't bring any value to userspace, I prefer we not define
> > things in this complicated way.  
> 
> So ERROR is really about unrecoverable failures. If recoverable suppose
> errno should have been returned and the device is still in the original
> state. Is this understanding correct?

Yes, that's how I understand it.  The ERROR state should be used if the
driver can neither provide the requested new state nor remain/return to
the old state.

> btw which errno indicates to the user that the device is back to the
> original state or in the ERROR state? or want the user to always check
> the device state upon any transition error?

No such encoding is defined or required, per the existing uAPI the user
is to re-evaluate device_state on any non-zero errno.  Thanks,

Alex