Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity

Jason Gunthorpe <jgg@xxxxxxxxxx> · Wed, 29 Sep 2021 15:26:42 -0300

On Wed, Sep 29, 2021 at 12:06:55PM -0600, Alex Williamson wrote:
> On Wed, 29 Sep 2021 13:16:02 -0300
> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> 
> > On Tue, Sep 28, 2021 at 02:18:44PM -0600, Alex Williamson wrote:
> > > On Tue, 28 Sep 2021 16:35:50 -0300
> > > Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> > >   
> > > > On Tue, Sep 28, 2021 at 01:19:58PM -0600, Alex Williamson wrote:
> > > >   
> > > > > In defining the device state, we tried to steer away from defining it
> > > > > in terms of the QEMU migration API, but rather as a set of controls
> > > > > that could be used to support that API to leave us some degree of
> > > > > independence that QEMU implementation might evolve.    
> > > > 
> > > > That is certainly a different perspective, it would have been
> > > > better to not express this idea as a FSM in that case...
> > > > 
> > > > So each state in mlx5vf_pci_set_device_state() should call the correct
> > > > combination of (un)freeze, (un)quiesce and so on so each state
> > > > reflects a defined operation of the device?  
> > > 
> > > I'd expect so, for instance the implementation of entering the _STOP
> > > state presumes a previous state that where the device is apparently
> > > already quiesced.  That doesn't support a direct _RUNNING -> _STOP
> > > transition where I argued in the linked threads that those states
> > > should be reachable from any other state.  Thanks,  
> > 
> > If we focus on mlx5 there are two device 'flags' to manage:
> >  - Device cannot issue DMAs
> >  - Device internal state cannot change (ie cannot receive DMAs)
> > 
> > This is necessary to co-ordinate across multiple devices that might be
> > doing peer to peer DMA between them. The whole multi-device complex
> > should be moved to "cannot issue DMA's" then the whole complex would
> > go to "state cannot change" and be serialized.
> 
> Are you anticipating p2p from outside the VM?  The typical scenario
> here would be that p2p occurs only intra-VM, so all the devices would
> stop issuing DMA (modulo trying to quiesce devices simultaneously).

Inside the VM.

Your 'modulo trying to quiesce devices simultaneously' is correct -
this is a real issue that needs to be solved.

If we put one device in a state where it's internal state is immutable
it can no longer accept DMA messages from the other devices. So there
are two states in the HW model - do not generate DMAs and finally the
immutable internal state where even external DMAs are refused.

> > The expected sequence at the device is thus
> > 
> > Resuming
> >  full stop -> does not issue DMAs -> full operation
> > Suspend
> >  full operation -> does not issue DMAs -> full stop
> > 
> > Further the device has two actions
> >  - Trigger serializating the device state
> >  - Trigger de-serializing the device state
> > 
> > So, what is the behavior upon each state:
> > 
> >  *  000b => Device Stopped, not saving or resuming
> >      Does not issue DMAs
> >      Internal state cannot change
> > 
> >  *  001b => Device running, which is the default state
> >      Neither flags
> > 
> >  *  010b => Stop the device & save the device state, stop-and-copy state
> >      Does not issue DMAs
> >      Internal state cannot change
> > 
> >  *  011b => Device running and save the device state, pre-copy state
> >      Neither flags
> >      (future, DMA tracking turned on)
> > 
> >  *  100b => Device stopped and the device state is resuming
> >      Does not issue DMAs
> >      Internal state cannot change
> 
> cannot change... other that as loaded via migration region.

Yes

> The expected protocol is that if the user write to the device_state
> register returns an errno, the user reevaluates the device_state to
> determine if the desired transition is unavailable (previous state
> value is returned) or generated a fault (error state value
> returned).

Hmm, interesting, mlx5 should be doing this as well. Eg resuming with
corrupt state should fail and cannot be recovered except via reset.

> The 101b state indicates _RUNNING while _RESUMING, which is simply not
> a mode that has been spec'd at this time as it would require some
> mechanism for the device to fault in state on demand.

So lets error on these requests since we don't know what state to put
the device into.

> > The two actions:
> >  trigger serializing the device state
> >    Done when asked to go to 010b ?
> 
> When the _SAVING bit is set.  The exact mechanics depends on the size
> and volatility of the device state.  A GPU might begin in pre-copy
> (011b) to transmit chunks of framebuffer data, recording hashes of
> blocks read by the user to avoid re-sending them during the
> stop-and-copy (010b) phase.  

Here I am talking specifically about mlx5 which does not have a state
capture in pre-copy. So mlx5 should capture state on 010b only, and
the 011b is a NOP.

> >  trigger de-serializing the device state
> >    Done when transition from 100b -> 000b ?
> 
> 100b -> 000b is not a required transition, generally this would be 100b
> -> 001b, ie. end state of _RUNNING vs _STOP.

Sorry, I typo'd it, yes to _RUNNING

> I think the requirement is that de-serialization is complete when the
> _RESUMING bit is cleared.  Whether the driver chooses to de-serialize
> piece-wise as each block of data is written to the device or in bulk
> from a buffer is left to the implementation.  In either case, the
> driver can fail the transition to !_RESUMING if the state is incomplete
> or otherwise corrupt.  It would again be the driver's discretion if
> the device enters the error state or remains in _RESUMING.  If the user
> has no valid state with which to exit the _RESUMING phase, a device
> reset should return the device to _RUNNING with a default initial state.

That makes sense enough.

> > There is a missing state "Stop Active Transactions" which would be
> > only "does not issue DMAs". I've seen a proposal to add that.
> 
> This would be to get all devices to stop issuing DMA while internal
> state can be modified to avoid the synchronization issue of trying to
> stop devices concurrently?  

Yes, as above

> For PCI devices we obviously have the bus master bit to manage that,
> but I could see how a migration extension for such support (perhaps
> even just wired through to BM for PCI) could be useful.  Thanks,

I'm nervous to override the BM bit for something like this, the BM bit
isn't a gentle "please coherently stop what you are doing" it is a
hanbrake the OS pulls to ensure any PCI device becomes quiet.

Thanks,
Jason