RE: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state

"Tian, Kevin" <kevin.tian@xxxxxxxxx> · Thu, 6 Jan 2022 06:32:57 +0000

> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Wednesday, January 5, 2022 8:46 PM
> 
> On Wed, Jan 05, 2022 at 01:59:31AM +0000, Tian, Kevin wrote:
> 
> > > This will block the hypervisor from ever migrating the VM in a very
> > > poor way - it will just hang in the middle of a migration request.
> >
> > it's poor but 'hang' won't happen. PCI spec defines completion timeout
> > for ATS translation request. If timeout the device will abort the in-fly
> > request and report error back to software.
> 
> The PRI time outs have to be long enough to handle swap back from
> disk, so 'hang' will be a fair amount of time..

This reminds me of one interesting point.

Putting PRI aside, the time to drain in-flight requests is undefined. It depends
on how many pending requests must complete before the device can finish the
draining command. This is IP specific (e.g. whether the IP supports
preemption) and also guest specific (e.g. whether the guest is actively
submitting work).

So even without hostile attempts, the draining time may exceed what a
user can tolerate during live migration.

This suggests that some software timeout mechanism might be necessary
when transitioning to the NDMA state, with the timeout value optionally
configurable by the user. If the timeout expires, the state transition
request fails.

Once such a mechanism is in place, PRI is automatically covered, as it
is just one more implicit factor that may increase the draining time.
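
To make the idea concrete, here is a rough userspace sketch of a bounded
drain. It is only an illustration under stated assumptions, not a proposal
for the actual uAPI or driver code: struct fake_dev, fake_dev_tick() and
enter_ndma_with_timeout() are all hypothetical names, and the device is
modelled as simply retiring a fixed number of requests per millisecond.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device model; none of these names exist in the VFIO uAPI. */
struct fake_dev {
	unsigned int inflight;      /* outstanding DMA requests */
	unsigned int retire_per_ms; /* requests the device completes per ms */
	bool ndma;                  /* true once the device has stopped DMA */
};

/* Simulate one millisecond of device progress. */
static void fake_dev_tick(struct fake_dev *d)
{
	if (d->inflight > d->retire_per_ms)
		d->inflight -= d->retire_per_ms;
	else
		d->inflight = 0;
}

/*
 * Try to enter NDMA, waiting at most timeout_ms for in-flight requests
 * to drain. Returns 0 on success. Returns -ETIMEDOUT if the budget is
 * exhausted first, in which case the ndma flag is left unchanged so the
 * caller sees a failed state transition.
 */
static int enter_ndma_with_timeout(struct fake_dev *d, unsigned int timeout_ms)
{
	unsigned int elapsed_ms = 0;

	while (d->inflight > 0) {
		if (elapsed_ms >= timeout_ms)
			return -ETIMEDOUT;
		fake_dev_tick(d);
		elapsed_ms++;
	}

	d->ndma = true;
	return 0;
}
```

The caller does not need to know why draining was slow (a busy guest, no
preemption support, or a pending page request); any of them just shows up
as -ETIMEDOUT, and userspace can then retry or abort the migration.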

> 
> > > Regardless of the complaints of the IP designers, this is a very poor
> > > direction.
> > >
> > > Progress in the hypervisor should never be contingent on a guest VM.
> > >
> >
> > Whether the said DOS is a real concern and how severe it is are usage
> > specific things. Why would we want to hardcode such restriction on
> > an uAPI? Just give the choice to the admin (as long as this restriction is
> > clearly communicated to userspace clearly)...
> 
> IMHO it is not just DOS, PRI can become dependent on IO which requires
> DMA to complete.
> 
> You could quickly get yourself into a deadlock situation where the
> hypervisor has disabled DMA activities of other devices and the vPRI
> simply cannot be completed.

How is this related to PRI, which is only about address translation?

Instead, the above is a general p2p problem for any draining operation. How
to solve it needs to be defined clearly for this NDMA state (which I suppose
is being discussed between you and Alex; I still need time to catch
up).

> 
> I just don't see how this scheme is generally workable without a lot
> of limitations.
> 
> While I do agree we should support the HW that exists, we should
> recognize this is not a long term workable design and treat it as
> such.
> 

Definitely agree with this point. We software people should continue
influencing IP designers toward a long-term software-friendly design,
and also bear in mind that it takes time... 😊

Thanks
Kevin