On Wed, Dec 08, 2021 at 09:06:47AM -0700, Alex Williamson wrote:
> On Tue, 7 Dec 2021 11:51:45 -0400
> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> 
> > On Tue, Dec 07, 2021 at 12:16:32PM +0100, Cornelia Huck wrote:
> > > On Mon, Dec 06 2021, Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> > > 
> > > > On Mon, Dec 06, 2021 at 07:06:35PM +0100, Cornelia Huck wrote:
> > > > 
> > > >> We're discussing a complex topic here, and we really don't want to
> > > >> perpetuate an unclear uAPI. This is where my push for more precise
> > > >> statements is coming from.
> > > > 
> > > > I appreciate that, and I think we've made a big effort toward that
> > > > direction.
> > > > 
> > > > Can we have some crisp feedback which statements need SHOULD/MUST/MUST
> > > > NOT and come to something?
> > > 
> > > I'm not sure what I should actually comment on, some general remarks:
> > 
> > You should comment on the paragraphs that prevent you from adding a
> > reviewed-by.
> > 
> > > - If we consider a possible vfio-ccw implementation that will quiesce
> > >   the device and not rely on tracking I/O, we need to make the parts
> > >   that talk about tracking non-mandatory.
> > 
> > I'm not sure what you mean by 'tracking I/O'?
> > 
> > I thought we were good on ccw?
> > 
> > > - NDMA sounds like something that needs to be non-mandatory as well.
> > 
> > I agree, Alex are we agreed now ?
> 
> No. When last we left our thread, you seemed to be suggesting QEMU
> maintains two IOMMU domains, ie. containers, the first of which would
> include p2p mappings for all PCI devices, the second would include no
> p2p mappings. Device supporting NDMA get attached to the former,
> non-NDMA devices the latter.

I think it is another valid solution, yes, as are the ideas you
identified.

What you said earlier is also OK: qemu can just permit a guest to
mangle its migration and not do anything. I don't like this, but as
policy it is something userspace could reasonably choose to do.

The question here, right now, is what should the kernel do?

 - NDMA mandatory/not? Given we know Huawei cannot implement it, I
   vote for not mandatory.

 - Should we expose some other flag, eg your 'fragile mmio', to help
   userspace form a policy? Might be useful, but it can also be
   defined when someone creates a userspace policy that requires it.

Either way the kernel is just reflecting HW capability to userspace,
I'm not seeing how this can be the wrong thing for the kernel to do??

> So some devices can access all devices via p2p DMA, other devices can
> access none. Are there any bare metal systems that expose such
> asymmetric p2p constraints? I'm not inclined to invent new p2p
> scenarios that only exist in VMs.

Migration on baremetal is a bit odd, but all the rules are the same.
The baremetal must reach the P2P grace state to avoid corruption, and
if it can rely on internal knowledge of how it operates the devices it
can avoid things like NDMA or multiple domains entirely.

> In addition to creating this asymmetric topology, forcing QEMU to
> maintain two containers not only increases the overhead,

Yes, this policy choice has downsides. I outlined a few options where
qemu can avoid the overheads. Other choices have different downsides.

I think part of the reason we keep going around this topic is I'm
offering options that a VMM could do, and I can't much help you decide
what matches qemu's policy framework. Frankly I think it is very
operator specific.
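For reference, the two-container split described above might look
something like this from the VMM side. This is only a rough sketch, not
a definitive implementation, and the ndma_capable input is assumed to
come from some yet-to-be-defined per-device capability report:

/*
 * Rough sketch of the two-container layout discussed above. The
 * ndma_capable flag is assumed to come from a future per-device
 * capability report; no such uAPI exists today, which is part of
 * what is being debated here.
 */
#include <fcntl.h>
#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

struct vm_containers {
	int with_p2p;	/* guest RAM plus every device's MMIO BARs mapped */
	int ram_only;	/* guest RAM only, no p2p mappings */
};

/*
 * Devices that can guarantee NDMA go in the container that has the p2p
 * MMIO ranges mapped; everything else goes in the RAM-only container.
 */
static int place_group(struct vm_containers *c, const char *group_path,
		       bool ndma_capable)
{
	int container = ndma_capable ? c->with_p2p : c->ram_only;
	int group = open(group_path, O_RDWR);

	if (group < 0)
		return -1;
	if (ioctl(group, VFIO_GROUP_SET_CONTAINER, &container) < 0)
		return -1;
	/*
	 * VFIO_SET_IOMMU and the VFIO_IOMMU_MAP_DMA calls that populate
	 * the two containers (RAM into both, MMIO only into with_p2p)
	 * happen once per container and are elided here.
	 */
	return group;
}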
> but doubles the locked memory requirements for QEMU since our
> current locked memory accounting is unfortunately per container.

You asked for that to be fixed in iommufd and I've got that almost done
now. With iommufd we can now have two domains with only the overhead of
the actual io page table entries under the 2nd iommu_domain. When VDPA
uses iommufd as well it should allow qemu to be single pin for
everything.

> Then we need to also consider that multi-device groups exist where a
> group can only be attached to one container

It is a problem, yes, but again qemu could adopt a policy of not
supporting multi-group migration devices, if it wants.

> and also vIOMMU cases where presumably we'd only get these
> dual-containers when multiple groups are attached to a container.

Not sure what this means? vIOMMU should operate both containers in
unison, I'd think?

> Maybe also worth noting that we cannot atomically move a device
> between containers, due to both vfio and often IOMMU constraints
> afaik.

I don't think we need to move, it is a permanent setup?

> So it still seems like the only QEMU policy that we could manage to
> document and support would require that non-mandatory NDMA support
> implies that migration cannot be enabled by default for any vfio device
> and that enabling migration sets in motion a binary policy regarding
> p2p mappings across the VM. I'm still not convinced how supportable
> that is, but I can at least imagine explicit per device options that
> need to align.

If this is what the qemu community wants, OK.

> I don't know if lack of NDMA on ccw was Connie's reasoning for making
> NDMA non-mandatory, but it seems like NDMA is only relevant to buses
> that support DMA, so AIUI it would be just as valid for ccw devices to
> report NDMA as a no-op.

ccw seems to support DMA, it calls vfio_pin_pages() after all? That is
DMA in this definition.

NDMA here was defined in terms of *any* DMA, not just DMA to certain
addresses, so I expect ccw will probably say it does not support NDMA
for many of the same reasons Huawei can't do it.

So, I think Connie is right to be worried about a direction to require
NDMA as a pre-condition for a migration driver. I suspect something
like CCW is very much migratable with a method that relies on no mmio
touches, much like the Huawei driver does. In the S390 case I believe
"no mmio touches" means 'whatever that thing inside kvm that causes the
device to work' is.

BTW, it is not just VFIO, but VDPA too. Currently VDPA cannot
participate in the P2P mappings so it acts like the dual domain
solution, but with IOMMUFD I expect VDPA to be able to access the P2P
mappings and so it must also support NDMA somehow. Future things like
the "netgpu" patches make it reasonable to combine VDPA and P2P in a
guest.

> > I honestly don't know why this is such a discussion point, beyond
> > being a big oversight of the original design.
> 
> In my case, because it's still not clearly a universal algorithm, yet
> it's being proposed as one. We already discussed that {!}NDMA
> placement is fairly arbitrary and looking at v3 I'm wondering how a
> RESUMING -> SAVING|!RUNNING transition works.

Ok, I've been struggling since I wrote this to figure out how to
validate the precedence algorithmically, and I think I see it now.

Minimally, the precedence must be ordered so we never transit through
an invalid state. Ie, if userspace asks for

  RESUMING -> SAVING | !RUNNING

then the start and end states are valid. However, we cannot transit
through the invalid state 'RESUMING | SAVING | !RUNNING' on the way to
get there.

It means the driver cannot do RESUMING first, and is really the
underlying reason why we found we needed precedence in the first place.
So, my original design idea is just *completely* wrong.
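To make that concrete, here is a throwaway illustration (not driver
code). It uses the device_state bit values and the invalid combinations
from the existing header comment, and just shows that for this exact
transition the order of the single-bit steps decides whether we pass
through an invalid state:

/*
 * Toy program: walk RESUMING -> SAVING|!RUNNING one bit at a time in
 * both possible orders and report whether the intermediate state is
 * one of the invalid combinations from the current device_state table
 * (RESUMING must not be combined with RUNNING or SAVING).
 */
#include <stdbool.h>
#include <stdio.h>

#define RUNNING  (1u << 0)	/* VFIO_DEVICE_STATE_RUNNING */
#define SAVING   (1u << 1)	/* VFIO_DEVICE_STATE_SAVING */
#define RESUMING (1u << 2)	/* VFIO_DEVICE_STATE_RESUMING */

static bool state_valid(unsigned int s)
{
	return !((s & RESUMING) && (s & (RUNNING | SAVING)));
}

int main(void)
{
	unsigned int old = RESUMING;	/* 100b, target is SAVING (010b) */

	/* Order A: set SAVING before clearing RESUMING -> 110b */
	printf("set SAVING first:     intermediate is %s\n",
	       state_valid(old | SAVING) ? "valid" : "INVALID");

	/* Order B: clear RESUMING before setting SAVING -> 000b */
	printf("clear RESUMING first: intermediate is %s\n",
	       state_valid(old & ~RESUMING) ? "valid" : "INVALID");
	return 0;
}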
Since we allow 0->ANY the better solution is to go toward 0 first,
then go away from 0:

 - !NDMA
 - !SAVING
 - !RUNNING
 - !RESUMING
 - RESUMING
 - RUNNING
 - SAVING
 - NDMA

and even this I can't tell by inspection if it is OK or not - I'd need
a program to explore all the combinations.

Basically, this is too hard. I can possibly figure it out now, but if
someone wants to add another state someday they *will* screw it up.

How do you feel about simply forbidding this corner case entirely?
Require that new_state ^ old_state is one hot. I have to check, but I
think this is incompatible with current qemu. Maybe we can allow &
document the few combinations qemu uses? A core function can enforce
this consistently for drivers. ??
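If we went that way the core check is trivial. A minimal sketch, with a
made-up function name, of what such a shared helper could look like:

/*
 * Sketch only: a common helper the core could use to reject any
 * device_state write that changes more than one bit, so a driver never
 * has to reason about intermediate states. The name is illustrative,
 * this is not an existing kernel function.
 */
#include <linux/bitops.h>
#include <linux/errno.h>
#include <linux/types.h>

static int vfio_validate_one_hot(u32 old_state, u32 new_state)
{
	u32 changed = old_state ^ new_state;

	/* No-op writes are fine */
	if (!changed)
		return 0;

	/* Otherwise exactly one bit may differ between old and new */
	if (hweight32(changed) != 1)
		return -EINVAL;

	return 0;
}

Each driver would then only ever see single-bit steps, outside of
whatever multi-bit combinations we decide to grandfather for qemu.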
(frankly, I have decided I absolutely hate this device_state bit field
as a uAPI choice, the fact we have spent so much time on this is just
*grrrrrr*)

> accessed directly. We can have a valid, interoperable uAPI without
> constraining ourselves to a specific implementation. Largely I think
> that trying to impose an implementation as the specification is the
> source of our friction.

I agree, but please accept it is not easy to understand what is spec
and what is qemu. It is part of why I want to write this down.

> This is a bit infuriating, responding to it at all is probably ill
> advised. We're all investing a lot of time into this. We're all
> disappointed how the open source use case of the previous
> implementation fell apart and nobody else stepped up until now.
> Rubbing salt in that wound is not helpful or productive.

Sorry, I'm not trying to rub salt in a wound - but it is maddening that
we are so far down this path and we seem to be back at square one,
debating exactly what the existing uAPI even is.

> Regardless, this implementation has highlighted gaps in the initial
> design and it's critical that those known gaps are addressed before we
> commit to the design with an in-kernel driver. Referring to the notes
> Connie copied from etherpad, those gaps include uAPI clarification
> regarding various device states and accesses allowed in those states,
> definition of a quiescent (NDMA) device state, discussion of per-device
> dirty state, and documentation such as userspace usage and edge
> cases.

Okay, can you summarize your view of where we are on all these points?
I really don't know anymore. I think the document addresses all of
them.

> Only the latter items were specifically requested outside of the header
> and previously provided comments questioned if we're not actually
> creating contradictory documentation to the uAPI and why clarifications
> are not applied to the existing uAPI descriptions.

As I said, it is simply too much text, there is no simple fix to make
the header file documentation completely clear. If you have some idea
how to fix it please share.

I don't think the new docs are contradictory because the header file
doesn't really say that much. At least I haven't noticed anything.

Yishai's patch to add NDMA should extend the header comment a little
bit, I'll check on that.

> Personally I'm a bit disappointed to see v3 posted where the diffstat
> indicates no uAPI updates, so we actually have no formal definition of
> this NDMA state,

I didn't realize this is something you felt was important for this RFC
patch.

The purpose of the RFC patch is to get some overall mutual agreement on
the uAPI protocol as discussed in the RST text.

The non-RFC version of this RST will come with the next mlx5 driver v6
posting and the implementation of NDMA for mlx5 and all the cap bits
Yishai built for it. This is already coded - we have only refrained
from posting a v6 out of respect for everyone's time to re-look at all
the mlx5 code again if we are still debating big topics in this text.

If you recall, I said we'd like to go ahead with mlx5 as-is and you
were very concerned about the implication of the NDMA problem. Thus
this document was prepared to explain our view of NDMA and more, plus
we went and updated mlx5 to fully implement it so we will not come back
to this topic later in the kernel.

> get there. Isn't that how we got into this situation, approving the
> uAPI, or in this case documentation, without an in-kernel
> implementation and vetted userspace? Thanks,

I'm certainly not proposing we do that, please don't misunderstand.

Jason