Re: [PATCH V2 mlx5-next 12/14] vfio/mlx5: Implement vfio_pci driver for mlx5 devices

Alex Williamson <alex.williamson@xxxxxxxxxx> · Tue, 16 Nov 2021 10:57:36 -0700

On Mon, 15 Nov 2021 19:29:21 -0400
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Fri, Nov 05, 2021 at 09:31:45AM -0600, Alex Williamson wrote:
> > On Fri, 5 Nov 2021 10:24:04 -0300
> > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> >   
> > > On Wed, Nov 03, 2021 at 12:04:11PM -0600, Alex Williamson wrote:
> > >   
> > > > We agreed that it's easier to add a feature than a restriction in a
> > > > uAPI, so how do we resolve that some future device may require a new
> > > > state in order to apply the SET_IRQS configuration?    
> > > 
> > > I would say don't support those devices. If there is even a hint that
> > > they could maybe exist then we should fix it now. Once the uapi is set
> > > and documented we should expect device makers to consider it when
> > > building their devices.
> > > 
> > > As for SET_IRQs, I have been looking at making documentation and I
> > > don't like the way the documentation has to be wrriten because of
> > > this.
> > > 
> > > What I see as an understandable, clear, documentation is:
> > > 
> > >  - SAVING set - no device touches allowed beyond migration operations
> > >    and reset via XX  
> > 
> > I'd suggest defining reset via ioctl only.
> >   
> > >    Must be set with !RUNNING  
> > 
> > Not sure what this means.  Pre-copy requires SAVING and RUNNING
> > together, is this only suggesting that to get the final device state we
> > need to do so in a !RUNNING state?  
> 
> Sorry, I did not think about pre-copy here, mlx5 doesn't do it so I'm
> not as familiar
> 
> > >  - RESUMING set - same as SAVING  
> > 
> > I take it then that we're defining a new protocol if we can't do
> > SET_IRQS here.  
> 
> We've been working on some documentation and one of the challenges
> turns out that all the PCI device state owned by other subsystems (eg
> the PCI core, the interrupt code, power management, etc) must be kept
> in sync. No matter what RESUMING cannot just async change device state
> that the kernel assumes it is controlling.
> 
> So, in practice, this necessarily requires forbidding the device from
> touching the MSI table, and other stuff, during RESUMING.
> 
> Further, since we can't just halt all the other kernel subsystems
> during SAVING/RESUMING the device must be able to accept touches in
> those areas, for completely unrelated reasons, (eg a MSI addr/data
> being changed) safely.
> 
> Seems like no need to change SET_IRQs.
> 
> 
> > >  - NDMA set - full device touches
> > >    Device may not issue DMA or interrupts (??)
> > >    Device may not dirty pages  
> > 
> > Is this achievable?  We can't bound the time where incoming DMA is
> > possible, devices don't have infinite buffers.  
> 
> It is a necessary evil for migration. 
> 
> The device cannot know how long it will be suspended for and must
> cope. With networking discarded packets can be resent, but the reality
> is that real deployments need a QOS that the device will not be paused
> for too long otherwise the peers may declare the node dead.
> 
> > > Not entirely, to support P2P going from RESUMING directly to RUNNING
> > > is not possible. There must be an in between state that all devices
> > > reach before they go to RUNNING. It seems P2P cannot be bolted into
> > > the existing qmeu flow with a kernel only change?  
> > 
> > Perhaps, yes.  
> 
> We have also been looking at dirty tracking and we are wondering how
> that should work. (Dirty tracking will be another followup)
> 
> If we look at mlx5, it will have built in dirty tracking, and when
> used with a newer IOMMUs there is also system dirty tracking
> available.
> 
> I think userspace should decide if it wants to use mlx5 built in or
> the system IOMMU to do dirty tracking.

What information does userspace use to inform such a decision?
Ultimately userspace just wants the finest granularity of tracking,
shouldn't that guide our decisions which to provide?

> Presumably the system IOMMU is turned on via
> VFIO_IOMMU_DIRTY_PAGES_FLAG_START, but what controls if the mlx5
> mechanism should be used or not?
> 
> mlx5 also has no way to return the dirty log. If the system IOMMU is
> not used then VFIO_IOMMU_DIRTY_PAGES_FLAG_START should not be done,
> however that is what controls all the logic under the two GET_BITMAP
> APIs. (even if fixed I don't really like the idea of the IOMMU
> extracting this data from the migration driver in the context of
> iommufd)
> 
> Further how does mlx5 even report that it has dirty tracking?
> 
> Was there some plan here we are missing?

I believe the intended progression of dirty tracking is that by default
all mapped ranges are dirty.  If the device supports page pinning, then
we reduce the set of dirty pages to those pages which are pinned.  A
device that doesn't otherwise need page pinning, such as a fully IOMMU
backed device, would use gratuitous page pinning triggered by the
_SAVING state activation on the device.  It sounds like mlx5 could use
this existing support today.

We had also discussed variants to page pinning that might be more
useful as device dirty page support improves.  For example calls to
mark pages dirty once rather than the perpetual dirtying of pinned
pages, calls to pin pages for read vs write, etc.  We didn't dive much
into system IOMMU dirtying, but presumably we'd have a fault handler
triggered if a page is written by the device and go from there.

> In light of all this I'm wondering if device dirty tracking should
> exist as new ioctls on the device FD and reserve the type1 code to
> only work the IOMMU dirty tracking.

Our existing model is working towards the IOMMU, ie. container,
interface aggregating dirty page context.  For example when page
pinning is used, it's only when all devices within the container are
using page pinning that we can report the pinned subset as dirty.
Otherwise userspace needs to poll each device, which I suppose enables
your idea that userspace decides which source to use, but why?  Does
the IOMMU dirty page tracking exclude devices if the user queries the
device separately?  How would it know?  What's the advantage?  It seems
like this creates too many support paths that all need to converge on
the same answer.  Consolidating DMA dirty page tracking to the DMA
mapping interface for all devices within a DMA context makes more sense
to me.  Thanks,

Alex