Re: [PATCH V2 mlx5-next 12/14] vfio/mlx5: Implement vfio_pci driver for mlx5 devices

Jason Gunthorpe <jgg@xxxxxxxxxx> · Tue, 16 Nov 2021 15:25:05 -0400

On Tue, Nov 16, 2021 at 10:57:36AM -0700, Alex Williamson wrote:

> > I think userspace should decide if it wants to use mlx5 built in or
> > the system IOMMU to do dirty tracking.
> 
> What information does userspace use to inform such a decision?

Kernel can't know which approach performs better. Operators should
benchmark and make a choice for their deployment HW. Maybe device
tracking severely impacts device performance or vice versa.

Kernel doesn't easily know what userspace has done, maybe one device
supports migration driver dirty tracking and one device does not.

Is user space going to use a system IOMMU for both devices? 

Is it going to put the simple device in NDMA early and continue to
dirty track to shutdown the other devices?

> Ultimately userspace just wants the finest granularity of tracking,
> shouldn't that guide our decisions which to provide?

At least for mlx5 there is going to some trade off curve of device
performance, dirty tracking page size, and working set.

Even lower is better is not necessarily true. After overheads on a
400GB RDMA NIC there is not such a big difference between doing a 4k
and 16k scatter transfer. The CPU work to process all the extra bitmap
data may not be a net win compared to block transfer times.

Conversly someone doing 1G TCP transfers probably cares a lot to
minimize block size.

Overall, I think there is far too much up in the air and unmeasured to
firmly commit the kernel to a fixed policy.

So, I would like to see userspace control most of the policy aspects,
including the dirty track provider.

> I believe the intended progression of dirty tracking is that by default
> all mapped ranges are dirty.  If the device supports page pinning, then
> we reduce the set of dirty pages to those pages which are pinned.  A
> device that doesn't otherwise need page pinning, such as a fully IOMMU

How does userspace know if dirty tracking works or not? All I see
VFIO_IOMMU_DIRTY_PAGES_FLAG_START unconditionally allocs some bitmaps.

I'm surprised it doesn't check that only NO_IOMMU's devices are
attached to the container and refuse to dirty track otherwise - since
it doesn't work..

> backed device, would use gratuitous page pinning triggered by the
> _SAVING state activation on the device.  It sounds like mlx5 could use
> this existing support today.

How does mlx5 know if it should turn on its dirty page tracking on
SAVING or if the system IOMMU covers it? Or for some reason userspace
doesn't want dirty tracking but is doing pre-copy?

When we mix dirty track with pre-copy, the progression seems to be:

  DITRY TRACKING | RUNNING
     Copy every page to the remote
  DT | SAVING | RUNNING
     Copy pre-copy migration data to the remote
  SAVING | NDMA | RUNNING
     Read and clear dirty track device bitmap
  DT | SAVING | RUNNING
     Copy new dirtied data
     (maybe loop back to NDMA a few times?)
  SAVING | NDMA | RUNNING
     P2P grace state
  0
    Read the dirty track and copy data
    Read and send the migration state

Can we do something so complex using only SAVING?

.. and along the lines of the above how do we mix in NDMA to the iommu
container, and how does it work if only some devices support NDMA?

> We had also discussed variants to page pinning that might be more
> useful as device dirty page support improves.  For example calls to
> mark pages dirty once rather than the perpetual dirtying of pinned
> pages, calls to pin pages for read vs write, etc.  We didn't dive much
> into system IOMMU dirtying, but presumably we'd have a fault handler
> triggered if a page is written by the device and go from there.

Would be interesting to know for sure what current IOMMU HW has
done. I'm supposing the easiest implementation is to write a dirty bit
to the IO PTE the same as the CPU writes a dirty bit the normal PTE.

> > In light of all this I'm wondering if device dirty tracking should
> > exist as new ioctls on the device FD and reserve the type1 code to
> > only work the IOMMU dirty tracking.
> 
> Our existing model is working towards the IOMMU, ie. container,
> interface aggregating dirty page context.  

This creates inefficiencies in the kernel, we copy from the mlx5
formed data structure to new memory in the iommu through a very
ineffficent API and then again we do an ioctl to copy it once more and
throw all the extra work away. It does not seem good for something
where we want performance.

> For example when page pinning is used, it's only when all devices
> within the container are using page pinning that we can report the
> pinned subset as dirty.  Otherwise userspace needs to poll each
> device, which I suppose enables your idea that userspace decides
> which source to use, but why?

Efficiency, and user selectable policy.

Userspace can just allocate an all zeros bitmap and feed it to each of
the providers in the kernel using a 'or in your dirty' semantic.

No redundant kernel data marshaling, userspace gets to decide which
tracking provider to use, and it is simple to implement in the kernel.

Userspace has to do this anyhow if it has configurations with multiple
containers. For instance because it was forced to split the containers
due to one device not supporting NDMA.

> Does the IOMMU dirty page tracking exclude devices if the user
> queries the device separately?  

What makes sense to me is multiple tracking providers. Each can be
turned on and off.

If the container tracking provider says it supports tracking then it
means it can track DMA from every device it is connected to (unlike
today?). eg by using IOMMU HW that naturally does this, or by only
having only NO_IOMMU devices.

If the migration driver says it supports tracking, then it only tracks
DMA from that device.

> How would it know?  What's the advantage?  It seems like this
> creates too many support paths that all need to converge on the same
> answer.  Consolidating DMA dirty page tracking to the DMA mapping
> interface for all devices within a DMA context makes more sense to
> me.

What I see is a lot of questions and limitations with this
approach. If we stick to funneling everything through the iommu then
answering the questions seem to create a large amount of kernel
work. Enough to ask if it is worthwhile..

.. and then we have to ask how does this all work in IOMMUFD where it
is not so reasonable to tightly couple the migration driver and the
IOAS and I get more questions :)

Jason