Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

Alex Williamson <alex.williamson@xxxxxxxxxx> · Wed, 11 May 2016 16:06:28 -0600

On Wed, 11 May 2016 17:15:15 +0800
Jike Song <jike.song@xxxxxxxxx> wrote:

> On 05/11/2016 12:02 AM, Neo Jia wrote:
> > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
> >>>> From: Song, Jike
> >>>>
> >>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> >>>> hardware. It just, as you said in another mail, "rather than
> >>>> programming them into an IOMMU for a device, it simply stores the
> >>>> translations for use by later requests".
> >>>>
> >>>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
> >>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
> >>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> >>>> translations without any knowledge of the hardware IOMMU. How is the
> >>>> device model supposed to get an IOVA for a given GPA (and thereby the
> >>>> HPA provided by the IOMMU backend here)?
> >>>>
> >>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> >>>> pins and translates vaddr to PFN, then it will be very difficult for the
> >>>> device model to figure out:
> >>>>
> >>>> 	1, for a given GPA, how to avoid calling dma_map_page multiple times?
> >>>> 	2, for which page to call dma_unmap_page?
> >>>>
> >>>> --  
> >>>
> >>> We have to support both w/ iommu and w/o iommu case, since
> >>> that fact is out of GPU driver control. A simple way is to use
> >>> dma_map_page which internally will cope with w/ and w/o iommu
> >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> >>> Then in this file we only need to cache GPA to whatever dma_addr_t
> >>> is returned by dma_map_page.
> >>>  
> >>
> >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?  
> > 
> > Hi Jike,
> > 
> > With mediated passthru, you can still use the hardware iommu, but more
> > importantly, that part is actually orthogonal to what we are discussing
> > here, as we will only cache the mapping between <gfn (iova if guest has
> > iommu), (qemu) va>. Once we have pinned pages later with the help of the
> > above info, you can map them into the proper iommu domain if the system
> > is configured that way.
> >  
> 
> Hi Neo,
> 
> Technically yes, you can map a pfn into the proper IOMMU domain elsewhere,
> but to find out whether a pfn was previously mapped or not, you have to
> track it with another rbtree-like data structure (the IOMMU driver simply
> doesn't bother with tracking), which seems to duplicate the vGPU
> IOMMU backend we are discussing here.
> 
> And is it also semantically correct for an IOMMU backend to handle both
> the case with IOMMU hardware and the case without? :)

A problem with the iommu doing the dma_map_page(), though, is which
device it would do this for.  In the mediated case the vfio infrastructure
is dealing with a software representation of a device.  For all we
know that software model could transparently migrate from one physical
GPU to another.  There may not even be a physical device backing
the mediated device.  Those are details left to the vgpu driver itself.

Perhaps one possibility would be to allow the vgpu driver to register
map and unmap callbacks.  The unmap callback might provide the
invalidation interface that we're so far missing.  The combination of
map and unmap callbacks might simplify the Intel approach of pinning the
entire VM memory space, i.e. for each map callback do a translation
(pin) and dma_map_page, for each unmap do a dma_unmap_page and release
the translation.  There's still the problem of where that dma_addr_t
from the dma_map_page is stored though.  Someone would need to keep
track of iova to dma_addr_t.  The vfio iommu might be a place to do
that since we're already tracking information based on iova, possibly
in an opaque data element provided by the vgpu driver.  However, we're
going to need to take a serious look at whether an rb-tree is the right
data structure for the job.  It works well for the current type1
functionality where we typically have tens of entries.  I think the
NVIDIA model of sparsely pinning the VM is pushing that up to tens of
thousands.  If Intel intends to pin the entire guest, that's
potentially tens of millions of tracked entries and I don't know that
an rb-tree is the right tool for that job.  Thanks,

Alex
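
A rough userspace sketch of the map/unmap callback idea discussed above,
for illustration only.  Nothing here is existing VFIO or vgpu API: the
names vgpu_map_ops, tracker_map, tracker_unmap, model_dma_addr_t and the
demo_* stand-in driver are all hypothetical.  A chained hash table is
swapped in for the rb-tree purely as one possible alternative to compare
against, not as a claim that it is the right structure.  It only shows
how an iova -> dma_addr_t table could keep the driver's map callback
from being invoked twice for the same page and could tell unmap which
address to release.

```c
/* Userspace model only; all names are hypothetical, not kernel API. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t iova_t;
typedef uint64_t model_dma_addr_t;	/* stands in for dma_addr_t */

/* Callbacks a vgpu driver might register (assumed interface). */
struct vgpu_map_ops {
	/* Pin + map one page; return 0 and fill *dma on success. */
	int (*map)(void *drv, iova_t iova, model_dma_addr_t *dma);
	/* Unmap + unpin; would also serve as the invalidation hook. */
	void (*unmap)(void *drv, iova_t iova, model_dma_addr_t dma);
};

struct track_entry {
	iova_t iova;
	model_dma_addr_t dma;
	struct track_entry *next;
};

#define TRACK_BUCKETS 1024u

struct tracker {
	const struct vgpu_map_ops *ops;
	void *drv;			/* opaque driver data */
	size_t count;			/* number of tracked pages */
	struct track_entry *bucket[TRACK_BUCKETS];
};

/* Return the link pointing at the entry for iova, or the chain's
 * terminating NULL link if iova is not tracked. */
static struct track_entry **tracker_slot(struct tracker *t, iova_t iova)
{
	struct track_entry **pp = &t->bucket[(iova >> 12) % TRACK_BUCKETS];

	while (*pp && (*pp)->iova != iova)
		pp = &(*pp)->next;
	return pp;
}

/* Map iova once; repeated calls return the cached address. */
int tracker_map(struct tracker *t, iova_t iova, model_dma_addr_t *dma)
{
	struct track_entry **pp = tracker_slot(t, iova);
	struct track_entry *e = *pp;

	assert(t->ops && t->ops->map);
	if (!e) {
		e = malloc(sizeof(*e));
		if (!e)
			return -1;
		if (t->ops->map(t->drv, iova, &e->dma)) {
			free(e);
			return -1;
		}
		e->iova = iova;
		e->next = NULL;
		*pp = e;		/* append at the chain tail */
		t->count++;
	}
	*dma = e->dma;
	return 0;
}

/* Release the translation for iova; -1 if it was never mapped. */
int tracker_unmap(struct tracker *t, iova_t iova)
{
	struct track_entry **pp = tracker_slot(t, iova);
	struct track_entry *e = *pp;

	if (!e)
		return -1;
	t->ops->unmap(t->drv, e->iova, e->dma);
	*pp = e->next;
	free(e);
	t->count--;
	return 0;
}

/* Stand-in "driver" that only counts calls and fakes an address. */
struct demo_drv { unsigned map_calls, unmap_calls; };

static int demo_map(void *drv, iova_t iova, model_dma_addr_t *dma)
{
	((struct demo_drv *)drv)->map_calls++;
	*dma = iova + 0x100000000ull;	/* arbitrary fake bus address */
	return 0;
}

static void demo_unmap(void *drv, iova_t iova, model_dma_addr_t dma)
{
	(void)iova;
	(void)dma;
	((struct demo_drv *)drv)->unmap_calls++;
}

const struct vgpu_map_ops demo_ops = { .map = demo_map, .unmap = demo_unmap };
```

A caller would zero-initialise a struct tracker, point ops and drv at the
driver's callbacks and private data, and then call tracker_map() and
tracker_unmap() per page.  The sketch keeps no reference count, so a
single unmap drops the translation regardless of how many times the page
was mapped, and it does nothing about locking or about the memory cost
at tens of millions of entries.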