Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

Alex Williamson <alex.williamson@xxxxxxxxxx> · Thu, 12 May 2016 13:05:52 -0600

On Thu, 12 May 2016 08:00:36 +0000
"Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:

> > From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
> > Sent: Thursday, May 12, 2016 6:06 AM
> > 
> > On Wed, 11 May 2016 17:15:15 +0800
> > Jike Song <jike.song@xxxxxxxxx> wrote:
> >   
> > > On 05/11/2016 12:02 AM, Neo Jia wrote:  
> > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
> > > >>>> From: Song, Jike
> > > >>>>
> > > >>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> > > >>>> hardware. It just, as you said in another mail, "rather than
> > > >>>> programming them into an IOMMU for a device, it simply stores the
> > > >>>> translations for use by later requests".
> > > >>>>
> > > >>>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
> > > >>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
> > > >>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> > > >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> > > >>>> translations without any knowledge about hardware IOMMU, how is the
> > > >>>> device model supposed to do to get an IOVA for a given GPA (thereby HPA
> > > >>>> by the IOMMU backend here)?
> > > >>>>
> > > >>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> > > >>>> pin & translate vaddr to PFN, then it will be very difficult for the
> > > >>>> device model to figure out:
> > > >>>>
> > > >>>> 	1, for a given GPA, how to avoid calling dma_map_page multiple times?
> > > >>>> 	2, for which page to call dma_unmap_page?
> > > >>>>
> > > >>>> --  
> > > >>>
> > > >>> We have to support both w/ iommu and w/o iommu case, since
> > > >>> that fact is out of GPU driver control. A simple way is to use
> > > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > > >>> Then in this file we only need to cache GPA to whatever dmadr_t
> > > >>> returned by dma_map_page.
> > > >>>  
> > > >>
> > > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?  
> > > >
> > > > Hi Jike,
> > > >
> > > > With mediated passthru, you still can use hardware iommu, but more important
> > > > that part is actually orthogonal to what we are discussing here as we will only
> > > > cache the mapping between <gfn (iova if guest has iommu), (qemu) va>, once we
> > > > have pinned pages later with the help of above info, you can map it into the
> > > > proper iommu domain if the system has configured so.
> > > >  
> > >
> > > Hi Neo,
> > >
> > > Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
> > > but to find out whether a pfn was previously mapped or not, you have to
> > > track it with another rbtree-alike data structure (the IOMMU driver simply
> > > doesn't bother with tracking), that seems somehow duplicate with the vGPU
> > > IOMMU backend we are discussing here.
> > >
> > > And it is also semantically correct for an IOMMU backend to handle both w/
> > > and w/o an IOMMU hardware? :)  
> > 
> > A problem with the iommu doing the dma_map_page() though is for what
> > device does it do this?  In the mediated case the vfio infrastructure
> > is dealing with a software representation of a device.  For all we
> > know that software model could transparently migrate from one physical
> > GPU to another.  There may not even be a physical device backing
> > the mediated device.  Those are details left to the vgpu driver itself.  
> 
> This is a fair argument. VFIO iommu driver simply serves user space
> requests, where only vaddr<->iova (essentially gpa in kvm case) is
> mattered. How iova is mapped into real IOMMU is not VFIO's interest.
> 
> > 
> > Perhaps one possibility would be to allow the vgpu driver to register
> > map and unmap callbacks.  The unmap callback might provide the
> > invalidation interface that we're so far missing.  The combination of
> > map and unmap callbacks might simplify the Intel approach of pinning the
> > entire VM memory space, ie. for each map callback do a translation
> > (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> > the translation.  There's still the problem of where that dma_addr_t
> > from the dma_map_page is stored though.  Someone would need to keep
> > track of iova to dma_addr_t.  The vfio iommu might be a place to do
> > that since we're already tracking information based on iova, possibly
> > in an opaque data element provided by the vgpu driver.  However, we're
> > going to need to take a serious look at whether an rb-tree is the right
> > data structure for the job.  It works well for the current type1
> > functionality where we typically have tens of entries.  I think the
> > NVIDIA model of sparse pinning the VM is pushing that up to tens of
> > thousands.  If Intel intends to pin the entire guest, that's
> > potentially tens of millions of tracked entries and I don't know that
> > an rb-tree is the right tool for that job.  Thanks,
> >   
> 
> Based on above thought I'm thinking whether below would work:
> (let's use gpa to replace existing iova in type1 driver, while using iova
> for the one actually used in vGPU driver. Assume 'pin-all' scenario first
> which matches existing vfio logic)
> 
> - No change to existing vfio_dma structure. VFIO still maintains gpa<->vaddr
> mapping, in coarse-grained regions;
> 
> - Leverage same page accounting/pinning logic in type1 driver, which 
> should be enough for 'pin-all' usage;
> 
> - Then main divergence point for vGPU would be in vfio_unmap_unpin
> and vfio_iommu_map. I'm not sure whether it's easy to fake an 
> iommu_domain for vGPU so same iommu_map/unmap can be reused.

This seems troublesome.  Kirti's version used numerous api-only tests
to avoid these which made the code difficult to trace.  Clearly one
option is to split out the common code so that a new mediated-type1
backend skips this, but they thought they could clean it up without
this, so we'll see what happens in the next version.

> If not, we may introduce two new map/unmap callbacks provided
> specifically by vGPU core driver, as you suggested:
> 
> 	* vGPU core driver uses dma_map_page to map specified pfns:
> 
> 		o When IOMMU is enabled, we'll get an iova returned different
> from pfn;
> 		o When IOMMU is disabled, returned iova is same as pfn;

Either way each iova needs to be stored and we have a worst case of one
iova per page of guest memory.

> 	* Then vGPU core driver just maintains its own gpa<->iova lookup
> table (e.g. called vgpu_dma)
> 
> 	* Because each vfio_iommu_map invocation is about a contiguous 
> region, we can expect same number of vgpu_dma entries as maintained 
> for vfio_dma list;
>
> Then it's vGPU core driver's responsibility to provide gpa<->iova
> lookup for vendor specific GPU driver. And we don't need worry about
> tens of thousands of entries. Once we get this simple 'pin-all' model
> ready, then it can be further extended to support 'pin-sparse'
> scenario. We still maintain a top-level vgpu_dma list with each entry to
> further link its own sparse mapping structure. In reality I don't expect
> we really need to maintain per-page translation even with sparse pinning.

If you're trying to equate the scale of what we need to track vs what
type1 currently tracks, they're significantly different.  Possible
things we need to track include the pfn, the iova, and possibly a
reference count or some sort of pinned page map.  In the pin-all model
we can assume that every page is pinned on map and unpinned on unmap,
so a reference count or map is unnecessary.  We can also assume that we
can always regenerate the pfn with get_user_pages() from the vaddr, so
we don't need to track that.  I don't see any way around tracking the
iova.  The iommu can't tell us this like it can with the normal type1
model because the pfn is the result of the translation, not the key for
the translation. So we're always going to have between 1 and
(size/PAGE_SIZE) iova entries per vgpu_dma entry.  You might be able to
manage the vgpu_dma with an rb-tree, but each vgpu_dma entry needs some
data structure tracking every iova.

Sparse mapping has the same issue but of course the tree of iovas is
potentially incomplete and we need a way to determine where it's
incomplete.  A page table rooted in the vgpu_dma and indexed by the
offset from the start vaddr seems like the way to go here.  It's also
possible that some mediated device models might store the iova in the
command sent to the device and therefore be able to parse those entries
back out to unmap them without storing them separately.  This might be
how the s390 channel-io model would prefer to work.  That seems like
further validation that such tracking is going to be dependent on the
mediated driver itself and probably not something to centralize in a
mediated iommu driver.  Thanks,

Alex
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html