RE: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

"Tian, Kevin" <kevin.tian@xxxxxxxxx> · Fri, 13 May 2016 03:55:09 +0000

> From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
> Sent: Friday, May 13, 2016 3:06 AM
> 
> > >
> >
> > Based on above thought I'm thinking whether below would work:
> > (let's use gpa to replace existing iova in type1 driver, while using iova
> > for the one actually used in vGPU driver. Assume 'pin-all' scenario first
> > which matches existing vfio logic)
> >
> > - No change to existing vfio_dma structure. VFIO still maintains gpa<->vaddr
> > mapping, in coarse-grained regions;
> >
> > - Leverage same page accounting/pinning logic in type1 driver, which
> > should be enough for 'pin-all' usage;
> >
> > - Then main divergence point for vGPU would be in vfio_unmap_unpin
> > and vfio_iommu_map. I'm not sure whether it's easy to fake an
> > iommu_domain for vGPU so same iommu_map/unmap can be reused.
> 
> This seems troublesome.  Kirti's version used numerous api-only tests
> to avoid these which made the code difficult to trace.  Clearly one
> option is to split out the common code so that a new mediated-type1
> backend skips this, but they thought they could clean it up without
> this, so we'll see what happens in the next version.
> 
> > If not, we may introduce two new map/unmap callbacks provided
> > specifically by vGPU core driver, as you suggested:
> >
> > 	* vGPU core driver uses dma_map_page to map specified pfns:
> >
> > 		o When IOMMU is enabled, we'll get an iova returned different
> > from pfn;
> > 		o When IOMMU is disabled, returned iova is same as pfn;
> 
> Either way each iova needs to be stored and we have a worst case of one
> iova per page of guest memory.
> 
> > 	* Then vGPU core driver just maintains its own gpa<->iova lookup
> > table (e.g. called vgpu_dma)
> >
> > 	* Because each vfio_iommu_map invocation is about a contiguous
> > region, we can expect same number of vgpu_dma entries as maintained
> > for vfio_dma list;
> >
> > Then it's vGPU core driver's responsibility to provide gpa<->iova
> > lookup for vendor specific GPU driver. And we don't need worry about
> > tens of thousands of entries. Once we get this simple 'pin-all' model
> > ready, then it can be further extended to support 'pin-sparse'
> > scenario. We still maintain a top-level vgpu_dma list with each entry to
> > further link its own sparse mapping structure. In reality I don't expect
> > we really need to maintain per-page translation even with sparse pinning.
> 
> If you're trying to equate the scale of what we need to track vs what
> type1 currently tracks, they're significantly different.  Possible
> things we need to track include the pfn, the iova, and possibly a
> reference count or some sort of pinned page map.  In the pin-all model
> we can assume that every page is pinned on map and unpinned on unmap,
> so a reference count or map is unnecessary.  We can also assume that we
> can always regenerate the pfn with get_user_pages() from the vaddr, so
> we don't need to track that.  I don't see any way around tracking the
> iova.  The iommu can't tell us this like it can with the normal type1
> model because the pfn is the result of the translation, not the key for
> the translation. So we're always going to have between 1 and
> (size/PAGE_SIZE) iova entries per vgpu_dma entry.  You might be able to
> manage the vgpu_dma with an rb-tree, but each vgpu_dma entry needs some
> data structure tracking every iova.

There is one option. We may use alloc_iova to reserve a contiguous iova
range for each vgpu_dma range, and then use iommu_map/unmap to write
the iommu ptes later, upon map request. The number of entries could then
be the same as for vfio_dma, compared to the unbounded number of entries
when using dma_map_page. Of course this needs to be done in the vGPU
core driver, since vfio type1 only sees a faked iommu domain.
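
To make the idea concrete, here is a rough userspace sketch (not kernel
code) of reserving one contiguous iova range per region and deriving
the iova by offset. The struct layout, the toy iova_pool allocator, and
the vgpu_dma_reserve/vgpu_dma_gpa_to_iova helpers are all hypothetical
names for illustration; the pool only stands in for whatever alloc_iova
would return, and no iommu ptes are written here.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* Hypothetical per-region record: one entry per contiguous guest range. */
struct vgpu_dma {
	uint64_t gpa;       /* start of the guest-physical range */
	uint64_t size;      /* length in bytes, page aligned */
	uint64_t iova_base; /* start of the reserved contiguous iova range */
};

/*
 * Toy stand-in for an iova allocator: hands out contiguous ranges from
 * a bump pointer. This is an assumption for the sketch, not the
 * kernel's alloc_iova(). Callers must keep next <= limit.
 */
struct iova_pool {
	uint64_t next;
	uint64_t limit;
};

/* Reserve a contiguous iova range covering the whole region. */
static int vgpu_dma_reserve(struct iova_pool *pool, struct vgpu_dma *dma,
			    uint64_t gpa, uint64_t size)
{
	assert((gpa & (PAGE_SIZE - 1)) == 0);
	assert((size & (PAGE_SIZE - 1)) == 0);

	if (size == 0 || size > pool->limit - pool->next)
		return -1;

	dma->gpa = gpa;
	dma->size = size;
	dma->iova_base = pool->next;
	pool->next += size;
	return 0;
}

/*
 * Translate gpa to iova purely by offset, so no per-page entry is
 * stored. Returns 0 and fills *iova on success, -1 if gpa is outside
 * the region.
 */
static int vgpu_dma_gpa_to_iova(const struct vgpu_dma *dma, uint64_t gpa,
				uint64_t *iova)
{
	if (gpa < dma->gpa || gpa - dma->gpa >= dma->size)
		return -1;

	*iova = dma->iova_base + (gpa - dma->gpa);
	return 0;
}
```

With this layout the lookup cost and the storage are per region, not
per page, which is the point of reserving the range up front.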

> 
> Sparse mapping has the same issue but of course the tree of iovas is
> potentially incomplete and we need a way to determine where it's
> incomplete.  A page table rooted in the vgpu_dma and indexed by the
> offset from the start vaddr seems like the way to go here.  It's also
> possible that some mediated device models might store the iova in the
> command sent to the device and therefore be able to parse those entries
> back out to unmap them without storing them separately.  This might be
> how the s390 channel-io model would prefer to work.  That seems like
> further validation that such tracking is going to be dependent on the
> mediated driver itself and probably not something to centralize in a
> mediated iommu driver.  Thanks,
> 

Another, simpler way might be to allocate an array for each memory
region registered from user space. For a 512MB region with 4KB pages,
that means a 128K*4=512KB array (4 bytes per entry) to track the pfn
or iova mapping corresponding to each gfn. It may consume more memory
than an rb-tree when not many pages need to be pinned, but could
consume less when the rb-tree grows large.
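
A rough userspace sketch of what I have in mind (not kernel code). The
region_map struct, the helper names, the 4-byte entry width, and the
ENTRY_UNMAPPED sentinel are all assumptions for illustration; 4KB pages
are assumed as well. region_map_bytes() gives the array size for a
given region size.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SHIFT     12
#define ENTRY_UNMAPPED UINT32_MAX /* hypothetical "not pinned" marker */

/* Hypothetical flat table: one 4-byte slot per gfn in the region. */
struct region_map {
	uint64_t gfn_base; /* first gfn covered by the region */
	uint64_t npages;   /* number of slots */
	uint32_t *entries; /* pfn or iova page number, indexed by gfn offset */
};

/* Bytes needed to track a region of 'size' bytes. */
static size_t region_map_bytes(uint64_t size)
{
	return (size_t)(size >> PAGE_SHIFT) * sizeof(uint32_t);
}

/* Allocate the table and mark every slot as unmapped. */
static int region_map_init(struct region_map *m, uint64_t gpa, uint64_t size)
{
	uint64_t i;

	if (size == 0)
		return -1;

	m->gfn_base = gpa >> PAGE_SHIFT;
	m->npages = size >> PAGE_SHIFT;
	m->entries = malloc(region_map_bytes(size));
	if (!m->entries)
		return -1;

	for (i = 0; i < m->npages; i++)
		m->entries[i] = ENTRY_UNMAPPED;
	return 0;
}

/* Record the translation for one gfn. Returns -1 if out of range. */
static int region_map_set(struct region_map *m, uint64_t gfn, uint32_t val)
{
	if (gfn < m->gfn_base || gfn - m->gfn_base >= m->npages)
		return -1;

	m->entries[gfn - m->gfn_base] = val;
	return 0;
}

/* O(1) lookup; returns ENTRY_UNMAPPED if out of range or never set. */
static uint32_t region_map_get(const struct region_map *m, uint64_t gfn)
{
	if (gfn < m->gfn_base || gfn - m->gfn_base >= m->npages)
		return ENTRY_UNMAPPED;

	return m->entries[gfn - m->gfn_base];
}

static void region_map_free(struct region_map *m)
{
	free(m->entries);
	m->entries = NULL;
	m->npages = 0;
}
```

The trade-off is a fixed cost per registered region regardless of how
many pages are actually pinned, in exchange for constant-time lookup
and no per-node overhead.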

Is such an array-based approach considered ugly in the kernel? :-)

Thanks
Kevin