Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

Neo Jia <cjia@xxxxxxxxxx> · Fri, 13 May 2016 02:05:01 -0700

On Fri, May 13, 2016 at 04:39:37PM +0800, Dong Jia wrote:
> On Fri, 13 May 2016 00:24:34 -0700
> Neo Jia <cjia@xxxxxxxxxx> wrote:
> 
> > On Fri, May 13, 2016 at 03:10:22PM +0800, Dong Jia wrote:
> > > On Thu, 12 May 2016 13:05:52 -0600
> > > Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
> > > 
> > > > On Thu, 12 May 2016 08:00:36 +0000
> > > > "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
> > > > 
> > > > > > From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
> > > > > > Sent: Thursday, May 12, 2016 6:06 AM
> > > > > > 
> > > > > > On Wed, 11 May 2016 17:15:15 +0800
> > > > > > Jike Song <jike.song@xxxxxxxxx> wrote:
> > > > > >   
> > > > > > > On 05/11/2016 12:02 AM, Neo Jia wrote:  
> > > > > > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> > > > > > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
> > > > > > > >>>> From: Song, Jike
> > > > > > > >>>>
> > > > > > > >>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> > > > > > > >>>> hardware. It just, as you said in another mail, "rather than
> > > > > > > >>>> programming them into an IOMMU for a device, it simply stores the
> > > > > > > >>>> translations for use by later requests".
> > > > > > > >>>>
> > > > > > > >>>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
> > > > > > > >>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
> > > > > > > >>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> > > > > > > >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> > > > > > > >>>> translations without any knowledge about hardware IOMMU, how is the
> > > > > > > >>>> device model supposed to do to get an IOVA for a given GPA (thereby HPA
> > > > > > > >>>> by the IOMMU backend here)?
> > > > > > > >>>>
> > > > > > > >>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> > > > > > > >>>> pin & translate vaddr to PFN, then it will be very difficult for the
> > > > > > > >>>> device model to figure out:
> > > > > > > >>>>
> > > > > > > >>>> 	1, for a given GPA, how to avoid calling dma_map_page multiple times?
> > > > > > > >>>> 	2, for which page to call dma_unmap_page?
> > > > > > > >>>>
> > > > > > > >>>> --  
> > > > > > > >>>
> > > > > > > >>> We have to support both w/ iommu and w/o iommu case, since
> > > > > > > >>> that fact is out of GPU driver control. A simple way is to use
> > > > > > > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > > > > > > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > > > > > > >>> Then in this file we only need to cache GPA to whatever dmadr_t
> > > > > > > >>> returned by dma_map_page.
> > > > > > > >>>  
> > > > > > > >>
> > > > > > > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?  
> > > > > > > >
> > > > > > > > Hi Jike,
> > > > > > > >
> > > > > > > > With mediated passthru, you still can use hardware iommu, but more important
> > > > > > > > that part is actually orthogonal to what we are discussing here as we will only
> > > > > > > > cache the mapping between <gfn (iova if guest has iommu), (qemu) va>, once we
> > > > > > > > have pinned pages later with the help of above info, you can map it into the
> > > > > > > > proper iommu domain if the system has configured so.
> > > > > > > >  
> > > > > > >
> > > > > > > Hi Neo,
> > > > > > >
> > > > > > > Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
> > > > > > > but to find out whether a pfn was previously mapped or not, you have to
> > > > > > > track it with another rbtree-alike data structure (the IOMMU driver simply
> > > > > > > doesn't bother with tracking), that seems somehow duplicate with the vGPU
> > > > > > > IOMMU backend we are discussing here.
> > > > > > >
> > > > > > > And it is also semantically correct for an IOMMU backend to handle both w/
> > > > > > > and w/o an IOMMU hardware? :)  
> > > > > > 
> > > > > > A problem with the iommu doing the dma_map_page() though is for what
> > > > > > device does it do this?  In the mediated case the vfio infrastructure
> > > > > > is dealing with a software representation of a device.  For all we
> > > > > > know that software model could transparently migrate from one physical
> > > > > > GPU to another.  There may not even be a physical device backing
> > > > > > the mediated device.  Those are details left to the vgpu driver itself.  
> > > > > 
> > > > > This is a fair argument. VFIO iommu driver simply serves user space
> > > > > requests, where only vaddr<->iova (essentially gpa in kvm case) is
> > > > > mattered. How iova is mapped into real IOMMU is not VFIO's interest.
> > > > > 
> > > > > > 
> > > > > > Perhaps one possibility would be to allow the vgpu driver to register
> > > > > > map and unmap callbacks.  The unmap callback might provide the
> > > > > > invalidation interface that we're so far missing.  The combination of
> > > > > > map and unmap callbacks might simplify the Intel approach of pinning the
> > > > > > entire VM memory space, ie. for each map callback do a translation
> > > > > > (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> > > > > > the translation.  There's still the problem of where that dma_addr_t
> > > > > > from the dma_map_page is stored though.  Someone would need to keep
> > > > > > track of iova to dma_addr_t.  The vfio iommu might be a place to do
> > > > > > that since we're already tracking information based on iova, possibly
> > > > > > in an opaque data element provided by the vgpu driver.  However, we're
> > > > > > going to need to take a serious look at whether an rb-tree is the right
> > > > > > data structure for the job.  It works well for the current type1
> > > > > > functionality where we typically have tens of entries.  I think the
> > > > > > NVIDIA model of sparse pinning the VM is pushing that up to tens of
> > > > > > thousands.  If Intel intends to pin the entire guest, that's
> > > > > > potentially tens of millions of tracked entries and I don't know that
> > > > > > an rb-tree is the right tool for that job.  Thanks,
> > > > > >   
> > > > > 
> > > > > Based on above thought I'm thinking whether below would work:
> > > > > (let's use gpa to replace existing iova in type1 driver, while using iova
> > > > > for the one actually used in vGPU driver. Assume 'pin-all' scenario first
> > > > > which matches existing vfio logic)
> > > > > 
> > > > > - No change to existing vfio_dma structure. VFIO still maintains gpa<->vaddr
> > > > > mapping, in coarse-grained regions;
> > > > > 
> > > > > - Leverage same page accounting/pinning logic in type1 driver, which 
> > > > > should be enough for 'pin-all' usage;
> > > > > 
> > > > > - Then main divergence point for vGPU would be in vfio_unmap_unpin
> > > > > and vfio_iommu_map. I'm not sure whether it's easy to fake an 
> > > > > iommu_domain for vGPU so same iommu_map/unmap can be reused.
> > > > 
> > > > This seems troublesome.  Kirti's version used numerous api-only tests
> > > > to avoid these which made the code difficult to trace.  Clearly one
> > > > option is to split out the common code so that a new mediated-type1
> > > > backend skips this, but they thought they could clean it up without
> > > > this, so we'll see what happens in the next version.
> > > > 
> > > > > If not, we may introduce two new map/unmap callbacks provided
> > > > > specifically by vGPU core driver, as you suggested:
> > > > > 
> > > > > 	* vGPU core driver uses dma_map_page to map specified pfns:
> > > > > 
> > > > > 		o When IOMMU is enabled, we'll get an iova returned different
> > > > > from pfn;
> > > > > 		o When IOMMU is disabled, returned iova is same as pfn;
> > > > 
> > > > Either way each iova needs to be stored and we have a worst case of one
> > > > iova per page of guest memory.
> > > > 
> > > > > 	* Then vGPU core driver just maintains its own gpa<->iova lookup
> > > > > table (e.g. called vgpu_dma)
> > > > > 
> > > > > 	* Because each vfio_iommu_map invocation is about a contiguous 
> > > > > region, we can expect same number of vgpu_dma entries as maintained 
> > > > > for vfio_dma list;
> > > > >
> > > > > Then it's vGPU core driver's responsibility to provide gpa<->iova
> > > > > lookup for vendor specific GPU driver. And we don't need worry about
> > > > > tens of thousands of entries. Once we get this simple 'pin-all' model
> > > > > ready, then it can be further extended to support 'pin-sparse'
> > > > > scenario. We still maintain a top-level vgpu_dma list with each entry to
> > > > > further link its own sparse mapping structure. In reality I don't expect
> > > > > we really need to maintain per-page translation even with sparse pinning.
> > > > 
> > > > If you're trying to equate the scale of what we need to track vs what
> > > > type1 currently tracks, they're significantly different.  Possible
> > > > things we need to track include the pfn, the iova, and possibly a
> > > > reference count or some sort of pinned page map.  In the pin-all model
> > > > we can assume that every page is pinned on map and unpinned on unmap,
> > > > so a reference count or map is unnecessary.  We can also assume that we
> > > > can always regenerate the pfn with get_user_pages() from the vaddr, so
> > > > we don't need to track that.  I don't see any way around tracking the
> > > > iova.  The iommu can't tell us this like it can with the normal type1
> > > > model because the pfn is the result of the translation, not the key for
> > > > the translation. So we're always going to have between 1 and
> > > > (size/PAGE_SIZE) iova entries per vgpu_dma entry.  You might be able to
> > > > manage the vgpu_dma with an rb-tree, but each vgpu_dma entry needs some
> > > > data structure tracking every iova.
> > > > 
> > > > Sparse mapping has the same issue but of course the tree of iovas is
> > > > potentially incomplete and we need a way to determine where it's
> > > > incomplete.  A page table rooted in the vgpu_dma and indexed by the
> > > > offset from the start vaddr seems like the way to go here.  It's also
> > > > possible that some mediated device models might store the iova in the
> > > > command sent to the device and therefore be able to parse those entries
> > > > back out to unmap them without storing them separately.  This might be
> > > > how the s390 channel-io model would prefer to work.
> > > Dear Alex:
> > > 
> > > For the s390 channel-io model, when an I/O instruction was intercepted
> > > and issued to the device driver for further translation, the operand of
> > > the instruction contents iovas only. Since iova is the key to locate an
> > > entry in the database (r-b tree or whatever), we do can parse the
> > > entries back out one by one when doing the unmap operation.
> > >                  ^^^^^^^^^^
> > > 
> > > BTW, if the mediated-iommu backend can potentially offer a transaction
> > > level support for the unmap operation, I believe it will benefit the
> > > performance for this case.
> > > 
> > > e.g.:
> > > handler = vfio_trasaction_begin();
> > > foreach(iova in the command) {
> > >     pfn = vfio_trasaction_map(handler, iova);
> > >     do_io(pfn);
> > > }
> > 
> > Hi Dong,
> > 
> > Could you please help me understand the performance benefit here? 
> > 
> > Is the perf argument coming from the searching of rbtree of the tracking data
> > structure or something else?
> > 
> > For example you can do similar thing by the following sequence from your backend
> > driver:
> > 
> >     vfio_pin_pages(gfn_list/iova_list /* in */, npages, prot, pfn_bases /* out */)
> >     foreach (pfn)
> >         do_io(pfn)
> >     vfio_unpin_pages(pfn_bases)
> Dear Neo:
> 
> FWIU, the channel-io driver could leverage these interfaces without
> obvious feasibility issues. Since the implementation of the current
> vfio_unpin_pages iterates @pfn_bases and find the corresponding entry
> from the rb tree for each of the pfn_base, I'm wondering if a dedicated
> list of the entries for the whole @pfn_bases could offer us some
> benefits. I have to say that I know it's too early to consider the perf
> , and the current interfaces are fine for the channel-io case.

Hi Dong,

We should definitely be mindful about the data structure performance especially
dealing with kernel. But for now, we haven't done any performance analysis yet
for the current rbtree implementation, later we will definitely run it through
large guest RAM configuration and multiple virtual devices cases, etc. to
collect data.

Regarding your use case, may I ask if there will be concurrent command streams
running for the same VM? If yes, those two transaction requests (if we
implement) will compete not only the rbtree lock but also the GUP locks.

Also, what is the typical guest RAM we are talking about here for your usecase
and any rough estimation of the active working set of those DMA pages? 

> 
> I'm also wondering if such an idea could contribute a little to your
> discussion of the management of the key-value mapping issue. If this is
> just a noise (sorry for that :<), please ignore it.
> 
> My major intention is to show up, and to elaborate a bit about the
> channel-io use case. So you will see that, there is really another user
> of the mediate-iommu backend, and as Alex mentioned before, getting rid
> of the 'vgpu_dev' and other vgpu specific stuff is indeed necessary. :>

Definitely, we are changing the module/variable names to reflect this general
purpose already. 

Thanks,
Neo

> 
> > 
> > Thanks,
> > Neo
> > 
> > > 
> > > /*
> > >  * Expect to unmap all of the pfns mapped in this trasaction with the
> > >  * next statement. The mediated-iommu backend could use handler as the
> > >  * key to track the list of the entries.
> > >  */
> > > vfio_trasaction_unmap(handler);
> > > vfio_trasaction_end(handler);
> > > 
> > > Not sure if this could benefit the vgpu sparse mapping use case though.
> > 
> > 
> > 
> > 
> > 
> > > 
> > > >  That seems like
> > > > further validation that such tracking is going to be dependent on the
> > > > mediated driver itself and probably not something to centralize in a
> > > > mediated iommu driver.  Thanks,
> > > > 
> > > > Alex
> > > > 
> > > 
> > > 
> > > 
> > > --------
> > > Dong Jia
> > > 
> > 
> 
> 
> 
> --------
> Dong Jia
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html