On Thu, May 12, 2016 at 01:05:52PM -0600, Alex Williamson wrote:
> On Thu, 12 May 2016 08:00:36 +0000
> "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
> > > Sent: Thursday, May 12, 2016 6:06 AM
> > > 
> > > On Wed, 11 May 2016 17:15:15 +0800
> > > Jike Song <jike.song@xxxxxxxxx> wrote:
> > > 
> > > > On 05/11/2016 12:02 AM, Neo Jia wrote:
> > > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
> > > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
> > > > >>>> From: Song, Jike
> > > > >>>> 
> > > > >>>> IIUC, an api-only domain is a VFIO domain *without* underlying
> > > > >>>> IOMMU hardware. It just, as you said in another mail, "rather
> > > > >>>> than programming them into an IOMMU for a device, it simply
> > > > >>>> stores the translations for use by later requests".
> > > > >>>> 
> > > > >>>> That imposes a constraint on the gfx driver: the hardware IOMMU
> > > > >>>> must be disabled. Otherwise, if an IOMMU is present, the gfx
> > > > >>>> driver eventually programs the hardware IOMMU with the IOVA
> > > > >>>> returned by pci_map_page or dma_map_page; meanwhile, the IOMMU
> > > > >>>> backend for vgpu only maintains GPA <-> HPA translations without
> > > > >>>> any knowledge of the hardware IOMMU. How is the device model
> > > > >>>> supposed to get an IOVA for a given GPA (and thereby an HPA, via
> > > > >>>> the IOMMU backend here)?
> > > > >>>> 
> > > > >>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> > > > >>>> pins & translates vaddr to PFN; it will then be very difficult
> > > > >>>> for the device model to figure out:
> > > > >>>> 
> > > > >>>> 1. for a given GPA, how to avoid calling dma_map_page multiple
> > > > >>>>    times?
> > > > >>>> 2. for which page to call dma_unmap_page?
> > > > >>>> 
> > > > >>>> --
> > > > >>> 
> > > > >>> We have to support both the w/ IOMMU and w/o IOMMU cases, since
> > > > >>> that fact is out of the GPU driver's control.
> > > > >>> A simple way is to use dma_map_page, which internally copes with
> > > > >>> the w/ and w/o IOMMU cases gracefully, i.e. it returns an HPA w/o
> > > > >>> an IOMMU and an IOVA with one. Then in this file we only need to
> > > > >>> cache GPA to whatever dma_addr_t is returned by dma_map_page.
> > > > >>> 
> > > > >> 
> > > > >> Hi Alex, Kirti and Neo, any thoughts on the IOMMU compatibility
> > > > >> here?
> > > > > 
> > > > > Hi Jike,
> > > > > 
> > > > > With mediated passthru, you can still use a hardware iommu, but
> > > > > more importantly that part is actually orthogonal to what we are
> > > > > discussing here, as we will only cache the mapping between
> > > > > <gfn (iova if the guest has an iommu), (qemu) va>. Once we have
> > > > > pinned pages later with the help of the above info, you can map
> > > > > them into the proper iommu domain if the system is configured so.
> > > > > 
> > > > 
> > > > Hi Neo,
> > > > 
> > > > Technically yes, you can map a pfn into the proper IOMMU domain
> > > > elsewhere, but to find out whether a pfn was previously mapped or
> > > > not, you have to track it with another rbtree-like data structure
> > > > (the IOMMU driver simply doesn't bother with tracking), which seems
> > > > to duplicate the vGPU IOMMU backend we are discussing here.
> > > > 
> > > > And is it also semantically correct for an IOMMU backend to handle
> > > > both the w/ and w/o IOMMU hardware cases? :)
> > > 
> > > A problem with the iommu doing the dma_map_page() though is: for
> > > which device does it do this? In the mediated case the vfio
> > > infrastructure is dealing with a software representation of a device.
> > > For all we know, that software model could transparently migrate from
> > > one physical GPU to another. There may not even be a physical device
> > > backing the mediated device. Those are details left to the vgpu
> > > driver itself.
> > 
> > This is a fair argument. The VFIO iommu driver simply serves user
> > space requests, where only vaddr<->iova (essentially gpa in the kvm
> > case) matters.
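A rough userspace sketch of the caching Kevin describes above: key each GPA to whatever dma_addr_t dma_map_page() hands back, so the caller never branches on IOMMU presence, and a repeat map of the same GPA is answered from the cache (which is Jike's question 1). stub_dma_map_page, vgpu_map_gpa, and the fixed-size table are all invented for illustration; this is a model, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;

/* Toggle standing in for "is a hardware IOMMU present?". */
static bool iommu_enabled;

/* Stand-in for dma_map_page(): hands back the HPA when no IOMMU is
 * present, or a (fake) distinct IOVA when one is. */
static dma_addr_t stub_dma_map_page(uint64_t hpa)
{
    return iommu_enabled ? (hpa | 0xffff000000000000ull) : hpa;
}

struct gpa_dma_entry {
    uint64_t   gpa;
    dma_addr_t dma;
    bool       valid;
};

/* Tiny direct-mapped cache; collisions just evict. A real backend
 * would use an rb-tree or a page-table structure. */
#define CACHE_SLOTS 64
static struct gpa_dma_entry gpa_cache[CACHE_SLOTS];

/* Map a gpa once; repeat calls for the same gpa return the cached
 * dma_addr_t instead of calling dma_map_page again. *mapped counts
 * the real map calls so the caching is observable. */
static dma_addr_t vgpu_map_gpa(uint64_t gpa, uint64_t hpa, int *mapped)
{
    struct gpa_dma_entry *e = &gpa_cache[(gpa >> 12) % CACHE_SLOTS];

    if (e->valid && e->gpa == gpa)
        return e->dma;

    e->gpa = gpa;
    e->dma = stub_dma_map_page(hpa);
    e->valid = true;
    (*mapped)++;
    return e->dma;
}
```

The cached dma_addr_t also answers question 2: it is the exact value to pass to dma_unmap_page when the entry is torn down.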
> > How the iova is mapped into the real IOMMU is not VFIO's concern.
> > 
> > > Perhaps one possibility would be to allow the vgpu driver to
> > > register map and unmap callbacks. The unmap callback might provide
> > > the invalidation interface that we're so far missing. The
> > > combination of map and unmap callbacks might simplify the Intel
> > > approach of pinning the entire VM memory space, i.e. for each map
> > > callback do a translation (pin) and dma_map_page, and for each unmap
> > > do a dma_unmap_page and release the translation. There's still the
> > > problem of where that dma_addr_t from the dma_map_page is stored
> > > though. Someone would need to keep track of iova to dma_addr_t.
> > > The vfio iommu might be a place to do that since we're already
> > > tracking information based on iova, possibly in an opaque data
> > > element provided by the vgpu driver. However, we're going to need
> > > to take a serious look at whether an rb-tree is the right data
> > > structure for the job. It works well for the current type1
> > > functionality where we typically have tens of entries. I think the
> > > NVIDIA model of sparsely pinning the VM pushes that up to tens of
> > > thousands. If Intel intends to pin the entire guest, that's
> > > potentially tens of millions of tracked entries, and I don't know
> > > that an rb-tree is the right tool for that job. Thanks,
> > 
> > Based on the above, I'm wondering whether the following would work
> > (let's use gpa to replace the existing iova in the type1 driver, while
> > using iova for the one actually used in the vGPU driver; assume the
> > 'pin-all' scenario first, which matches existing vfio logic):
> > 
> > - No change to the existing vfio_dma structure. VFIO still maintains
> >   the gpa<->vaddr mapping, in coarse-grained regions;
> > 
> > - Leverage the same page accounting/pinning logic in the type1
> >   driver, which should be enough for 'pin-all' usage;
> > 
> > - Then the main divergence point for vGPU would be in vfio_unmap_unpin
> >   and vfio_iommu_map.
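To make Kevin's per-region bookkeeping concrete, here is a minimal model of a vgpu_dma entry: one (gpa_base, iova_base, size) triple per vfio_iommu_map call, so the entry count tracks the vfio_dma list rather than the guest page count. The names and the linear scan are illustrative only, and the scheme assumes the iova for a contiguous region is itself contiguous; the per-page worst case Alex raises is exactly what breaks that assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per vfio_iommu_map() call: because each call covers a
 * contiguous region, a single triple translates every gpa inside it. */
struct vgpu_dma {
    uint64_t gpa_base;
    uint64_t iova_base;
    uint64_t size;
};

/* Linear scan for illustration; the real list would be an rb-tree.
 * Returns 0 and fills *iova on a hit, -1 when the gpa is unmapped. */
static int vgpu_gpa_to_iova(const struct vgpu_dma *entries, size_t n,
                            uint64_t gpa, uint64_t *iova)
{
    for (size_t i = 0; i < n; i++) {
        const struct vgpu_dma *e = &entries[i];

        if (gpa >= e->gpa_base && gpa - e->gpa_base < e->size) {
            *iova = e->iova_base + (gpa - e->gpa_base);
            return 0;
        }
    }
    return -1;
}
```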
> > I'm not sure whether it's easy to fake an iommu_domain for vGPU so
> > that the same iommu_map/unmap can be reused.
> 
> This seems troublesome. Kirti's version used numerous api-only tests
> to avoid these, which made the code difficult to trace. Clearly one
> option is to split out the common code so that a new mediated-type1
> backend skips this, but they thought they could clean it up without
> this, so we'll see what happens in the next version.
> 
> > If not, we may introduce two new map/unmap callbacks provided
> > specifically by the vGPU core driver, as you suggested:
> > 
> > * The vGPU core driver uses dma_map_page to map specified pfns:
> > 
> >   o When the IOMMU is enabled, we'll get back an iova different from
> >     the pfn;
> >   o When the IOMMU is disabled, the returned iova is the same as the
> >     pfn;
> 
> Either way, each iova needs to be stored, and we have a worst case of
> one iova per page of guest memory.
> 
> > * Then the vGPU core driver just maintains its own gpa<->iova lookup
> >   table (e.g. called vgpu_dma)
> > 
> > * Because each vfio_iommu_map invocation is about a contiguous
> >   region, we can expect the same number of vgpu_dma entries as are
> >   maintained for the vfio_dma list;
> > 
> > Then it's the vGPU core driver's responsibility to provide gpa<->iova
> > lookup for the vendor-specific GPU driver. And we don't need to worry
> > about tens of thousands of entries. Once we get this simple 'pin-all'
> > model ready, it can be further extended to support the 'pin-sparse'
> > scenario. We'd still maintain a top-level vgpu_dma list, with each
> > entry further linking to its own sparse mapping structure. In reality
> > I don't expect we really need to maintain per-page translation even
> > with sparse pinning.
> 
> If you're trying to equate the scale of what we need to track vs. what
> type1 currently tracks, they're significantly different. Possible
> things we need to track include the pfn, the iova, and possibly a
> reference count or some sort of pinned page map.
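One hypothetical shape for the registered map/unmap callbacks discussed above; every name here is invented, and the dummy ops stand in for a vendor driver so the dispatch flow can be exercised.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;

/* Invented callback table the vgpu driver might register with a
 * mediated-type1 backend. */
struct vgpu_dma_ops {
    /* map: pin the backing pages and dma_map them, returning the iova
     * the device should use for this range. */
    int  (*map)(void *priv, uint64_t gpa, uint64_t vaddr,
                uint64_t size, dma_addr_t *iova);
    /* unmap: the invalidation hook that is so far missing;
     * dma_unmap and unpin the range. */
    void (*unmap)(void *priv, uint64_t gpa, uint64_t size);
};

/* Dummy vendor-driver implementation: it "pins" by bumping a counter
 * and fakes a translation by offsetting the gpa. */
static int dummy_map(void *priv, uint64_t gpa, uint64_t vaddr,
                     uint64_t size, dma_addr_t *iova)
{
    (void)vaddr; (void)size;
    *(int *)priv += 1;
    *iova = gpa + 0x100000000ull;
    return 0;
}

static void dummy_unmap(void *priv, uint64_t gpa, uint64_t size)
{
    (void)gpa; (void)size;
    *(int *)priv -= 1;
}

static const struct vgpu_dma_ops dummy_ops = {
    .map   = dummy_map,
    .unmap = dummy_unmap,
};
```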
> In the pin-all model we can assume that every page is pinned on map
> and unpinned on unmap, so a reference count or map is unnecessary. We
> can also assume that we can always regenerate the pfn with
> get_user_pages() from the vaddr, so we don't need to track that.

Hi Alex,

Thanks for pointing this out. We will not track those in our next rev,
and get_user_pages will be used from the vaddr, as you suggested, to
handle the single-VM case with both passthru + mediated devices.

Thanks,
Neo

> I don't see any way around tracking the iova. The iommu can't tell us
> this like it can with the normal type1 model, because the pfn is the
> result of the translation, not the key for the translation. So we're
> always going to have between 1 and (size/PAGE_SIZE) iova entries per
> vgpu_dma entry. You might be able to manage the vgpu_dma entries with
> an rb-tree, but each vgpu_dma entry needs some data structure tracking
> every iova.
> 
> Sparse mapping has the same issue, but of course the tree of iovas is
> potentially incomplete and we need a way to determine where it's
> incomplete. A page table rooted in the vgpu_dma and indexed by the
> offset from the start vaddr seems like the way to go here. It's also
> possible that some mediated device models might store the iova in the
> command sent to the device and therefore be able to parse those
> entries back out to unmap them without storing them separately. This
> might be how the s390 channel-io model would prefer to work. That
> seems like further validation that such tracking is going to be
> dependent on the mediated driver itself, and probably not something to
> centralize in a mediated iommu driver. Thanks,
> 
> Alex
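A toy model of the page table Alex sketches for sparse pinning: rooted in the vgpu_dma, indexed by the page offset from start_vaddr, with absent entries marking the unpinned holes. The two-level layout, sizes, and all names are assumptions for illustration, not an actual kernel structure.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t dma_addr_t;

#define PAGE_SHIFT 12
#define PT_BITS    9                      /* 512 entries per level */
#define PT_ENTRIES (1u << PT_BITS)

/* Leaf level: one iova slot per page, with a presence flag so holes
 * (unpinned pages) are distinguishable from iova value 0. */
struct pt_leaf {
    dma_addr_t iova[PT_ENTRIES];
    uint8_t    present[PT_ENTRIES];
};

/* Two levels of 512 entries cover 2^18 pages, i.e. 1GB per vgpu_dma
 * at 4K pages; leaves are allocated lazily as pages get pinned. */
struct vgpu_dma {
    uint64_t        start_vaddr;
    struct pt_leaf *dir[PT_ENTRIES];
};

static int vgpu_dma_set_iova(struct vgpu_dma *d, uint64_t vaddr,
                             dma_addr_t iova)
{
    uint64_t off = (vaddr - d->start_vaddr) >> PAGE_SHIFT;
    uint64_t hi = off >> PT_BITS, lo = off & (PT_ENTRIES - 1);

    if (hi >= PT_ENTRIES)
        return -1;
    if (!d->dir[hi] && !(d->dir[hi] = calloc(1, sizeof(struct pt_leaf))))
        return -1;
    d->dir[hi]->iova[lo] = iova;
    d->dir[hi]->present[lo] = 1;
    return 0;
}

/* Returns 0 and fills *iova if the page is pinned; -1 for a hole,
 * which is how the backend learns where the tree is incomplete. */
static int vgpu_dma_get_iova(struct vgpu_dma *d, uint64_t vaddr,
                             dma_addr_t *iova)
{
    uint64_t off = (vaddr - d->start_vaddr) >> PAGE_SHIFT;
    uint64_t hi = off >> PT_BITS, lo = off & (PT_ENTRIES - 1);

    if (hi >= PT_ENTRIES || !d->dir[hi] || !d->dir[hi]->present[lo])
        return -1;
    *iova = d->dir[hi]->iova[lo];
    return 0;
}
```

Unlike an rb-tree keyed by iova, the lookup cost here is flat and the memory cost is proportional to the populated leaves, which fits the sparse-pinning case where most of the guest is never mapped.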