On Fri, 13 May 2016 03:55:09 +0000
"Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:

> > From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
> > Sent: Friday, May 13, 2016 3:06 AM
> >
> > > Based on the above thoughts, I'm wondering whether the below would
> > > work: (let's use gpa to replace the existing iova in the type1
> > > driver, while using iova for the one actually used in the vGPU
> > > driver. Assume the 'pin-all' scenario first, which matches the
> > > existing vfio logic)
> > >
> > > - No change to the existing vfio_dma structure. VFIO still
> > >   maintains the gpa<->vaddr mapping, in coarse-grained regions;
> > >
> > > - Leverage the same page accounting/pinning logic in the type1
> > >   driver, which should be enough for 'pin-all' usage;
> > >
> > > - The main divergence point for vGPU would then be in
> > >   vfio_unmap_unpin and vfio_iommu_map. I'm not sure whether it's
> > >   easy to fake an iommu_domain for vGPU so that the same
> > >   iommu_map/unmap can be reused.
> >
> > This seems troublesome.  Kirti's version used numerous API-only
> > tests to avoid these, which made the code difficult to trace.
> > Clearly one option is to split out the common code so that a new
> > mediated-type1 backend skips this, but they thought they could
> > clean it up without this, so we'll see what happens in the next
> > version.
> >
> > > If not, we may introduce two new map/unmap callbacks provided
> > > specifically by the vGPU core driver, as you suggested:
> > >
> > > * The vGPU core driver uses dma_map_page to map the specified
> > >   pfns:
> > >
> > >   o When the IOMMU is enabled, we'll get an iova returned that
> > >     differs from the pfn;
> > >   o When the IOMMU is disabled, the returned iova is the same as
> > >     the pfn;
> >
> > Either way each iova needs to be stored and we have a worst case of
> > one iova per page of guest memory.
> >
> > > * The vGPU core driver then just maintains its own gpa<->iova
> > >   lookup table (e.g. called vgpu_dma);
> > >
> > > * Because each vfio_iommu_map invocation is for a contiguous
> > >   region, we can expect the same number of vgpu_dma entries as
> > >   are maintained in the vfio_dma list;
> > >
> > > It's then the vGPU core driver's responsibility to provide the
> > > gpa<->iova lookup for the vendor-specific GPU driver, and we
> > > don't need to worry about tens of thousands of entries. Once we
> > > get this simple 'pin-all' model ready, it can be further extended
> > > to support the 'pin-sparse' scenario: we'd still maintain a
> > > top-level vgpu_dma list, with each entry further linking to its
> > > own sparse mapping structure. In reality I don't expect we'd
> > > really need to maintain per-page translations even with sparse
> > > pinning.
> >
> > If you're trying to equate the scale of what we need to track vs
> > what type1 currently tracks, they're significantly different.
> > Possible things we need to track include the pfn, the iova, and
> > possibly a reference count or some sort of pinned page map.  In the
> > pin-all model we can assume that every page is pinned on map and
> > unpinned on unmap, so a reference count or map is unnecessary.  We
> > can also assume that we can always regenerate the pfn with
> > get_user_pages() from the vaddr, so we don't need to track that.  I
> > don't see any way around tracking the iova.  The iommu can't tell
> > us this like it can with the normal type1 model because the pfn is
> > the result of the translation, not the key for the translation.  So
> > we're always going to have between 1 and (size/PAGE_SIZE) iova
> > entries per vgpu_dma entry.  You might be able to manage the
> > vgpu_dma entries with an rb-tree, but each vgpu_dma entry needs
> > some data structure tracking every iova.
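To make the bookkeeping above concrete, here's a rough sketch of the
sort of per-vgpu_dma tracking I'm describing for the pin-all case.
Everything below is hypothetical, names included; it's not code from
any posted patch:

#include <linux/rbtree.h>
#include <linux/types.h>
#include <linux/mm.h>

/*
 * One entry per contiguous gpa range mapped by userspace, managed in
 * an rb-tree keyed by gpa.  In the pin-all model the pfn can always
 * be regenerated from vaddr via get_user_pages(), so only the iova
 * result of each per-page translation is stored: worst case
 * size >> PAGE_SHIFT entries per vgpu_dma.
 */
struct vgpu_dma {
	struct rb_node	node;	/* lookup by gpa */
	dma_addr_t	gpa;	/* guest physical base of the range */
	unsigned long	vaddr;	/* user virtual base of the range */
	size_t		size;
	dma_addr_t	*iova;	/* one entry per page, set at pin time */
};

/* The gpa->iova lookup a vendor driver would then call: */
static dma_addr_t vgpu_dma_translate(struct vgpu_dma *dma, dma_addr_t gpa)
{
	return dma->iova[(gpa - dma->gpa) >> PAGE_SHIFT];
}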
> There is one option. We could use alloc_iova to reserve a contiguous
> iova range for each vgpu_dma range and then use iommu_map/unmap to
> write the iommu ptes later, upon map request (then we'd have the
> same number of entries as vfio_dma, compared to an unbounded number
> of entries when using dma_map_page). Of course this needs to be done
> in the vGPU core driver, since vfio type1 only sees a faked iommu
> domain.

I'm not sure this is really how iova domains work.  There's only one
iova domain per iommu domain using the dma-iommu API, and
iommu_map/unmap are part of a different API.  An iova domain may be an
interesting solution though.

> > Sparse mapping has the same issue, but of course the tree of iovas
> > is potentially incomplete and we need a way to determine where it's
> > incomplete.  A page table rooted in the vgpu_dma and indexed by the
> > offset from the start vaddr seems like the way to go here.  It's
> > also possible that some mediated device models might store the iova
> > in the command sent to the device and therefore be able to parse
> > those entries back out to unmap them without storing them
> > separately.  This might be how the s390 channel-io model would
> > prefer to work.  That seems like further validation that such
> > tracking is going to be dependent on the mediated driver itself and
> > probably not something to centralize in a mediated iommu driver.
> > Thanks,

> Another, simpler way might be to allocate an array for each memory
> region registered from user space. For a 512MB region, that means a
> 128K*16=2MB array to track the pfn and iova mappings corresponding
> to each gfn. It may consume more resources than an rb-tree when not
> many pages need to be pinned, but could consume less once the
> rb-tree grows large.

An array is only the most space-efficient structure for a fully pinned
area where we have no contiguous iova.  If we're either mapping a
larger hugepage, or we have a larger contiguous iova space thanks to
scatter-gather mapping, or we're sparsely pinning the region, an array
can waste a lot of space.  512MB is also a pretty anemic example: 2MB
is a reasonable overhead there, but 2MB per 512MB looks pretty bad
once we scale it to a 512GB VM, where the same arrays would consume
2GB.  Thanks,

Alex
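P.S. So that we're looking at the same thing, below is roughly what I
imagine the alloc_iova approach would look like; whether an iova
domain can really be used this way alongside the dma-iommu API is
exactly my reservation above.  This is only a sketch, assuming the
vGPU core owns both the iova_domain and the backing iommu_domain, and
vgpu_reserve_iova/vgpu_map_pfn are made-up names:

#include <linux/dma-mapping.h>
#include <linux/iommu.h>
#include <linux/iova.h>

/*
 * Reserve one contiguous iova range up front for an entire vgpu_dma
 * region, so only a single base iova needs to be tracked per entry.
 */
static dma_addr_t vgpu_reserve_iova(struct iova_domain *iovad, size_t size)
{
	struct iova *iova;

	iova = alloc_iova(iovad, size >> iova_shift(iovad),
			  DMA_BIT_MASK(32) >> iova_shift(iovad),
			  true /* size-aligned */);
	return iova ? iova_dma_addr(iovad, iova) : 0;
}

/* Write the iommu ptes later, as each pfn is actually pinned: */
static int vgpu_map_pfn(struct iommu_domain *domain, dma_addr_t iova,
			unsigned long pfn)
{
	return iommu_map(domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
			 PAGE_SIZE, IOMMU_READ | IOMMU_WRITE);
}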