On Fri, May 13, 2016 at 02:08:36PM +0800, Jike Song wrote:
> On 05/13/2016 03:49 AM, Neo Jia wrote:
> > On Thu, May 12, 2016 at 12:11:00PM +0800, Jike Song wrote:
> >> On Thu, May 12, 2016 at 6:06 AM, Alex Williamson
> >> <alex.williamson@xxxxxxxxxx> wrote:
> >>> On Wed, 11 May 2016 17:15:15 +0800
> >>> Jike Song <jike.song@xxxxxxxxx> wrote:
> >>>
> >>>> On 05/11/2016 12:02 AM, Neo Jia wrote:
> >>>>> On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
> >>>>>> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
> >>>>>>>> From: Song, Jike
> >>>>>>>>
> >>>>>>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> >>>>>>>> hardware. It just, as you said in another mail, "rather than
> >>>>>>>> programming them into an IOMMU for a device, it simply stores the
> >>>>>>>> translations for use by later requests".
> >>>>>>>>
> >>>>>>>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
> >>>>>>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
> >>>>>>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> >>>>>>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> >>>>>>>> translations without any knowledge about hardware IOMMU, how is the
> >>>>>>>> device model supposed to do to get an IOVA for a given GPA (thereby HPA
> >>>>>>>> by the IOMMU backend here)?
> >>>>>>>>
> >>>>>>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> >>>>>>>> pin & translate vaddr to PFN, then it will be very difficult for the
> >>>>>>>> device model to figure out:
> >>>>>>>>
> >>>>>>>> 1, for a given GPA, how to avoid calling dma_map_page multiple times?
> >>>>>>>> 2, for which page to call dma_unmap_page?
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>
> >>>>>>> We have to support both w/ iommu and w/o iommu case, since
> >>>>>>> that fact is out of GPU driver control. A simple way is to use
> >>>>>>> dma_map_page which internally will cope with w/ and w/o iommu
> >>>>>>> case gracefully, i.e.
> >>>>>>> return HPA w/o iommu and IOVA w/ iommu.
> >>>>>>> Then in this file we only need to cache GPA to whatever dma_addr_t
> >>>>>>> returned by dma_map_page.
> >>>>>>>
> >>>>>>
> >>>>>> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?
> >>>>>
> >>>>> Hi Jike,
> >>>>>
> >>>>> With mediated passthru, you still can use hardware iommu, but more important
> >>>>> that part is actually orthogonal to what we are discussing here as we will only
> >>>>> cache the mapping between <gfn (iova if guest has iommu), (qemu) va>, once we
> >>>>> have pinned pages later with the help of above info, you can map it into the
> >>>>> proper iommu domain if the system has configured so.
> >>>>>
> >>>>
> >>>> Hi Neo,
> >>>>
> >>>> Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
> >>>> but to find out whether a pfn was previously mapped or not, you have to
> >>>> track it with another rbtree-alike data structure (the IOMMU driver simply
> >>>> doesn't bother with tracking), that seems somehow duplicate with the vGPU
> >>>> IOMMU backend we are discussing here.
> >>>>
> >>>> And it is also semantically correct for an IOMMU backend to handle both w/
> >>>> and w/o an IOMMU hardware? :)
> >>>
> >>> A problem with the iommu doing the dma_map_page() though is for what
> >>> device does it do this? In the mediated case the vfio infrastructure
> >>> is dealing with a software representation of a device. For all we
> >>> know that software model could transparently migrate from one physical
> >>> GPU to another. There may not even be a physical device backing
> >>> the mediated device. Those are details left to the vgpu driver itself.
> >>>
> >>
> >> Great point :) Yes, I agree it's a bit intrusive to do the mapping for
> >> a particular pdev in a vGPU IOMMU BE.
> >>
> >>> Perhaps one possibility would be to allow the vgpu driver to register
> >>> map and unmap callbacks.
> >>> The unmap callback might provide the
> >>> invalidation interface that we're so far missing. The combination of
> >>> map and unmap callbacks might simplify the Intel approach of pinning the
> >>> entire VM memory space, ie. for each map callback do a translation
> >>> (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> >>> the translation.
> >>
> >> Yes adding map/unmap ops in pGPU driver (I assume you are referring to
> >> gpu_device_ops as implemented in Kirti's patch) sounds like a good idea,
> >> satisfying both: 1) keeping vGPU purely virtual; 2) dealing with the
> >> Linux DMA API to achieve hardware IOMMU compatibility.
> >>
> >> PS, this has very little to do with pinning wholly or partially. Intel KVMGT
> >> once had the whole guest memory pinned, only because we used a spinlock,
> >> which can't sleep at runtime. We have removed that spinlock in another
> >> upstreaming effort of ours, not here but for i915 driver, so probably no biggie.
> >>
> >
> > OK, then you guys don't need to pin everything.
>
> Yes :)
>
> > The next question will be if you
> > can send the pinning request from your mediated driver backend to request memory
> > pinning like we have demonstrated in the v3 patch, function vfio_pin_pages and
> > vfio_unpin_pages?
>
> Kind of yes, not exactly.
>
> IMO the mediated driver backend cares not only about pinning, but also the more
> important translation. The vfio_pin_pages of v3 patch does the pinning and
> translation simultaneously, whereas I do think the API is better named to
> 'translate' instead of 'pin', like v2 did.

Hi Jike,

Let me explain here.

The "pin and translate" steps have to be done together, and pinning here
doesn't mean installing anything into the real IOMMU hardware.

Pinning means locking down the underlying pages for a given QEMU VA, which
corresponds to the guest physical address. Why do we have to do that?
If not, the underlying physical pages can be moved and DMA will not work
properly. This is exactly why the default iommu type1 driver uses
get_user_pages to *pin* memory.

The translation part is easy to understand, I think.

If you want to read more, you can check the latest email from Alex about a
recent regression introduced by THP, where the underlying page was moved by
THP for a QEMU VA, so DMA broke.

https://lkml.org/lkml/2016/4/28/604

Once you have the pfn, the vendor driver can decide what to do next.

>
> We possibly have the same requirement from the mediated driver backend:
>
>     a) get a GFN, when guest try to tell hardware;
>     b) consult the vfio iommu with that GFN[1]: will you find me a proper dma_addr?

We will provide you the pfn via vfio_pin_pages, so you can map it for DMA
in your i915 driver, which is what we are doing today.

>
> The vfio iommu backend searches the tracking table with this GFN[1]:
>
>     c) if entry found, return the dma_addr;
>     d) if nothing found, call GUP to pin the page, and dma_map_page to get the dma_addr[2], return it;
>
> The dma_addr will be told to real GPU hardware.
>
> I can't simply say a 'Yes' here, since we may consult dma_addr for a GFN
> multiple times, but only for the first time we need to pin the page.

It is very important to keep things consistent from the kernel's point of view
and not to trust the device driver. For example, it is always good to assume
the device is going to reference a page whenever it asks for that information,
so it is always good to bump the reference count each time it asks.

So it is the caller's responsibility to know what they are doing when calling
vfio_pin_pages; the same applies to get_user_pages.

Thanks,
Neo

>
> IOW, pinning is kind of an internal action in the iommu backend.
>
>
> //Sorry for the long, maybe boring explanation.. :)
>
>
> [1] GFN or vaddr, no biggie
> [2] As pointed out by Alex, dma_map_page can be called elsewhere like a callback.
>
>
> --
> Thanks,
> Jike
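
Editor's illustration: the lookup-or-pin flow in steps (a) to (d), combined with the point that every request should take its own reference, can be sketched as a small userspace model. This is Python, not kernel code, and it is not taken from any of the patches discussed. `PinTracker`, `Entry`, `pin_and_translate` and the four callbacks are hypothetical names. The `pin`/`unpin` callbacks only stand in for get_user_pages and the matching page release, and `dma_map`/`dma_unmap` stand in for vendor-driver callbacks that would call dma_map_page and dma_unmap_page. The table is keyed by GFN here, although a vaddr would work the same way (see [1]).

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Entry:
    pfn: int        # host page frame number obtained when the page was pinned
    dma_addr: int   # address returned by the (hypothetical) map callback
    refs: int = 1   # one reference per successful pin_and_translate() call


class PinTracker:
    """Userspace model of a GFN -> dma_addr tracking table (hypothetical)."""

    def __init__(self,
                 pin: Callable[[int], int],
                 unpin: Callable[[int], None],
                 dma_map: Callable[[int], int],
                 dma_unmap: Callable[[int], None]) -> None:
        # Assumption: pin(gfn) returns a pfn, dma_map(pfn) returns a dma_addr.
        self._pin, self._unpin = pin, unpin
        self._dma_map, self._dma_unmap = dma_map, dma_unmap
        self._table: Dict[int, Entry] = {}

    def pin_and_translate(self, gfn: int) -> int:
        entry = self._table.get(gfn)
        if entry is not None:
            # Step (c): already tracked, so only take another reference.
            entry.refs += 1
            return entry.dma_addr
        # Step (d): first request for this GFN, so pin once and map once.
        pfn = self._pin(gfn)
        entry = Entry(pfn=pfn, dma_addr=self._dma_map(pfn))
        self._table[gfn] = entry
        return entry.dma_addr

    def unpin(self, gfn: int) -> None:
        entry = self._table[gfn]  # KeyError if this GFN was never pinned
        entry.refs -= 1
        if entry.refs == 0:
            # Last reference dropped: undo the mapping, then release the page.
            self._dma_unmap(entry.dma_addr)
            self._unpin(entry.pfn)
            del self._table[gfn]
```

In this model a second `pin_and_translate()` for the same GFN returns the cached dma_addr without pinning or mapping again, and the page is released only after a matching number of `unpin()` calls. That is one way to reconcile "pin only the first time" with "keep the reference counter going every time the driver asks".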