Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

Jike Song <albcamus@xxxxxxxxx> · Thu, 12 May 2016 12:11:00 +0800

On Thu, May 12, 2016 at 6:06 AM, Alex Williamson
<alex.williamson@xxxxxxxxxx> wrote:
> On Wed, 11 May 2016 17:15:15 +0800
> Jike Song <jike.song@xxxxxxxxx> wrote:
>
>> On 05/11/2016 12:02 AM, Neo Jia wrote:
>> > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
>> >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
>> >>>> From: Song, Jike
>> >>>>
>> >>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
>> >>>> hardware. It just, as you said in another mail, "rather than
>> >>>> programming them into an IOMMU for a device, it simply stores the
>> >>>> translations for use by later requests".
>> >>>>
>> >>>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
>> >>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
>> >>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
>> >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
>> >>>> translations without any knowledge about hardware IOMMU, how is the
>> >>>> device model supposed to do to get an IOVA for a given GPA (thereby HPA
>> >>>> by the IOMMU backend here)?
>> >>>>
>> >>>> If things go as guessed above, as vfio_pin_pages() indicates, it
>> >>>> pin & translate vaddr to PFN, then it will be very difficult for the
>> >>>> device model to figure out:
>> >>>>
>> >>>>  1, for a given GPA, how to avoid calling dma_map_page multiple times?
>> >>>>  2, for which page to call dma_unmap_page?
>> >>>>
>> >>>> --
>> >>>
>> >>> We have to support both w/ iommu and w/o iommu case, since
>> >>> that fact is out of GPU driver control. A simple way is to use
>> >>> dma_map_page which internally will cope with w/ and w/o iommu
>> >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
>> >>> Then in this file we only need to cache GPA to whatever dma_addr_t
>> >>> returned by dma_map_page.
>> >>>
>> >>
>> >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?
>> >
>> > Hi Jike,
>> >
>> > With mediated passthru, you still can use hardware iommu, but more important
>> > that part is actually orthogonal to what we are discussing here as we will only
>> > cache the mapping between <gfn (iova if guest has iommu), (qemu) va>, once we
>> > have pinned pages later with the help of above info, you can map it into the
>> > proper iommu domain if the system has configured so.
>> >
>>
>> Hi Neo,
>>
>> Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
>> but to find out whether a pfn was previously mapped or not, you have to
>> track it with another rbtree-alike data structure (the IOMMU driver simply
>> doesn't bother with tracking), that seems somehow duplicate with the vGPU
>> IOMMU backend we are discussing here.
>>
>> And it is also semantically correct for an IOMMU backend to handle both w/
>> and w/o an IOMMU hardware? :)
>
> A problem with the iommu doing the dma_map_page() though is for what
> device does it do this?  In the mediated case the vfio infrastructure
> is dealing with a software representation of a device.  For all we
> know that software model could transparently migrate from one physical
> GPU to another.  There may not even be a physical device backing
> the mediated device.  Those are details left to the vgpu driver itself.
>

Great point :) Yes, I agree it's a bit intrusive to do the mapping for
a particular pdev in a vGPU IOMMU BE.

> Perhaps one possibility would be to allow the vgpu driver to register
> map and unmap callbacks.  The unmap callback might provide the
> invalidation interface that we're so far missing.  The combination of
> map and unmap callbacks might simplify the Intel approach of pinning the
> entire VM memory space, ie. for each map callback do a translation
> (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> the translation.

Yes, adding map/unmap ops in the pGPU driver (I assume you are referring
to gpu_device_ops as implemented in Kirti's patch) sounds like a good
idea, satisfying both: 1) keeping the vGPU purely virtual; 2) dealing
with the Linux DMA API to achieve hardware IOMMU compatibility.
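
To make the idea concrete, here is a minimal userspace sketch of such a
pair of callbacks. Everything in it is hypothetical: the struct name
vgpu_dma_ops, the mock device and the signatures are made up for
illustration and are not taken from Kirti's patch. In the kernel, the
map callback is where the pGPU driver would call dma_map_page() for its
own pdev, and the unmap callback is where dma_unmap_page() would go.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;	/* userspace stand-in for the kernel type */

/* Hypothetical callbacks a pGPU driver would register with the vGPU IOMMU BE. */
struct vgpu_dma_ops {
	int  (*map)(void *pdev, uint64_t pfn, dma_addr_t *dma);
	void (*unmap)(void *pdev, dma_addr_t dma);
};

/* Mock physical device: pretends an IOMMU hands out addresses above iova_base. */
struct mock_pdev {
	uint64_t iova_base;
	unsigned int live;	/* number of currently mapped pages */
};

static int mock_map(void *pdev, uint64_t pfn, dma_addr_t *dma)
{
	struct mock_pdev *p = pdev;

	/* A real driver would call dma_map_page() for its pdev here. */
	*dma = p->iova_base + (pfn << 12);
	p->live++;
	return 0;
}

static void mock_unmap(void *pdev, dma_addr_t dma)
{
	struct mock_pdev *p = pdev;

	(void)dma;		/* a real driver would call dma_unmap_page() here */
	assert(p->live > 0);
	p->live--;
}

static const struct vgpu_dma_ops mock_ops = {
	.map   = mock_map,
	.unmap = mock_unmap,
};
```

The IOMMU BE only ever sees the ops table and an opaque pdev pointer, so
it never needs to know which physical device (if any) backs the vGPU.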

PS, this has very little to do with pinning wholly or partially. Intel
KVMGT once had the whole guest memory pinned, only because we held a
spinlock, and code holding a spinlock can't sleep. We have removed that
spinlock in another upstreaming effort of ours, not here but for the
i915 driver, so probably no biggie.

> There's still the problem of where that dma_addr_t
> from the dma_map_page is stored though.  Someone would need to keep
> track of iova to dma_addr_t.  The vfio iommu might be a place to do
> that since we're already tracking information based on iova, possibly
> in an opaque data element provided by the vgpu driver.

Any reason to keep it opaque? Given that the vfio iommu is already
tracking the PFN for each iova (vaddr in the vGPU case), adding
dma_addr_t as another field seems simple. But I don't have a strong
opinion here, opaque definitely works for me :)
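
A rough userspace sketch of the non-opaque variant, only to illustrate
the point. The names (vgpu_dma_entry, vgpu_dma_table, table_get_or_add)
are hypothetical, and a flat array stands in for whatever structure the
real backend uses. Looking up the iova first is what avoids mapping the
same page twice, and the stored dma_addr is what a later unmap would
need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;	/* userspace stand-in for the kernel type */

/* Hypothetical per-iova tracking entry with dma_addr as a plain field. */
struct vgpu_dma_entry {
	uint64_t   iova;
	uint64_t   pfn;
	dma_addr_t dma_addr;
};

#define MAX_ENTRIES 16

struct vgpu_dma_table {
	struct vgpu_dma_entry e[MAX_ENTRIES];
	size_t n;
};

/* Linear search by iova; returns NULL when the iova is not tracked. */
static struct vgpu_dma_entry *table_find(struct vgpu_dma_table *t, uint64_t iova)
{
	for (size_t i = 0; i < t->n; i++)
		if (t->e[i].iova == iova)
			return &t->e[i];
	return NULL;
}

/*
 * Return the existing entry for iova if there is one (*cached = true),
 * otherwise record a new entry holding new_dma (*cached = false).
 * Returns NULL only when the table is full.
 */
static struct vgpu_dma_entry *table_get_or_add(struct vgpu_dma_table *t,
					       uint64_t iova, uint64_t pfn,
					       dma_addr_t new_dma, bool *cached)
{
	struct vgpu_dma_entry *ent = table_find(t, iova);

	*cached = (ent != NULL);
	if (ent)
		return ent;
	if (t->n == MAX_ENTRIES)
		return NULL;

	ent = &t->e[t->n++];
	ent->iova = iova;
	ent->pfn = pfn;
	ent->dma_addr = new_dma;
	return ent;
}
```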

> However, we're
> going to need to take a serious look at whether an rb-tree is the right
> data structure for the job.  It works well for the current type1
> functionality where we typically have tens of entries.  I think the
> NVIDIA model of sparse pinning the VM is pushing that up to tens of
> thousands.  If Intel intends to pin the entire guest, that's
> potentially tens of millions of tracked entries and I don't know that
> an rb-tree is the right tool for that job.  Thanks,
>

Taking rbtree efficiency into account, that is yet another reason for us
to pin partially. Assuming partial pinning is guaranteed, do you think
an rbtree is good enough?

--
Thanks,
Jike