On Wed, Sep 01, 2021 at 09:07:59AM -0700, Andy Lutomirski wrote:
> On 9/1/21 3:24 AM, Yu Zhang wrote:
> > On Tue, Aug 31, 2021 at 09:53:27PM -0700, Andy Lutomirski wrote:
> >>
> >>
> >> On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote:
> >>> On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote:
> >>
> >>> Thanks a lot for this summary. A question about the requirement: do we or
> >>> do we not have plan to support assigned device to the protected VM?
> >>>
> >>> If yes. The fd based solution may need change the VFIO interface as well
> >>> (though the fake swap entry solution need mess with VFIO too). Because:
> >>>
> >>> 1> KVM uses VFIO when assigning devices into a VM.
> >>>
> >>> 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all
> >>> guest pages will have to be mapped in host IOMMU page table to host pages,
> >>> which are pinned during the whole life cycle of the VM.
> >>>
> >>> 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver,
> >>> in vfio_dma_do_map().
> >>>
> >>> 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA
> >>> and pin the page.
> >>>
> >>> But if we are using fd based solution, not every GPA can have a HVA, thus
> >>> the current VFIO interface to map and pin the GPA(IOVA) won't work. And I
> >>> doubt if VFIO can be modified to support this easily.
> >>>
> >>>
> >>
> >> Do you mean assigning a normal device to a protected VM or a hypothetical protected-MMIO device?
> >>
> >> If the former, it should work more or less like with a non-protected VM. mmap the VFIO device, set up a memslot, and use it. I'm not sure whether anyone will actually do this, but it should be possible, at least in principle. Maybe someone will want to assign a NIC to a TDX guest. An NVMe device with the understanding that the guest can't trust it wouldn't be entirely crazy either.
> >>
> >> If the latter, AFAIK there is no spec for how it would work even in principle.
> >> Presumably it wouldn't work quite like VFIO -- instead, the kernel could have a protection-virtual-io-fd mechanism, and that fd could be bound to a memslot in whatever way we settle on for binding secure memory to a memslot.
> >>
> >
> > Thanks Andy. I was asking the first scenario.
> >
> > Well, I agree it is doable if someone really want some assigned
> > device in TD guest. As Kevin mentioned in his reply, HPA can be
> > generated, by extending VFIO with a new mapping protocol which
> > uses fd+offset, instead of HVA.
>
> I'm confused. I don't see why any new code is needed for this at all.
> Every proposal I've seen for handling TDX memory continues to handle TDX
> *shared* memory exactly like regular guest memory today. The only
> differences are that more hole punching will be needed, which will
> require lightweight memslots (to have many of them), memslots with
> holes, or mappings backing memslots with holes (which can be done with
> munmap() on current kernels).

Thanks for pointing this out. And yes, for DMAs not capable of encryption
(which is the case in current TDX), GUP shall work as it is in VFIO. :)

>
> So you can literally just mmap a VFIO device and expect it to work,
> exactly like it does right now. Whether the guest will be willing to
> use the device will depend on the guest security policy (all kinds of
> patches about that are flying around), but if the guest tries to use it,
> it really should just work.
>

But I think there's still a problem. For now,

1> Qemu mmap()s all GPAs into its HVA space when the VM is created.
2> With no idea which part of guest memory shall be shared, VFIO will just
set up the IOPT by mapping the whole GPA ranges.
3> And those GPAs are actually private ones, with no shared-bit set. Later,
when the guest tries to perform a DMA (using a shared GPA), an IO page fault
shall happen.

> >
> > Another issue is current TDX does not support DMA encryption, and
> > only shared GPA memory shall be mapped in the VT-d.
> > So to support this, KVM may need to work with VFIO to dynamically
> > program host IOPT(IOMMU Page Table) when TD guest notifies a shared
> > GFN range(e.g., with a MAP_GPA TDVMCALL), instead of prepopulating
> > the IOPT at VM creation time, by mapping entire GFN ranges of a guest.
>
> Given that there is no encrypted DMA support, shouldn't the only IOMMU
> mappings (real host-side IOMMU) that point at guest memory be for
> non-encrypted DMA? I don't see how this interacts at all. If the guest
> tries to MapGPA to turn a shared MMIO page into private, the host should
> fail the hypercall because the operation makes no sense.
>
> It is indeed the case that, with a TDX guest, MapGPA shared->private to
> a page that was previously used for unencrypted DMA will need to avoid
> having IOPT entries to the new private page, but even that doesn't seem
> particularly bad. The fd+special memslot proposal for private memory
> means that shared *backing store* pages never actually transition
> between shared and private without being completely freed.
>
> As far as I can tell, the actual problem you're referring to is:
>
> >>> 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all
> >>> guest pages will have to be mapped in host IOMMU page table to host pages,
> >>> which are pinned during the whole life cycle of the VM.

Yes. That's the primary concern. :)

>
> In principle, you could actually initialize a TDX guest with all of its
> memory shared and all of it mapped in the host IOMMU. When a guest
> turns some pages private, user code could punch a hole in the memslot,
> allocate private memory at that address, but leave the shared backing
> store in place and still mapped in the host IOMMU. The result would be
> that guest-initiated DMA to the previously shared address would actually
> work but would hit pages that are invisible to the guest. And a whole
> bunch of memory would be wasted, but the whole system should still work.

Do you mean to let VFIO & IOMMU treat all guest memory as shared first,
and then just allocate the private pages in another backing store? I guess
that could work, but at the cost of allocating roughly 2x the guest RAM
size in physical pages. After all, the shared pages shall be only a small
part of guest memory.

If device assignment is desired in current TDX, my understanding of the
enabling work would be like this:

1> Change Qemu to not trigger VFIO_IOMMU_MAP_DMA for the TD, thus no IOPT
prepopulated, and no physical page allocated.
2> KVM forwards MapGPA(private -> shared) request to Qemu.
3> Qemu asks VFIO to pin and map the shared GPAs.

For shared -> private transitions, the memslot punching, IOPT unmapping,
and iotlb flushing are necessary. Possibly a new interface between VFIO and
KVM is needed.

But actually I am not sure if people really want assigned devices in current
TDX. The performance bottleneck should be the copying to/from swiotlb buffers.

B.R.
Yu
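
To make the enabling steps above a bit more concrete, here is a toy Python
model of the flow. It is only a sketch under loud assumptions: ToyIommu,
ToyVmm, pin_and_map(), unmap_and_flush() and map_gpa() are all made-up names
for illustration, not KVM, VFIO or Qemu interfaces, and "pinning" is just a
counter handing out fake host page numbers.

```python
# Toy model only: every name here is hypothetical, not a KVM/VFIO/Qemu API.
from dataclasses import dataclass, field


@dataclass
class ToyIommu:
    """Stands in for the host IOPT: guest frame number -> pinned host page."""
    iopt: dict = field(default_factory=dict)
    iotlb_flushes: int = 0
    next_host_page: int = 0x1000  # fake allocator for "pinned" host pages

    def pin_and_map(self, gfn):
        # Step 3: pin a host page and install the IOPT entry for a shared GFN.
        if gfn not in self.iopt:
            self.iopt[gfn] = self.next_host_page
            self.next_host_page += 1

    def unmap_and_flush(self, gfn):
        # Reverse direction: drop the IOPT entry, then flush the IOTLB.
        if self.iopt.pop(gfn, None) is not None:
            self.iotlb_flushes += 1

    def dma(self, gfn):
        # A device access to an unmapped frame models an IO page fault.
        if gfn not in self.iopt:
            raise LookupError(f"IO page fault at gfn {gfn:#x}")
        return self.iopt[gfn]


class ToyVmm:
    """Stands in for userspace receiving MapGPA requests forwarded by KVM."""

    def __init__(self, iommu):
        self.iommu = iommu
        self.shared = set()  # step 1: nothing is mapped at VM creation time

    def map_gpa(self, first_gfn, npages, to_shared):
        # Step 2: handle a forwarded MapGPA request, one page at a time.
        for gfn in range(first_gfn, first_gfn + npages):
            if to_shared:
                self.shared.add(gfn)
                self.iommu.pin_and_map(gfn)
            else:
                self.shared.discard(gfn)
                self.iommu.unmap_and_flush(gfn)


iommu = ToyIommu()
vmm = ToyVmm(iommu)
vmm.map_gpa(0x100, 4, to_shared=True)    # guest shares four pages for DMA
iommu.dma(0x101)                         # DMA to a shared page succeeds
vmm.map_gpa(0x102, 1, to_shared=False)   # guest takes one page back
```

In this model only the four shared pages are ever pinned, which is the point
of not prepopulating the IOPT; a DMA to 0x102 after the last call would raise
the modelled IO page fault.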