Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory

David Hildenbrand <david@xxxxxxxxxx> · Wed, 1 Sep 2021 18:27:20 +0200

On 01.09.21 18:07, Andy Lutomirski wrote:
On 9/1/21 3:24 AM, Yu Zhang wrote:
On Tue, Aug 31, 2021 at 09:53:27PM -0700, Andy Lutomirski wrote:

On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote:
On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote:

Thanks a lot for this summary. A question about the requirement: do we or
do we not plan to support assigning devices to the protected VM?

If yes, the fd-based solution may need to change the VFIO interface as well
(though the fake swap entry solution needs to mess with VFIO too), because:

1> KVM uses VFIO when assigning devices into a VM.

2> Not knowing which GPA ranges may be used by the VM as DMA buffers, all
guest pages will have to be mapped in the host IOMMU page table to host pages,
which are pinned during the whole life cycle of the VM.

3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver,
in vfio_dma_do_map().

4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA
and pin the page.

But if we are using the fd-based solution, not every GPA can have an HVA, thus
the current VFIO interface to map and pin the GPA (IOVA) won't work. And I
doubt that VFIO can be modified to support this easily.
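
For reference, a minimal sketch of the userspace side of points 3> and 4>,
assuming the type1 IOMMU backend and the uAPI in <linux/vfio.h>. The helper
name make_map_request() is hypothetical and exists only for illustration. The
map request carries a vaddr and has no fd/offset field, which is why a GPA
without an HVA has nothing to pass here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/vfio.h>

/*
 * Hypothetical helper (not part of VFIO): fill in a type1 map request.
 * The caller must supply a process virtual address (HVA) for the memory
 * backing the IOVA range; the struct offers no other way to name it.
 */
struct vfio_iommu_type1_dma_map
make_map_request(void *hva, uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_map req;

    assert(hva != NULL && size != 0);
    memset(&req, 0, sizeof(req));
    req.argsz = sizeof(req);
    req.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    req.vaddr = (uint64_t)(uintptr_t)hva; /* HVA the kernel resolves and pins */
    req.iova  = iova;                     /* the GPA when mapping guest memory */
    req.size  = size;
    return req;
}

/*
 * Usage, assuming `container` is an already configured VFIO container fd:
 *
 *     struct vfio_iommu_type1_dma_map req = make_map_request(buf, gpa, len);
 *     ioctl(container, VFIO_IOMMU_MAP_DMA, &req);
 */
```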

Do you mean assigning a normal device to a protected VM or a hypothetical protected-MMIO device?

If the former, it should work more or less like with a non-protected VM. mmap the VFIO device, set up a memslot, and use it.  I'm not sure whether anyone will actually do this, but it should be possible, at least in principle.  Maybe someone will want to assign a NIC to a TDX guest.  An NVMe device with the understanding that the guest can't trust it wouldn't be entirely crazy either.

If the latter, AFAIK there is no spec for how it would work even in principle. Presumably it wouldn't work quite like VFIO -- instead, the kernel could have a protection-virtual-io-fd mechanism, and that fd could be bound to a memslot in whatever way we settle on for binding secure memory to a memslot.

Thanks Andy. I was asking the first scenario.

Well, I agree it is doable if someone really wants an assigned
device in a TD guest. As Kevin mentioned in his reply, the HPA can be
generated by extending VFIO with a new mapping protocol which
uses fd+offset instead of HVA.

I'm confused.  I don't see why any new code is needed for this at all.
Every proposal I've seen for handling TDX memory continues to handle TDX
*shared* memory exactly like regular guest memory today.  The only
differences are that more hole punching will be needed, which will
require lightweight memslots (to have many of them), memslots with
holes, or mappings backing memslots with holes (which can be done with
munmap() on current kernels).
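
A minimal sketch of the last option, assuming Linux and an anonymous mapping
standing in for the memory that backs a memslot. The helper names
map_with_hole() and page_is_mapped() are hypothetical. It only shows that
munmap() can leave a hole in the middle of a larger mapping; it does not touch
KVM.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map npages anonymous pages, then unmap the single
 * page at index hole_page. Returns the base address or MAP_FAILED.
 */
void *map_with_hole(size_t npages, size_t hole_page)
{
    long psz = sysconf(_SC_PAGESIZE);
    char *base;

    assert(psz > 0 && hole_page < npages);
    base = mmap(NULL, npages * (size_t)psz, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return MAP_FAILED;
    if (munmap(base + hole_page * (size_t)psz, (size_t)psz) != 0) {
        munmap(base, npages * (size_t)psz);
        return MAP_FAILED;
    }
    return base;
}

/*
 * Returns 1 if the page at addr is mapped, 0 if it is not, -1 on any other
 * error. mincore() fails with ENOMEM for an unmapped range.
 */
int page_is_mapped(void *addr)
{
    unsigned char vec;
    long psz = sysconf(_SC_PAGESIZE);

    if (mincore(addr, (size_t)psz, &vec) == 0)
        return 1;
    return errno == ENOMEM ? 0 : -1;
}
```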

So you can literally just mmap a VFIO device and expect it to work,
exactly like it does right now.  Whether the guest will be willing to
use the device will depend on the guest security policy (all kinds of
patches about that are flying around), but if the guest tries to use it,
it really should just work.

... but if you end up mapping private memory into the IOMMU of the device
and the device ends up accessing that memory, we're in the same position
that the host might get capped, just like access from user space, no?

Sure, you can map only the complete duplicate shared-memory region into
the IOMMU of the device, that would work. It's a shame that vfio almost
always pins all guest memory, so you essentially cannot punch holes into
the shared memory anymore, resulting, in the worst case, in duplicate
memory consumption for your VM.

So you'd actually want to map only the *currently* shared pieces into
the IOMMU and update the mappings on demand. Having worked on something
related, I can only say that a limit of 64k individual mappings, and not
being able to modify existing mappings except by completely deleting them
and replacing them with something new (i.e., not atomically), can be quite
an issue for bigger VMs.
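
To make the two constraints concrete, here is a toy model (not the VFIO or
IOMMU API; every name in it is made up for illustration). It assumes a table
with a fixed mapping limit, where an existing mapping can only be removed as a
whole. Converting one page from shared to private then means dropping the
whole mapping and re-creating the pieces around the hole, which both opens a
window with no mapping at all and costs an extra entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only; none of these names exist in VFIO or the kernel. */
#define TOY_CAPACITY 16

struct toy_range { uint64_t iova, size; };

struct toy_iommu {
    struct toy_range r[TOY_CAPACITY];
    size_t count; /* mappings currently installed */
    size_t limit; /* small stand-in for the 64k limit mentioned above */
};

/* Install one mapping; fails once the limit is reached. */
int toy_map(struct toy_iommu *t, uint64_t iova, uint64_t size)
{
    assert(t->limit <= TOY_CAPACITY);
    if (size == 0 || t->count >= t->limit)
        return -1;
    t->r[t->count++] = (struct toy_range){ iova, size };
    return 0;
}

/* Remove a mapping, but only as a whole: partial unmaps are rejected. */
int toy_unmap(struct toy_iommu *t, uint64_t iova, uint64_t size)
{
    for (size_t i = 0; i < t->count; i++) {
        if (t->r[i].iova == iova && t->r[i].size == size) {
            t->r[i] = t->r[--t->count];
            return 0;
        }
    }
    return -1;
}

/*
 * Turn [hole, hole + hole_size) inside an existing mapping private:
 * drop the whole mapping, then re-create the pieces on either side.
 * Between toy_unmap() and the toy_map() calls the device has no mapping
 * for the range at all, which is the non-atomic window. If a re-map hits
 * the limit, the remaining piece stays unmapped.
 */
int toy_punch_hole(struct toy_iommu *t, uint64_t iova, uint64_t size,
                   uint64_t hole, uint64_t hole_size)
{
    uint64_t end = iova + size;
    uint64_t hole_end = hole + hole_size;

    if (hole_size == 0 || hole < iova || hole_end > end || hole_end < hole)
        return -1;
    if (toy_unmap(t, iova, size) != 0)
        return -1;
    if (hole > iova && toy_map(t, iova, hole - iova) != 0)
        return -1;
    if (hole_end < end && toy_map(t, hole_end, end - hole_end) != 0)
        return -1;
    return 0;
}
```

With a limit of 2, one mapping survives a single hole punch (one entry becomes
two), but the next punch removes an entry and can only put one of the two
pieces back.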

--
Thanks,

David / dhildenb