Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated discard") effectively disables device assignment with guest_memfd. guest_memfd is required for confidential guests, so device assignment to confidential guests is disabled. A supporting assumption for disabling device-assignment was that TEE I/O (SEV-TIO, TDX Connect, COVE-IO etc...) solves the confidential-guest device-assignment problem [1]. That turns out not to be the case because TEE I/O depends on being able to operate devices against "shared"/untrusted memory for device initialization and error recovery scenarios. This series utilizes an existing framework named RamDiscardManager to notify VFIO of page conversions. However, there's still one concern related to the semantics of RamDiscardManager which is used to manage the memory plug/unplug state. This is a little different from the memory shared/private in our requirement. See the "Open" section below for more details. Background ========== Confidential VMs have two classes of memory: shared and private memory. Shared memory is accessible from the host/VMM while private memory is not. Confidential VMs can decide which memory is shared/private and convert memory between shared/private at runtime. "guest_memfd" is a new kind of fd whose primary goal is to serve guest private memory. The key differences between guest_memfd and normal memfd are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and cannot be mapped, read or written by userspace. In QEMU's implementation, shared memory is allocated with normal methods (e.g. mmap or fallocate) while private memory is allocated from guest_memfd. When a VM performs memory conversions, QEMU frees pages via madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and allocates new pages from the other side. Problem ======= Device assignment in QEMU is implemented via VFIO system. In the normal VM, VM memory is pinned at the beginning of time by VFIO. In the confidential VM, the VM can convert memory and when that happens nothing currently tells VFIO that its mappings are stale. This means that page conversion leaks memory and leaves stale IOMMU mappings. For example, sequence like the following can result in stale IOMMU mappings: 1. allocate shared page 2. convert page shared->private 3. discard shared page 4. convert page private->shared 5. allocate shared page 6. issue DMA operations against that shared page After step 3, VFIO is still pinning the page. However, DMA operations in step 6 will hit the old mapping that was allocated in step 1, which causes the device to access the invalid data. Currently, the commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated discard") has blocked the device assignment with guest_memfd to avoid this problem. Solution ======== The key to enable shared device assignment is to solve the stale IOMMU mappings problem. Given the constraints and assumptions here is a solution that satisfied the use cases. RamDiscardManager, an existing interface currently utilized by virtio-mem, offers a means to modify IOMMU mappings in accordance with VM page assignment. Page conversion is similar to hot-removing a page in one mode and adding it back in the other. This series implements a RamDiscardManager for confidential VMs and utilizes its infrastructure to notify VFIO of page conversions. Another possible attempt [2] was to not discard shared pages in step 3 above. This was an incomplete band-aid because guests would consume twice the memory since shared pages wouldn't be freed even after they were converted to private. Open ==== Implementing a RamDiscardManager to notify VFIO of page conversions causes changes in semantics: private memory is treated as discarded (or hot-removed) memory. This isn't aligned with the expectation of current RamDiscardManager users (e.g. VFIO or live migration) who really expect that discarded memory is hot-removed and thus can be skipped when the users are processing guest memory. Treating private memory as discarded won't work in future if VFIO or live migration needs to handle private memory. e.g. VFIO may need to map private memory to support Trusted IO and live migration for confidential VMs need to migrate private memory. There are two possible ways to mitigate the semantics changes. 1. Develop a new mechanism to notify the page conversions between private and shared. For example, utilize the notifier_list in QEMU. VFIO registers its own handler and gets notified upon page conversions. This is a clean approach which only touches the notifier workflow. A challenge is that for device hotplug, existing shared memory should be mapped in IOMMU. This will need additional changes. 2. Extend the existing RamDiscardManager interface to manage not only the discarded/populated status of guest memory but also the shared/private status. RamDiscardManager users like VFIO will be notified with one more argument indicating what change is happening and can take action accordingly. It also has challenges e.g. QEMU allows only one RamDiscardManager, how to support virtio-mem for confidential VMs would be a problem. And some APIs like .is_populated() exposed by RamDiscardManager are meaningless to shared/private memory. So they may need some adjustments. Testing ======= This patch series is tested based on the internal TDX KVM/QEMU tree. To facilitate shared device assignment with the NIC, employ the legacy type1 VFIO with the QEMU command: qemu-system-x86_64 [...] -device vfio-pci,host=XX:XX.X The parameter of dma_entry_limit needs to be adjusted. For example, a 16GB guest needs to adjust the parameter like vfio_iommu_type1.dma_entry_limit=4194304. If use the iommufd-backed VFIO with the qemu command: qemu-system-x86_64 [...] -object iommufd,id=iommufd0 \ -device vfio-pci,host=XX:XX.X,iommufd=iommufd0 No additional adjustment required. Following the bootup of the TD guest, the guest's IP address becomes visible, and iperf is able to successfully send and receive data. Related link ============ [1] https://lore.kernel.org/all/d6acfbef-96a1-42bc-8866-c12a4de8c57c@xxxxxxxxxx/ [2] https://lore.kernel.org/all/20240320083945.991426-20-michael.roth@xxxxxxx/ Chenyi Qiang (6): guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager guest_memfd: Introduce a helper to notify the shared/private state change KVM: Notify the state change via RamDiscardManager helper during shared/private conversion memory: Register the RamDiscardManager instance upon guest_memfd creation guest-memfd: Default to discarded (private) in guest_memfd_manager RAMBlock: make guest_memfd require coordinate discard accel/kvm/kvm-all.c | 7 + include/sysemu/guest-memfd-manager.h | 49 +++ system/guest-memfd-manager.c | 425 +++++++++++++++++++++++++++ system/meson.build | 1 + system/physmem.c | 11 +- 5 files changed, 492 insertions(+), 1 deletion(-) create mode 100644 include/sysemu/guest-memfd-manager.h create mode 100644 system/guest-memfd-manager.c base-commit: 900536d3e97aed7fdd9cb4dadd3bf7023360e819 -- 2.43.5