Thanks Alexey for your review!

On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
> On 13/12/24 18:08, Chenyi Qiang wrote:
>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>> discard") effectively disables device assignment when using guest_memfd.
>> This poses a significant challenge as guest_memfd is essential for
>> confidential guests, thereby blocking device assignment to these VMs.
>> The initial rationale for disabling device assignment was due to stale
>> IOMMU mappings (see Problem section) and the assumption that TEE I/O
>> (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-assignment
>> problem for confidential guests [1]. However, this assumption has proven
>> to be incorrect. TEE I/O relies on the ability to operate devices against
>> "shared" or untrusted memory, which is crucial for device initialization
>> and error recovery scenarios. As a result, the current implementation does
>> not adequately support device assignment for confidential guests,
>> necessitating a reevaluation of the approach to ensure compatibility and
>> functionality.
>>
>> This series enables shared device assignment by notifying VFIO of page
>> conversions using an existing framework named RamDiscardListener.
>> Additionally, there is an ongoing patch set [2] that aims to add 1G page
>> support for guest_memfd. This patch set introduces in-place page
>> conversion, where private and shared memory share the same physical pages
>> as the backend. This development may impact our solution.
>>
>> We presented our solution in the guest_memfd meeting to discuss its
>> compatibility with the new changes and potential future directions
>> (see [3] for more details). The conclusion was that, although our solution
>> may not be the most elegant (see the Limitation section), it is sufficient
>> for now and can be easily adapted to future changes.
>>
>> We are re-posting the patch series with some cleanup and have removed
>> the RFC label for the main enabling patches (1-6). The newly-added patch 7
>> is still marked as RFC as it tries to resolve some extension concerns
>> related to RamDiscardManager for future usage.
>>
>> The overview of the patches:
>> - Patch 1: Export a helper to get intersection of a MemoryRegionSection
>>   with a given range.
>> - Patch 2-6: Introduce a new object to manage the guest-memfd with
>>   RamDiscardManager, and notify the shared/private state change during
>>   conversion.
>> - Patch 7: Try to resolve a semantics concern related to RamDiscardManager
>>   i.e. RamDiscardManager is used to manage memory plug/unplug state
>>   instead of shared/private state. It would affect future users of
>>   RamDiscardManger in confidential VMs. Attach it behind as a RFC
>>   patch[4].
>>
>> Changes since last version:
>> - Add a patch to export some generic helper functions from virtio-mem
>>   code.
>> - Change the bitmap in guest_memfd_manager from default shared to default
>>   private. This keeps alignment with virtio-mem that 1-setting in bitmap
>>   represents the populated state and may help to export more generic code
>>   if necessary.
>> - Add the helpers to initialize/uninitialize the guest_memfd_manager
>>   instance to make it more clear.
>> - Add a patch to distinguish between the shared/private state change and
>>   the memory plug/unplug state change in RamDiscardManager.
>> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-chenyi.qiang@xxxxxxxxx/
>>
>> ---
>>
>> Background
>> ==========
>> Confidential VMs have two classes of memory: shared and private memory.
>> Shared memory is accessible from the host/VMM while private memory is
>> not. Confidential VMs can decide which memory is shared/private and
>> convert memory between shared/private at runtime.
>>
>> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
>> private memory.
>> The key differences between guest_memfd and normal memfd
>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
>> cannot be mapped, read or written by userspace.
>
> The "cannot be mapped" seems to be not true soon anymore (if not already).
>
> https://lore.kernel.org/all/20240801090117.3841080-1-tabba@xxxxxxxxxx/T/

Exactly, allowing guest_memfd to do mmap is the direction. I mentioned it
below with in-place page conversion. Maybe I would move it here to make it
more clear.

>
>
>>
>> In QEMU's implementation, shared memory is allocated with normal methods
>> (e.g. mmap or fallocate) while private memory is allocated from
>> guest_memfd. When a VM performs memory conversions, QEMU frees pages via
>> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
>> allocates new pages from the other side.
>> [...]
>>
>> One limitation (also discussed in the guest_memfd meeting) is that VFIO
>> expects the DMA mapping for a specific IOVA to be mapped and unmapped with
>> the same granularity. The guest may perform partial conversions, such as
>> converting a small region within a larger region. To prevent such invalid
>> cases, all operations are performed with 4K granularity. The possible
>> solutions we can think of are either to enable VFIO to support partial
>> unmap or to implement an enlightened guest to avoid partial conversion.
>> The former requires complex changes in VFIO, while the latter requires the
>> page conversion to be a guest-enlightened behavior. It is still uncertain
>> which option is a preferred one.
>
> in-place memory conversion is :)
>
>>
>> Testing
>> =======
>> This patch series is tested with the KVM/QEMU branch:
>> KVM: https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-11-20
>> QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2024-12-13
>
>
> The branch is gone now? tdx-upstream-snapshot-2024-12-18 seems to have
> these though. Thanks,

Thanks for pointing it out.
You're right, tdx-upstream-snapshot-2024-12-18 is the latest branch. I added
the fixup for patch 1 and forgot to update the change here.

>
>>
>> To facilitate shared device assignment with the NIC, employ the legacy
>> type1 VFIO with the QEMU command:
>>
>> qemu-system-x86_64 [...]
>>     -device vfio-pci,host=XX:XX.X
>>
>> The parameter of dma_entry_limit needs to be adjusted. For example, a
>> 16GB guest needs to adjust the parameter like
>> vfio_iommu_type1.dma_entry_limit=4194304.
>>
>> If use the iommufd-backed VFIO with the qemu command:
>>
>> qemu-system-x86_64 [...]
>>     -object iommufd,id=iommufd0 \
>>     -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
>>
>> No additional adjustment required.
>>
>> Following the bootup of the TD guest, the guest's IP address becomes
>> visible, and iperf is able to successfully send and receive data.
>
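
For reference, the dma_entry_limit value quoted above appears to correspond
to the worst case where every 4K page of the guest ends up as its own DMA
mapping, which follows from doing all operations with 4K granularity. A
minimal sketch of that arithmetic is below. The function name
dma_entries_needed is made up for illustration and is not part of QEMU or
the kernel, and the one-mapping-per-page assumption is mine.

```python
PAGE_SIZE = 4096  # 4K conversion granularity, as described in the cover letter


def dma_entries_needed(guest_bytes: int, granule: int = PAGE_SIZE) -> int:
    """Worst-case DMA mapping count, assuming one mapping per granule.

    Hypothetical helper for illustration only.
    """
    if guest_bytes < 0 or granule <= 0:
        raise ValueError("guest_bytes must be >= 0 and granule must be > 0")
    # Round up so a partial trailing granule still counts as one mapping.
    return -(-guest_bytes // granule)


if __name__ == "__main__":
    entries = dma_entries_needed(16 * 1024**3)  # 16GB guest
    print(f"vfio_iommu_type1.dma_entry_limit={entries}")
```

For a 16GB guest this gives 16 * 2^30 / 2^12 = 2^22 = 4194304, matching the
value in the testing notes. Other guest sizes would scale linearly under the
same assumption.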