Hi Paolo,

A gentle ping on this thread. In-place page conversion was discussed at
Linux Plumbers. Does that give any direction for the shared device
assignment enabling work?

Thanks
Chenyi

On 8/16/2024 11:02 AM, Chenyi Qiang wrote:
> Hi Paolo,
>
> I hope to draw your attention to this series. TEE I/O depends on shared
> device assignment, so we introduced this RDM (RamDiscardManager) solution
> in QEMU. Now, given the in-place private/shared conversion option
> mentioned by David, do you think we should continue to add pass-through
> support for this in-QEMU page conversion method? Or should we wait for
> that discussion to see whether it changes to in-kernel conversion?
>
> Thanks
> Chenyi
>
> On 7/25/2024 3:21 PM, Chenyi Qiang wrote:
>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>> discard") effectively disables device assignment with guest_memfd.
>> guest_memfd is required for confidential guests, so device assignment to
>> confidential guests is disabled. A supporting assumption for disabling
>> device-assignment was that TEE I/O (SEV-TIO, TDX Connect, COVE-IO
>> etc...) solves the confidential-guest device-assignment problem [1].
>> That turns out not to be the case because TEE I/O depends on being able
>> to operate devices against "shared"/untrusted memory for device
>> initialization and error recovery scenarios.
>>
>> This series utilizes an existing framework named RamDiscardManager to
>> notify VFIO of page conversions. However, there's still one concern
>> related to the semantics of RamDiscardManager, which is used to manage
>> the memory plug/unplug state. This is a little different from the
>> shared/private memory state in our requirement. See the "Open" section
>> below for more details.
>>
>> Background
>> ==========
>> Confidential VMs have two classes of memory: shared and private memory.
>> Shared memory is accessible from the host/VMM while private memory is
>> not.
>> Confidential VMs can decide which memory is shared/private and
>> convert memory between shared/private at runtime.
>>
>> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
>> private memory. The key differences between guest_memfd and normal memfd
>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
>> cannot be mapped, read or written by userspace.
>>
>> In QEMU's implementation, shared memory is allocated with normal methods
>> (e.g. mmap or fallocate) while private memory is allocated from
>> guest_memfd. When a VM performs memory conversions, QEMU frees pages via
>> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
>> allocates new pages from the other side.
>>
>> Problem
>> =======
>> Device assignment in QEMU is implemented via the VFIO system. In a normal
>> VM, VM memory is pinned at the beginning of time by VFIO. In a
>> confidential VM, the VM can convert memory, and when that happens
>> nothing currently tells VFIO that its mappings are stale. This means
>> that page conversion leaks memory and leaves stale IOMMU mappings. For
>> example, a sequence like the following can result in stale IOMMU
>> mappings:
>>
>> 1. allocate shared page
>> 2. convert page shared->private
>> 3. discard shared page
>> 4. convert page private->shared
>> 5. allocate shared page
>> 6. issue DMA operations against that shared page
>>
>> After step 3, VFIO is still pinning the page. As a result, DMA
>> operations in step 6 will hit the old mapping that was allocated in
>> step 1, which causes the device to access invalid data.
>>
>> Currently, commit 852f0048f3 ("RAMBlock: make guest_memfd require
>> uncoordinated discard") has blocked device assignment with
>> guest_memfd to avoid this problem.
>>
>> Solution
>> ========
>> The key to enabling shared device assignment is to solve the stale IOMMU
>> mappings problem.
>>
>> Given the constraints and assumptions, here is a solution that satisfies
>> the use cases.
>> RamDiscardManager, an existing interface currently
>> utilized by virtio-mem, offers a means to modify IOMMU mappings in
>> accordance with VM page assignment. Page conversion is similar to
>> hot-removing a page in one mode and adding it back in the other.
>>
>> This series implements a RamDiscardManager for confidential VMs and
>> utilizes its infrastructure to notify VFIO of page conversions.
>>
>> Another possible attempt [2] was to not discard shared pages in step 3
>> above. This was an incomplete band-aid because guests would consume
>> twice the memory since shared pages wouldn't be freed even after they
>> were converted to private.
>>
>> Open
>> ====
>> Implementing a RamDiscardManager to notify VFIO of page conversions
>> causes changes in semantics: private memory is treated as discarded (or
>> hot-removed) memory. This isn't aligned with the expectation of current
>> RamDiscardManager users (e.g. VFIO or live migration), who really
>> expect that discarded memory is hot-removed and thus can be skipped when
>> the users are processing guest memory. Treating private memory as
>> discarded won't work in the future if VFIO or live migration needs to
>> handle private memory. For example, VFIO may need to map private memory
>> to support Trusted IO, and live migration for confidential VMs needs to
>> migrate private memory.
>>
>> There are two possible ways to mitigate the semantics changes.
>> 1. Develop a new mechanism to notify users of page conversions between
>> private and shared. For example, utilize the notifier_list in QEMU. VFIO
>> registers its own handler and gets notified upon page conversions. This
>> is a clean approach which only touches the notifier workflow. A
>> challenge is that for device hotplug, existing shared memory should be
>> mapped in IOMMU. This will need additional changes.
>>
>> 2. Extend the existing RamDiscardManager interface to manage not only
>> the discarded/populated status of guest memory but also the
>> shared/private status.
>> RamDiscardManager users like VFIO will be
>> notified with one more argument indicating what change is happening and
>> can take action accordingly. It also has challenges: e.g., QEMU allows
>> only one RamDiscardManager, so how to support virtio-mem for
>> confidential VMs would be a problem. And some APIs like .is_populated()
>> exposed by RamDiscardManager are meaningless to shared/private memory,
>> so they may need some adjustments.
>>
>> Testing
>> =======
>> This patch series is tested based on the internal TDX KVM/QEMU tree.
>>
>> To test shared device assignment with a NIC, use the legacy
>> type1 VFIO with the QEMU command:
>>
>> qemu-system-x86_64 [...]
>>     -device vfio-pci,host=XX:XX.X
>>
>> The dma_entry_limit parameter needs to be adjusted. For example, a
>> 16GB guest needs a setting like
>> vfio_iommu_type1.dma_entry_limit=4194304.
>>
>> If using the iommufd-backed VFIO, use the QEMU command:
>>
>> qemu-system-x86_64 [...]
>>     -object iommufd,id=iommufd0 \
>>     -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
>>
>> No additional adjustment is required.
>>
>> Following the bootup of the TD guest, the guest's IP address becomes
>> visible, and iperf is able to successfully send and receive data.
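
The dma_entry_limit value quoted above equals the number of 4 KiB pages in
a 16 GiB guest, which is what would be needed if every page could end up as
its own type1 DMA mapping. That reading is an assumption; the cover letter
does not say how the number was derived. The helper name below is
hypothetical and the snippet only shows the arithmetic:

```python
GIB = 1 << 30


def dma_entries_needed(guest_bytes: int, page_bytes: int = 4096) -> int:
    """Worst-case mapping count, assuming one DMA entry per page (assumption)."""
    # Ceiling division so a partial trailing page still counts as one entry.
    return -(-guest_bytes // page_bytes)


# 16 GiB of guest memory split into 4 KiB pages.
print(dma_entries_needed(16 * GIB))  # 4194304
```

Other guest sizes would scale linearly under the same assumption.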
>>
>> Related link
>> ============
>> [1] https://lore.kernel.org/all/d6acfbef-96a1-42bc-8866-c12a4de8c57c@xxxxxxxxxx/
>> [2] https://lore.kernel.org/all/20240320083945.991426-20-michael.roth@xxxxxxx/
>>
>> Chenyi Qiang (6):
>>   guest_memfd: Introduce an object to manage the guest-memfd with
>>     RamDiscardManager
>>   guest_memfd: Introduce a helper to notify the shared/private state
>>     change
>>   KVM: Notify the state change via RamDiscardManager helper during
>>     shared/private conversion
>>   memory: Register the RamDiscardManager instance upon guest_memfd
>>     creation
>>   guest-memfd: Default to discarded (private) in guest_memfd_manager
>>   RAMBlock: make guest_memfd require coordinate discard
>>
>>  accel/kvm/kvm-all.c                  |   7 +
>>  include/sysemu/guest-memfd-manager.h |  49 +++
>>  system/guest-memfd-manager.c         | 425 +++++++++++++++++++++++++++
>>  system/meson.build                   |   1 +
>>  system/physmem.c                     |  11 +-
>>  5 files changed, 492 insertions(+), 1 deletion(-)
>>  create mode 100644 include/sysemu/guest-memfd-manager.h
>>  create mode 100644 system/guest-memfd-manager.c
>>
>>
>> base-commit: 900536d3e97aed7fdd9cb4dadd3bf7023360e819
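
For readers following the "Problem" section, the six-step sequence can be
replayed in a toy model. This is a minimal sketch, not QEMU or VFIO code:
ToyGuestMemory, ToyIommu and run_sequence are hypothetical names, host
pages are plain integers, and the listener callback only stands in for the
kind of conversion notification the series routes through
RamDiscardManager.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Called with (guest address, new host page), or None when the page is discarded.
Listener = Callable[[int, Optional[int]], None]


class ToyGuestMemory:
    """Tracks which host page backs each shared guest page (toy model)."""

    def __init__(self) -> None:
        self._next_host_page = 0
        self.shared: Dict[int, int] = {}
        self.listeners: List[Listener] = []

    def populate_shared(self, gpa: int) -> int:
        # Steps 1 and 5: allocate a fresh host page on the shared side.
        self._next_host_page += 1
        self.shared[gpa] = self._next_host_page
        self._notify(gpa, self.shared[gpa])
        return self.shared[gpa]

    def discard_shared(self, gpa: int) -> None:
        # Steps 2-3: the page turns private and its shared backing is freed.
        del self.shared[gpa]
        self._notify(gpa, None)

    def _notify(self, gpa: int, host_page: Optional[int]) -> None:
        for listener in self.listeners:
            listener(gpa, host_page)


class ToyIommu:
    """Guest address -> host page table standing in for the IOMMU mappings."""

    def __init__(self) -> None:
        self.mappings: Dict[int, int] = {}

    def on_conversion(self, gpa: int, host_page: Optional[int]) -> None:
        if host_page is None:
            self.mappings.pop(gpa, None)    # unmap on shared -> private
        else:
            self.mappings[gpa] = host_page  # map on private -> shared


def run_sequence(notify_conversions: bool) -> Tuple[Optional[int], int]:
    mem, iommu, gpa = ToyGuestMemory(), ToyIommu(), 0x1000
    iommu.mappings[gpa] = mem.populate_shared(gpa)  # step 1, mapped up front
    if notify_conversions:
        mem.listeners.append(iommu.on_conversion)
    mem.discard_shared(gpa)                         # steps 2-3
    current = mem.populate_shared(gpa)              # steps 4-5
    # Step 6: the page DMA would hit versus the page that now backs the guest.
    return iommu.mappings.get(gpa), current


print(run_sequence(False))  # (1, 2): DMA still targets the page from step 1
print(run_sequence(True))   # (2, 2): the mapping follows the conversion
```

Without the listener the mapping from step 1 survives the discard, which is
the stale-mapping case; with it, the unmap and remap keep the table in step
with the conversions.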