On 1/8/2025 7:38 PM, Alexey Kardashevskiy wrote: > > > On 8/1/25 17:28, Chenyi Qiang wrote: >> Thanks Alexey for your review! >> >> On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote: >>> On 13/12/24 18:08, Chenyi Qiang wrote: >>>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated >>>> discard") effectively disables device assignment when using >>>> guest_memfd. >>>> This poses a significant challenge as guest_memfd is essential for >>>> confidential guests, thereby blocking device assignment to these VMs. >>>> The initial rationale for disabling device assignment was due to stale >>>> IOMMU mappings (see Problem section) and the assumption that TEE I/O >>>> (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-assignment >>>> problem for confidential guests [1]. However, this assumption has >>>> proven >>>> to be incorrect. TEE I/O relies on the ability to operate devices >>>> against >>>> "shared" or untrusted memory, which is crucial for device >>>> initialization >>>> and error recovery scenarios. As a result, the current implementation >>>> does >>>> not adequately support device assignment for confidential guests, >>>> necessitating >>>> a reevaluation of the approach to ensure compatibility and >>>> functionality. >>>> >>>> This series enables shared device assignment by notifying VFIO of page >>>> conversions using an existing framework named RamDiscardListener. >>>> Additionally, there is an ongoing patch set [2] that aims to add 1G >>>> page >>>> support for guest_memfd. This patch set introduces in-place page >>>> conversion, >>>> where private and shared memory share the same physical pages as the >>>> backend. >>>> This development may impact our solution. >>>> >>>> We presented our solution in the guest_memfd meeting to discuss its >>>> compatibility with the new changes and potential future directions >>>> (see [3] >>>> for more details). The conclusion was that, although our solution may >>>> not be >>>> the most elegant (see the Limitation section), it is sufficient for >>>> now and >>>> can be easily adapted to future changes. >>>> >>>> We are re-posting the patch series with some cleanup and have removed >>>> the RFC >>>> label for the main enabling patches (1-6). The newly-added patch 7 is >>>> still >>>> marked as RFC as it tries to resolve some extension concerns related to >>>> RamDiscardManager for future usage. >>>> >>>> The overview of the patches: >>>> - Patch 1: Export a helper to get intersection of a MemoryRegionSection >>>> with a given range. >>>> - Patch 2-6: Introduce a new object to manage the guest-memfd with >>>> RamDiscardManager, and notify the shared/private state change >>>> during >>>> conversion. >>>> - Patch 7: Try to resolve a semantics concern related to >>>> RamDiscardManager >>>> i.e. RamDiscardManager is used to manage memory plug/unplug state >>>> instead of shared/private state. It would affect future users of >>>> RamDiscardManger in confidential VMs. Attach it behind as a RFC >>>> patch[4]. >>>> >>>> Changes since last version: >>>> - Add a patch to export some generic helper functions from virtio-mem >>>> code. >>>> - Change the bitmap in guest_memfd_manager from default shared to >>>> default >>>> private. This keeps alignment with virtio-mem that 1-setting in >>>> bitmap >>>> represents the populated state and may help to export more generic >>>> code >>>> if necessary. >>>> - Add the helpers to initialize/uninitialize the guest_memfd_manager >>>> instance >>>> to make it more clear. >>>> - Add a patch to distinguish between the shared/private state change >>>> and >>>> the memory plug/unplug state change in RamDiscardManager. >>>> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1- >>>> chenyi.qiang@xxxxxxxxx/ >>>> >>>> --- >>>> >>>> Background >>>> ========== >>>> Confidential VMs have two classes of memory: shared and private memory. >>>> Shared memory is accessible from the host/VMM while private memory is >>>> not. Confidential VMs can decide which memory is shared/private and >>>> convert memory between shared/private at runtime. >>>> >>>> "guest_memfd" is a new kind of fd whose primary goal is to serve guest >>>> private memory. The key differences between guest_memfd and normal >>>> memfd >>>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner >>>> VM and >>>> cannot be mapped, read or written by userspace. >>> >>> The "cannot be mapped" seems to be not true soon anymore (if not >>> already). >>> >>> https://lore.kernel.org/all/20240801090117.3841080-1-tabba@xxxxxxxxxx/T/ >> >> Exactly, allowing guest_memfd to do mmap is the direction. I mentioned >> it below with in-place page conversion. Maybe I would move it here to >> make it more clear. >> >>> >>> >>>> >>>> In QEMU's implementation, shared memory is allocated with normal >>>> methods >>>> (e.g. mmap or fallocate) while private memory is allocated from >>>> guest_memfd. When a VM performs memory conversions, QEMU frees pages >>>> via >>>> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and >>>> allocates new pages from the other side. >>>> >> >> [...] >> >>>> >>>> One limitation (also discussed in the guest_memfd meeting) is that VFIO >>>> expects the DMA mapping for a specific IOVA to be mapped and unmapped >>>> with >>>> the same granularity. The guest may perform partial conversions, >>>> such as >>>> converting a small region within a larger region. To prevent such >>>> invalid >>>> cases, all operations are performed with 4K granularity. The possible >>>> solutions we can think of are either to enable VFIO to support partial >>>> unmap > > btw the old VFIO does not split mappings but iommufd seems to be capable > of it - there is iopt_area_split(). What happens if you try unmapping a > smaller chunk that does not exactly match any mapped chunk? thanks, iopt_cut_iova() happens in iommufd vfio_compat.c, which is to make iommufd be compatible with old VFIO_TYPE1. IIUC, it happens with disable_large_page=true. That means the large IOPTE is also disabled in IOMMU. So it can do the split easily. See the comment in iommufd_vfio_set_iommu(). iommufd VFIO compatible mode is a transition from legacy VFIO to iommufd. For the normal iommufd, it requires the iova/length must be a superset of a previously mapped range. If not match, will return error. > >