On 1/10/2025 9:42 AM, Alexey Kardashevskiy wrote: > > > On 9/1/25 19:49, Chenyi Qiang wrote: >> >> >> On 1/9/2025 4:18 PM, Alexey Kardashevskiy wrote: >>> >>> >>> On 9/1/25 18:52, Chenyi Qiang wrote: >>>> >>>> >>>> On 1/8/2025 7:38 PM, Alexey Kardashevskiy wrote: >>>>> >>>>> >>>>> On 8/1/25 17:28, Chenyi Qiang wrote: >>>>>> Thanks Alexey for your review! >>>>>> >>>>>> On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote: >>>>>>> On 13/12/24 18:08, Chenyi Qiang wrote: >>>>>>>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require >>>>>>>> uncoordinated >>>>>>>> discard") effectively disables device assignment when using >>>>>>>> guest_memfd. >>>>>>>> This poses a significant challenge as guest_memfd is essential for >>>>>>>> confidential guests, thereby blocking device assignment to these >>>>>>>> VMs. >>>>>>>> The initial rationale for disabling device assignment was due to >>>>>>>> stale >>>>>>>> IOMMU mappings (see Problem section) and the assumption that TEE >>>>>>>> I/O >>>>>>>> (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device- >>>>>>>> assignment >>>>>>>> problem for confidential guests [1]. However, this assumption has >>>>>>>> proven >>>>>>>> to be incorrect. TEE I/O relies on the ability to operate devices >>>>>>>> against >>>>>>>> "shared" or untrusted memory, which is crucial for device >>>>>>>> initialization >>>>>>>> and error recovery scenarios. As a result, the current >>>>>>>> implementation >>>>>>>> does >>>>>>>> not adequately support device assignment for confidential guests, >>>>>>>> necessitating >>>>>>>> a reevaluation of the approach to ensure compatibility and >>>>>>>> functionality. >>>>>>>> >>>>>>>> This series enables shared device assignment by notifying VFIO of >>>>>>>> page >>>>>>>> conversions using an existing framework named RamDiscardListener. >>>>>>>> Additionally, there is an ongoing patch set [2] that aims to add 1G >>>>>>>> page >>>>>>>> support for guest_memfd. This patch set introduces in-place page >>>>>>>> conversion, >>>>>>>> where private and shared memory share the same physical pages as >>>>>>>> the >>>>>>>> backend. >>>>>>>> This development may impact our solution. >>>>>>>> >>>>>>>> We presented our solution in the guest_memfd meeting to discuss its >>>>>>>> compatibility with the new changes and potential future directions >>>>>>>> (see [3] >>>>>>>> for more details). The conclusion was that, although our >>>>>>>> solution may >>>>>>>> not be >>>>>>>> the most elegant (see the Limitation section), it is sufficient for >>>>>>>> now and >>>>>>>> can be easily adapted to future changes. >>>>>>>> >>>>>>>> We are re-posting the patch series with some cleanup and have >>>>>>>> removed >>>>>>>> the RFC >>>>>>>> label for the main enabling patches (1-6). The newly-added patch >>>>>>>> 7 is >>>>>>>> still >>>>>>>> marked as RFC as it tries to resolve some extension concerns >>>>>>>> related to >>>>>>>> RamDiscardManager for future usage. >>>>>>>> >>>>>>>> The overview of the patches: >>>>>>>> - Patch 1: Export a helper to get intersection of a >>>>>>>> MemoryRegionSection >>>>>>>> with a given range. >>>>>>>> - Patch 2-6: Introduce a new object to manage the guest-memfd with >>>>>>>> RamDiscardManager, and notify the shared/private state change >>>>>>>> during >>>>>>>> conversion. >>>>>>>> - Patch 7: Try to resolve a semantics concern related to >>>>>>>> RamDiscardManager >>>>>>>> i.e. RamDiscardManager is used to manage memory plug/unplug >>>>>>>> state >>>>>>>> instead of shared/private state. It would affect future >>>>>>>> users of >>>>>>>> RamDiscardManger in confidential VMs. Attach it behind as >>>>>>>> a RFC >>>>>>>> patch[4]. >>>>>>>> >>>>>>>> Changes since last version: >>>>>>>> - Add a patch to export some generic helper functions from >>>>>>>> virtio-mem >>>>>>>> code. >>>>>>>> - Change the bitmap in guest_memfd_manager from default shared to >>>>>>>> default >>>>>>>> private. This keeps alignment with virtio-mem that 1- >>>>>>>> setting in >>>>>>>> bitmap >>>>>>>> represents the populated state and may help to export more >>>>>>>> generic >>>>>>>> code >>>>>>>> if necessary. >>>>>>>> - Add the helpers to initialize/uninitialize the >>>>>>>> guest_memfd_manager >>>>>>>> instance >>>>>>>> to make it more clear. >>>>>>>> - Add a patch to distinguish between the shared/private state >>>>>>>> change >>>>>>>> and >>>>>>>> the memory plug/unplug state change in RamDiscardManager. >>>>>>>> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1- >>>>>>>> chenyi.qiang@xxxxxxxxx/ >>>>>>>> >>>>>>>> --- >>>>>>>> >>>>>>>> Background >>>>>>>> ========== >>>>>>>> Confidential VMs have two classes of memory: shared and private >>>>>>>> memory. >>>>>>>> Shared memory is accessible from the host/VMM while private >>>>>>>> memory is >>>>>>>> not. Confidential VMs can decide which memory is shared/private and >>>>>>>> convert memory between shared/private at runtime. >>>>>>>> >>>>>>>> "guest_memfd" is a new kind of fd whose primary goal is to serve >>>>>>>> guest >>>>>>>> private memory. The key differences between guest_memfd and normal >>>>>>>> memfd >>>>>>>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner >>>>>>>> VM and >>>>>>>> cannot be mapped, read or written by userspace. >>>>>>> >>>>>>> The "cannot be mapped" seems to be not true soon anymore (if not >>>>>>> already). >>>>>>> >>>>>>> https://lore.kernel.org/all/20240801090117.3841080-1- >>>>>>> tabba@xxxxxxxxxx/T/ >>>>>> >>>>>> Exactly, allowing guest_memfd to do mmap is the direction. I >>>>>> mentioned >>>>>> it below with in-place page conversion. Maybe I would move it here to >>>>>> make it more clear. >>>>>> >>>>>>> >>>>>>> >>>>>>>> >>>>>>>> In QEMU's implementation, shared memory is allocated with normal >>>>>>>> methods >>>>>>>> (e.g. mmap or fallocate) while private memory is allocated from >>>>>>>> guest_memfd. When a VM performs memory conversions, QEMU frees >>>>>>>> pages >>>>>>>> via >>>>>>>> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one >>>>>>>> side and >>>>>>>> allocates new pages from the other side. >>>>>>>> >>>>>> >>>>>> [...] >>>>>> >>>>>>>> >>>>>>>> One limitation (also discussed in the guest_memfd meeting) is that >>>>>>>> VFIO >>>>>>>> expects the DMA mapping for a specific IOVA to be mapped and >>>>>>>> unmapped >>>>>>>> with >>>>>>>> the same granularity. The guest may perform partial conversions, >>>>>>>> such as >>>>>>>> converting a small region within a larger region. To prevent such >>>>>>>> invalid >>>>>>>> cases, all operations are performed with 4K granularity. The >>>>>>>> possible >>>>>>>> solutions we can think of are either to enable VFIO to support >>>>>>>> partial >>>>>>>> unmap >>>>> >>>>> btw the old VFIO does not split mappings but iommufd seems to be >>>>> capable >>>>> of it - there is iopt_area_split(). What happens if you try >>>>> unmapping a >>>>> smaller chunk that does not exactly match any mapped chunk? thanks, >>>> >>>> iopt_cut_iova() happens in iommufd vfio_compat.c, which is to make >>>> iommufd be compatible with old VFIO_TYPE1. IIUC, it happens with >>>> disable_large_page=true. That means the large IOPTE is also disabled in >>>> IOMMU. So it can do the split easily. See the comment in >>>> iommufd_vfio_set_iommu(). >>>> >>>> iommufd VFIO compatible mode is a transition from legacy VFIO to >>>> iommufd. For the normal iommufd, it requires the iova/length must be a >>>> superset of a previously mapped range. If not match, will return error. >>> >>> >>> This is all true but this also means that "The former requires complex >>> changes in VFIO" is not entirely true - some code is already there. >>> Thanks, >> >> Hmm, my statement is a little confusing. The bottleneck is that the >> IOMMU driver doesn't support the large page split. So if we want to >> enable large page and want to do partial unmap, it requires complex >> change. > > We won't need to split large pages (if we stick to 4K for now), we need > to split large mappings (not large pages) to allow partial unmapping and > iopt_area_split() seems to be doing this. Thanks, You mean we can disable large page in iommufd and then VFIO will be able to do partial unmap. Yes, I think it is doable and we can avoid many ioctl context switches overhead. > > >> >>> >>> >>> >> >