Re: [PATCH 0/7] Enable shared device assignment

On 1/8/2025 7:38 PM, Alexey Kardashevskiy wrote:
> 
> 
> On 8/1/25 17:28, Chenyi Qiang wrote:
>> Thanks Alexey for your review!
>>
>> On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>>>> discard") effectively disables device assignment when using
>>>> guest_memfd. This poses a significant challenge as guest_memfd is
>>>> essential for confidential guests, thereby blocking device assignment
>>>> to these VMs. The initial rationale for disabling device assignment
>>>> was stale IOMMU mappings (see Problem section) and the assumption that
>>>> TEE I/O (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the
>>>> device-assignment problem for confidential guests [1]. However, this
>>>> assumption has proven to be incorrect. TEE I/O relies on the ability
>>>> to operate devices against "shared" or untrusted memory, which is
>>>> crucial for device initialization and error recovery scenarios. As a
>>>> result, the current implementation does not adequately support device
>>>> assignment for confidential guests, necessitating a reevaluation of
>>>> the approach to ensure compatibility and functionality.
>>>>
>>>> This series enables shared device assignment by notifying VFIO of
>>>> page conversions using an existing framework named RamDiscardListener.
>>>> Additionally, there is an ongoing patch set [2] that aims to add 1G
>>>> page support for guest_memfd. This patch set introduces in-place page
>>>> conversion, where private and shared memory share the same physical
>>>> pages as the backend. This development may impact our solution.
>>>>
>>>> We presented our solution in the guest_memfd meeting to discuss its
>>>> compatibility with the new changes and potential future directions
>>>> (see [3] for more details). The conclusion was that, although our
>>>> solution may not be the most elegant (see the Limitation section), it
>>>> is sufficient for now and can be easily adapted to future changes.
>>>>
>>>> We are re-posting the patch series with some cleanup and have removed
>>>> the RFC label for the main enabling patches (1-6). The newly-added
>>>> patch 7 is still marked as RFC as it tries to resolve some extension
>>>> concerns related to RamDiscardManager for future usage.
>>>>
>>>> The overview of the patches:
>>>> - Patch 1: Export a helper to get the intersection of a
>>>>     MemoryRegionSection with a given range.
>>>> - Patch 2-6: Introduce a new object to manage the guest-memfd with
>>>>     RamDiscardManager, and notify the shared/private state change
>>>>     during conversion.
>>>> - Patch 7: Try to resolve a semantics concern related to
>>>>     RamDiscardManager, i.e. RamDiscardManager is used to manage memory
>>>>     plug/unplug state instead of shared/private state. It would affect
>>>>     future users of RamDiscardManager in confidential VMs. It is
>>>>     attached at the end as an RFC patch [4].
>>>>
>>>> Changes since last version:
>>>> - Add a patch to export some generic helper functions from virtio-mem
>>>>     code.
>>>> - Change the bitmap in guest_memfd_manager from default shared to
>>>>     default private. This keeps alignment with virtio-mem, where a set
>>>>     bit in the bitmap represents the populated state, and may help to
>>>>     export more generic code if necessary.
>>>> - Add helpers to initialize/uninitialize the guest_memfd_manager
>>>>     instance to make it clearer.
>>>> - Add a patch to distinguish between the shared/private state change
>>>>     and the memory plug/unplug state change in RamDiscardManager.
>>>> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-chenyi.qiang@xxxxxxxxx/
>>>> ---
>>>>
>>>> Background
>>>> ==========
>>>> Confidential VMs have two classes of memory: shared and private memory.
>>>> Shared memory is accessible from the host/VMM while private memory is
>>>> not. Confidential VMs can decide which memory is shared/private and
>>>> convert memory between shared/private at runtime.
>>>>
>>>> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
>>>> private memory. The key differences between guest_memfd and normal
>>>> memfd
>>>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner
>>>> VM and
>>>> cannot be mapped, read or written by userspace.
>>>
>>> The "cannot be mapped" seems to be not true soon anymore (if not
>>> already).
>>>
>>> https://lore.kernel.org/all/20240801090117.3841080-1-tabba@xxxxxxxxxx/T/
>>
>> Exactly, allowing guest_memfd to be mmap'ed is the direction. I
>> mentioned it below along with in-place page conversion. Maybe I should
>> move it up here to make it clearer.
>>
>>>
>>>
>>>>
>>>> In QEMU's implementation, shared memory is allocated with normal
>>>> methods (e.g. mmap or fallocate) while private memory is allocated
>>>> from guest_memfd. When a VM performs memory conversions, QEMU frees
>>>> pages via madvise() or via PUNCH_HOLE on memfd or guest_memfd from
>>>> one side and allocates new pages from the other side.
>>>>
>>
>> [...]
>>
>>>>
>>>> One limitation (also discussed in the guest_memfd meeting) is that
>>>> VFIO expects the DMA mapping for a specific IOVA to be mapped and
>>>> unmapped with the same granularity. The guest may perform partial
>>>> conversions, such as converting a small region within a larger region.
>>>> To prevent such invalid cases, all operations are performed with 4K
>>>> granularity. The possible solutions we can think of are either to
>>>> enable VFIO to support partial unmap
> 
> btw the old VFIO does not split mappings but iommufd seems to be capable
> of it - there is iopt_area_split(). What happens if you try unmapping a
> smaller chunk that does not exactly match any mapped chunk? thanks,

iopt_cut_iova() is only used in iommufd's vfio_compat.c, whose purpose is
to make iommufd compatible with the old VFIO_TYPE1 interface. IIUC, it
only happens with disable_large_page=true, which means large IOPTEs are
also disabled in the IOMMU, so the split can be done easily. See the
comment in iommufd_vfio_set_iommu().

The iommufd VFIO compatible mode is a transition path from legacy VFIO to
iommufd. For native iommufd, the unmapped iova/length must be a superset
of previously mapped ranges; if it does not match, the unmap returns an
error.
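
To illustrate the native iommufd behavior, below is a minimal sketch (not
part of this series; the iommufd fd, IOAS, backing buffer and IOVA are
assumed to exist already and the values are only illustrative). It maps
one 2M range and then tries to unmap a 4K subset of it, which is expected
to fail rather than split the mapping:

/*
 * Minimal sketch of the native iommufd unmap semantics (assumptions:
 * iommufd/ioas_id are already set up, buf points to a pinned 2M buffer,
 * error handling is trimmed).  Map one 2M range, then try to unmap a 4K
 * subset of it: on native iommufd this second ioctl is expected to fail
 * instead of splitting the mapping, unlike the VFIO_TYPE1 compat path
 * with large pages disabled.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int map_2m_then_unmap_4k(int iommufd, uint32_t ioas_id,
                                void *buf, uint64_t iova)
{
    struct iommu_ioas_map map = {
        .size = sizeof(map),
        .flags = IOMMU_IOAS_MAP_FIXED_IOVA |
                 IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
        .ioas_id = ioas_id,
        .user_va = (uint64_t)(uintptr_t)buf,
        .length = 2 * 1024 * 1024,      /* one 2M mapping */
        .iova = iova,
    };
    struct iommu_ioas_unmap unmap = {
        .size = sizeof(unmap),
        .ioas_id = ioas_id,
        .iova = iova,                   /* 4K subset of the 2M mapping */
        .length = 4096,
    };

    if (ioctl(iommufd, IOMMU_IOAS_MAP, &map)) {
        return -1;
    }
    /* Expected to fail: the range does not cover a whole mapped area. */
    return ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap);
}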

> 
> 




