Re: [PATCH 0/7] Enable shared device assignment

Chenyi Qiang <chenyi.qiang@xxxxxxxxx> · Fri, 10 Jan 2025 15:06:05 +0800

On 1/10/2025 9:42 AM, Alexey Kardashevskiy wrote:
> 
> 
> On 9/1/25 19:49, Chenyi Qiang wrote:
>>
>>
>> On 1/9/2025 4:18 PM, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 9/1/25 18:52, Chenyi Qiang wrote:
>>>>
>>>>
>>>> On 1/8/2025 7:38 PM, Alexey Kardashevskiy wrote:
>>>>>
>>>>>
>>>>> On 8/1/25 17:28, Chenyi Qiang wrote:
>>>>>> Thanks Alexey for your review!
>>>>>>
>>>>>> On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
>>>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>>>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>>>> uncoordinated
>>>>>>>> discard") effectively disables device assignment when using
>>>>>>>> guest_memfd.
>>>>>>>> This poses a significant challenge as guest_memfd is essential for
>>>>>>>> confidential guests, thereby blocking device assignment to these
>>>>>>>> VMs.
>>>>>>>> The initial rationale for disabling device assignment was due to
>>>>>>>> stale
>>>>>>>> IOMMU mappings (see Problem section) and the assumption that TEE
>>>>>>>> I/O
>>>>>>>> (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-
>>>>>>>> assignment
>>>>>>>> problem for confidential guests [1]. However, this assumption has
>>>>>>>> proven
>>>>>>>> to be incorrect. TEE I/O relies on the ability to operate devices
>>>>>>>> against
>>>>>>>> "shared" or untrusted memory, which is crucial for device
>>>>>>>> initialization
>>>>>>>> and error recovery scenarios. As a result, the current
>>>>>>>> implementation
>>>>>>>> does
>>>>>>>> not adequately support device assignment for confidential guests,
>>>>>>>> necessitating
>>>>>>>> a reevaluation of the approach to ensure compatibility and
>>>>>>>> functionality.
>>>>>>>>
>>>>>>>> This series enables shared device assignment by notifying VFIO of
>>>>>>>> page
>>>>>>>> conversions using an existing framework named RamDiscardListener.
>>>>>>>> Additionally, there is an ongoing patch set [2] that aims to add 1G
>>>>>>>> page
>>>>>>>> support for guest_memfd. This patch set introduces in-place page
>>>>>>>> conversion,
>>>>>>>> where private and shared memory share the same physical pages as
>>>>>>>> the
>>>>>>>> backend.
>>>>>>>> This development may impact our solution.
>>>>>>>>
>>>>>>>> We presented our solution in the guest_memfd meeting to discuss its
>>>>>>>> compatibility with the new changes and potential future directions
>>>>>>>> (see [3]
>>>>>>>> for more details). The conclusion was that, although our
>>>>>>>> solution may
>>>>>>>> not be
>>>>>>>> the most elegant (see the Limitation section), it is sufficient for
>>>>>>>> now and
>>>>>>>> can be easily adapted to future changes.
>>>>>>>>
>>>>>>>> We are re-posting the patch series with some cleanup and have
>>>>>>>> removed
>>>>>>>> the RFC
>>>>>>>> label for the main enabling patches (1-6). The newly-added patch
>>>>>>>> 7 is
>>>>>>>> still
>>>>>>>> marked as RFC as it tries to resolve some extension concerns
>>>>>>>> related to
>>>>>>>> RamDiscardManager for future usage.
>>>>>>>>
>>>>>>>> The overview of the patches:
>>>>>>>> - Patch 1: Export a helper to get intersection of a
>>>>>>>> MemoryRegionSection
>>>>>>>>       with a given range.
>>>>>>>> - Patch 2-6: Introduce a new object to manage the guest-memfd with
>>>>>>>>       RamDiscardManager, and notify the shared/private state change
>>>>>>>> during
>>>>>>>>       conversion.
>>>>>>>> - Patch 7: Try to resolve a semantics concern related to
>>>>>>>> RamDiscardManager
>>>>>>>>       i.e. RamDiscardManager is used to manage memory plug/unplug
>>>>>>>> state
>>>>>>>>       instead of shared/private state. It would affect future
>>>>>>>> users of
>>>>>>>>       RamDiscardManger in confidential VMs. Attach it behind as
>>>>>>>> a RFC
>>>>>>>> patch[4].
>>>>>>>>
>>>>>>>> Changes since last version:
>>>>>>>> - Add a patch to export some generic helper functions from
>>>>>>>> virtio-mem
>>>>>>>> code.
>>>>>>>> - Change the bitmap in guest_memfd_manager from default shared to
>>>>>>>> default
>>>>>>>>       private. This keeps alignment with virtio-mem that 1-
>>>>>>>> setting in
>>>>>>>> bitmap
>>>>>>>>       represents the populated state and may help to export more
>>>>>>>> generic
>>>>>>>> code
>>>>>>>>       if necessary.
>>>>>>>> - Add the helpers to initialize/uninitialize the
>>>>>>>> guest_memfd_manager
>>>>>>>> instance
>>>>>>>>       to make it more clear.
>>>>>>>> - Add a patch to distinguish between the shared/private state
>>>>>>>> change
>>>>>>>> and
>>>>>>>>       the memory plug/unplug state change in RamDiscardManager.
>>>>>>>> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-
>>>>>>>> chenyi.qiang@xxxxxxxxx/
>>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> Background
>>>>>>>> ==========
>>>>>>>> Confidential VMs have two classes of memory: shared and private
>>>>>>>> memory.
>>>>>>>> Shared memory is accessible from the host/VMM while private
>>>>>>>> memory is
>>>>>>>> not. Confidential VMs can decide which memory is shared/private and
>>>>>>>> convert memory between shared/private at runtime.
>>>>>>>>
>>>>>>>> "guest_memfd" is a new kind of fd whose primary goal is to serve
>>>>>>>> guest
>>>>>>>> private memory. The key differences between guest_memfd and normal
>>>>>>>> memfd
>>>>>>>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner
>>>>>>>> VM and
>>>>>>>> cannot be mapped, read or written by userspace.
>>>>>>>
>>>>>>> The "cannot be mapped" seems to be not true soon anymore (if not
>>>>>>> already).
>>>>>>>
>>>>>>> https://lore.kernel.org/all/20240801090117.3841080-1-
>>>>>>> tabba@xxxxxxxxxx/T/
>>>>>>
>>>>>> Exactly, allowing guest_memfd to do mmap is the direction. I
>>>>>> mentioned
>>>>>> it below with in-place page conversion. Maybe I would move it here to
>>>>>> make it more clear.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> In QEMU's implementation, shared memory is allocated with normal
>>>>>>>> methods
>>>>>>>> (e.g. mmap or fallocate) while private memory is allocated from
>>>>>>>> guest_memfd. When a VM performs memory conversions, QEMU frees
>>>>>>>> pages
>>>>>>>> via
>>>>>>>> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one
>>>>>>>> side and
>>>>>>>> allocates new pages from the other side.
>>>>>>>>
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>>>
>>>>>>>> One limitation (also discussed in the guest_memfd meeting) is that
>>>>>>>> VFIO
>>>>>>>> expects the DMA mapping for a specific IOVA to be mapped and
>>>>>>>> unmapped
>>>>>>>> with
>>>>>>>> the same granularity. The guest may perform partial conversions,
>>>>>>>> such as
>>>>>>>> converting a small region within a larger region. To prevent such
>>>>>>>> invalid
>>>>>>>> cases, all operations are performed with 4K granularity. The
>>>>>>>> possible
>>>>>>>> solutions we can think of are either to enable VFIO to support
>>>>>>>> partial
>>>>>>>> unmap
>>>>>
>>>>> btw the old VFIO does not split mappings but iommufd seems to be
>>>>> capable
>>>>> of it - there is iopt_area_split(). What happens if you try
>>>>> unmapping a
>>>>> smaller chunk that does not exactly match any mapped chunk? thanks,
>>>>
>>>> iopt_cut_iova() happens in iommufd vfio_compat.c, which is to make
>>>> iommufd be compatible with old VFIO_TYPE1. IIUC, it happens with
>>>> disable_large_page=true. That means the large IOPTE is also disabled in
>>>> IOMMU. So it can do the split easily. See the comment in
>>>> iommufd_vfio_set_iommu().
>>>>
>>>> iommufd VFIO compatible mode is a transition from legacy VFIO to
>>>> iommufd. For the normal iommufd, it requires the iova/length must be a
>>>> superset of a previously mapped range. If not match, will return error.
>>>
>>>
>>> This is all true but this also means that "The former requires complex
>>> changes in VFIO" is not entirely true - some code is already there.
>>> Thanks,
>>
>> Hmm, my statement is a little confusing.  The bottleneck is that the
>> IOMMU driver doesn't support the large page split. So if we want to
>> enable large page and want to do partial unmap, it requires complex
>> change.
> 
> We won't need to split large pages (if we stick to 4K for now), we need
> to split large mappings (not large pages) to allow partial unmapping and
> iopt_area_split() seems to be doing this. Thanks,

You mean we can disable large page in iommufd and then VFIO will be able
to do partial unmap. Yes, I think it is doable and we can avoid many
ioctl context switches overhead.

> 
> 
>>
>>>
>>>
>>>
>>
>