Re: [PATCH 0/7] Enable shared device assignment

Thanks Alexey for your review!

On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
> On 13/12/24 18:08, Chenyi Qiang wrote:
>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>> discard") effectively disables device assignment when using guest_memfd.
>> This poses a significant challenge as guest_memfd is essential for
>> confidential guests, thereby blocking device assignment to these VMs.
>> The initial rationale for disabling device assignment was due to stale
>> IOMMU mappings (see Problem section) and the assumption that TEE I/O
>> (SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-assignment
>> problem for confidential guests [1]. However, this assumption has proven
>> to be incorrect. TEE I/O relies on the ability to operate devices against
>> "shared" or untrusted memory, which is crucial for device initialization
>> and error recovery scenarios. As a result, the current implementation
>> does not adequately support device assignment for confidential guests,
>> necessitating a reevaluation of the approach to ensure compatibility
>> and functionality.
>>
>> This series enables shared device assignment by notifying VFIO of page
>> conversions using an existing framework named RamDiscardListener.
>> Additionally, there is an ongoing patch set [2] that aims to add 1G page
>> support for guest_memfd. This patch set introduces in-place page
>> conversion, where private and shared memory share the same physical
>> pages as the backend. This development may impact our solution.
>>
>> We presented our solution in the guest_memfd meeting to discuss its
>> compatibility with the new changes and potential future directions
>> (see [3] for more details). The conclusion was that, although our
>> solution may not be the most elegant (see the Limitation section), it
>> is sufficient for now and can be easily adapted to future changes.
>>
>> We are re-posting the patch series with some cleanup and have removed
>> the RFC label for the main enabling patches (1-6). The newly added
>> patch 7 is still marked as RFC as it tries to resolve some extension
>> concerns related to RamDiscardManager for future usage.
>>
>> The overview of the patches:
>> - Patch 1: Export a helper to get intersection of a MemoryRegionSection
>>    with a given range.
>> - Patch 2-6: Introduce a new object to manage guest_memfd with
>>    RamDiscardManager, and notify the shared/private state change during
>>    conversion.
>> - Patch 7: Try to resolve a semantics concern related to
>>    RamDiscardManager, i.e. RamDiscardManager is used to manage memory
>>    plug/unplug state instead of shared/private state. It would affect
>>    future users of RamDiscardManager in confidential VMs. It is attached
>>    at the end as an RFC patch [4].
>>
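To illustrate what the Patch 1 helper computes, here is a minimal sketch of
intersecting a section with a range. It is not the QEMU code: the function
name `intersect_section` and the plain (offset, size) integers standing in
for a MemoryRegionSection are assumptions made for this example.

```python
def intersect_section(section_offset, section_size, offset, size):
    """Return (start, length) of the overlap between a section and a range.

    Returns None when the two do not overlap. All values are byte offsets
    and sizes within the same memory region (hypothetical representation).
    """
    start = max(section_offset, offset)
    end = min(section_offset + section_size, offset + size)
    if end <= start:
        return None  # disjoint, or only touching at a boundary
    return (start, end - start)


# A section covering [0x1000, 0x5000) clipped against the range [0x3000, 0x7000).
print(intersect_section(0x1000, 0x4000, 0x3000, 0x4000))
```

A listener that only covers part of a converted range would act on the
clipped result and ignore the rest.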
>> Changes since last version:
>> - Add a patch to export some generic helper functions from virtio-mem
>>    code.
>> - Change the bitmap in guest_memfd_manager from default shared to default
>>    private. This aligns with virtio-mem, where a set bit in the bitmap
>>    represents the populated state, and may help to export more generic
>>    code if necessary.
>> - Add the helpers to initialize/uninitialize the guest_memfd_manager
>>    instance to make it clearer.
>> - Add a patch to distinguish between the shared/private state change and
>>    the memory plug/unplug state change in RamDiscardManager.
>> - RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-chenyi.qiang@xxxxxxxxx/
>>
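The "default private" bitmap and the state-change notification described in
the list above can be pictured with a toy model. This is only a sketch under
assumptions: the class and method names are hypothetical, a Python list stands
in for the bitmap, and listeners are plain callbacks rather than
RamDiscardListener objects.

```python
PAGE_SIZE = 4096  # 4K tracking granularity


class SharedStateBitmap:
    """Toy per-page tracker: 1 = shared (populated), 0 = private (discarded)."""

    def __init__(self, region_size):
        self.bits = [0] * (region_size // PAGE_SIZE)  # everything starts private
        self.listeners = []

    def register_listener(self, callback):
        # callback(offset, size, to_shared) is called once per page that changes
        self.listeners.append(callback)

    def convert(self, offset, size, to_shared):
        """Record a conversion and notify listeners for pages whose state changed."""
        new_state = 1 if to_shared else 0
        first = offset // PAGE_SIZE
        last = (offset + size - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            if self.bits[page] == new_state:
                continue  # already in the requested state, nothing to notify
            self.bits[page] = new_state
            for callback in self.listeners:
                callback(page * PAGE_SIZE, PAGE_SIZE, to_shared)


events = []
bitmap = SharedStateBitmap(8 * PAGE_SIZE)
bitmap.register_listener(lambda off, sz, shared: events.append((off, sz, shared)))
bitmap.convert(0x1000, 0x2000, to_shared=True)   # two pages become shared
bitmap.convert(0x2000, 0x1000, to_shared=False)  # one of them goes back to private
```

In this model a VFIO-like listener would map a page on a shared notification
and unmap it on a private one.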
>> ---
>>
>> Background
>> ==========
>> Confidential VMs have two classes of memory: shared and private memory.
>> Shared memory is accessible from the host/VMM while private memory is
>> not. Confidential VMs can decide which memory is shared/private and
>> convert memory between shared/private at runtime.
>>
>> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
>> private memory. The key differences between guest_memfd and normal memfd
>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
>> cannot be mapped, read or written by userspace.
> 
> The "cannot be mapped" part will soon no longer be true (if it is not
> the case already).
> 
> https://lore.kernel.org/all/20240801090117.3841080-1-tabba@xxxxxxxxxx/T/

Exactly, allowing guest_memfd to be mmap'ed is the direction. I mentioned
it below together with in-place page conversion. I could move that note
here to make it clearer.

> 
> 
>>
>> In QEMU's implementation, shared memory is allocated with normal methods
>> (e.g. mmap or fallocate) while private memory is allocated from
>> guest_memfd. When a VM performs memory conversions, QEMU frees pages via
>> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
>> allocates new pages from the other side.
>>

[...]

>>
>> One limitation (also discussed in the guest_memfd meeting) is that VFIO
>> expects the DMA mapping for a specific IOVA to be mapped and unmapped
>> with the same granularity. The guest may perform partial conversions,
>> such as converting a small region within a larger region. To prevent
>> such invalid cases, all operations are performed with 4K granularity.
>> The possible solutions we can think of are either to enable VFIO to
>> support partial unmap or to implement an enlightened guest to avoid
>> partial conversion. The former requires complex changes in VFIO, while
>> the latter requires the page conversion to be a guest-enlightened
>> behavior. It is still uncertain which option is preferred.
> 
> in-place memory conversion is :)
> 
>>
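The same-granularity constraint in the Limitation paragraph can be pictured
with a toy model. This is a sketch under assumptions, not VFIO code:
`ToyDmaMap` is a hypothetical class, and the only rule it models is that an
unmap request may not split an existing mapping.

```python
PAGE_SIZE = 4096


class ToyDmaMap:
    """Toy IOVA map in which an unmap may not split an existing mapping."""

    def __init__(self):
        self.mappings = {}  # iova -> size

    def map(self, iova, size):
        self.mappings[iova] = size

    def unmap(self, iova, size):
        """Remove every mapping fully inside [iova, iova + size)."""
        end = iova + size
        covered = []
        for m_iova, m_size in self.mappings.items():
            m_end = m_iova + m_size
            if m_end <= iova or m_iova >= end:
                continue  # no overlap with the request
            if m_iova < iova or m_end > end:
                raise ValueError("unmap would split an existing mapping")
            covered.append(m_iova)
        for m_iova in covered:
            del self.mappings[m_iova]
        return len(covered)


# One 2M mapping: converting a single 4K page inside it cannot be unmapped.
coarse = ToyDmaMap()
coarse.map(0, 2 * 1024 * 1024)
try:
    coarse.unmap(0x1000, PAGE_SIZE)
    partial_unmap_ok = True
except ValueError:
    partial_unmap_ok = False

# The same 2M mapped page by page: any 4K page can be unmapped on its own.
fine = ToyDmaMap()
for i in range(512):
    fine.map(i * PAGE_SIZE, PAGE_SIZE)
removed = fine.unmap(0x1000, PAGE_SIZE)
```

The cost of the page-by-page approach is the number of mappings, which is why
the type1 backend needs a larger dma_entry_limit.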
>> Testing
>> =======
>> This patch series is tested with the KVM/QEMU branch:
>> KVM: https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-11-20
>> QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2024-12-13
> 
> 
> The branch is gone now? tdx-upstream-snapshot-2024-12-18 seems to have
> these though. Thanks,

Thanks for pointing it out. You're right,
tdx-upstream-snapshot-2024-12-18 is the latest branch. I added the fixup
for patch 1 and forgot to update the change here.

> 
>>
>> To test shared device assignment with a NIC, use the legacy type1 VFIO
>> backend with the QEMU command:
>>
>> qemu-system-x86_64 [...]
>>      -device vfio-pci,host=XX:XX.X
>>
>> The dma_entry_limit parameter needs to be adjusted. For example, a
>> 16GB guest needs a setting such as
>> vfio_iommu_type1.dma_entry_limit=4194304.
>>
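The example value matches one DMA entry per 4K page of guest memory. A minimal
sketch of that arithmetic follows, assuming every page is mapped separately
and treating 16GB as 16 GiB; the helper name is hypothetical, and any other
mappings the device needs would come on top of this estimate.

```python
PAGE_SIZE = 4096  # 4K mapping granularity used by the series
GiB = 1 << 30


def dma_entries_needed(guest_mem_bytes, page_size=PAGE_SIZE):
    """Entries needed if each page gets its own mapping (ceiling division)."""
    return -(-guest_mem_bytes // page_size)


print(dma_entries_needed(16 * GiB))  # 4194304
```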
>> If using the iommufd-backed VFIO, use the QEMU command:
>>
>> qemu-system-x86_64 [...] \
>>      -object iommufd,id=iommufd0 \
>>      -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
>>
>> No additional adjustment is required.
>>
>> Following the bootup of the TD guest, the guest's IP address becomes
>> visible, and iperf is able to successfully send and receive data.

> 




