Re: [RFC PATCH 0/6] Enable shared device assignment

Hi Paolo,

A gentle ping on this thread. In-place page conversion was discussed at
Linux Plumbers. Does that discussion give any direction for the work on
enabling shared device assignment?

Thanks
Chenyi

On 8/16/2024 11:02 AM, Chenyi Qiang wrote:
> Hi Paolo,
> 
> I hope to draw your attention to this series. TEE I/O depends on shared
> device assignment, so we introduced this RamDiscardManager (RDM) solution
> in QEMU. Given the in-place private/shared conversion option mentioned by
> David, do you think we should continue to add pass-through support for
> the current in-QEMU page conversion method, or wait for that discussion
> to settle whether conversion will move into the kernel?
> 
> Thanks
> Chenyi
> 
> On 7/25/2024 3:21 PM, Chenyi Qiang wrote:
>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>> discard") effectively disables device assignment with guest_memfd.
>> guest_memfd is required for confidential guests, so device assignment to
>> confidential guests is disabled. A supporting assumption for disabling
>> device-assignment was that TEE I/O (SEV-TIO, TDX Connect, COVE-IO
>> etc...) solves the confidential-guest device-assignment problem [1].
>> That turns out not to be the case because TEE I/O depends on being able
>> to operate devices against "shared"/untrusted memory for device
>> initialization and error recovery scenarios.
>>
>> This series utilizes an existing framework named RamDiscardManager to
>> notify VFIO of page conversions. However, one concern remains about the
>> semantics of RamDiscardManager: it is used to manage the memory
>> plug/unplug state, which differs somewhat from the shared/private state
>> that we need to track. See the "Open" section below for more details.
>>
>> Background
>> ==========
>> Confidential VMs have two classes of memory: shared and private memory.
>> Shared memory is accessible from the host/VMM while private memory is
>> not. Confidential VMs can decide which memory is shared/private and
>> convert memory between shared/private at runtime.
>>
>> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
>> private memory. The key differences between guest_memfd and normal memfd
>> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
>> cannot be mapped, read or written by userspace.
>>
>> In QEMU's implementation, shared memory is allocated with normal methods
>> (e.g. mmap or fallocate) while private memory is allocated from
>> guest_memfd. When a VM performs memory conversions, QEMU frees pages via
>> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
>> allocates new pages from the other side.
>>
>> Problem
>> =======
>> Device assignment in QEMU is implemented via the VFIO subsystem. In a
>> normal VM, guest memory is pinned up front by VFIO. In a confidential
>> VM, the guest can convert memory, and when that happens nothing
>> currently tells VFIO that its mappings are stale. This means that page
>> conversion leaks memory and leaves stale IOMMU mappings. For example, a
>> sequence like the following can result in stale IOMMU mappings:
>>
>> 1. allocate shared page
>> 2. convert page shared->private
>> 3. discard shared page
>> 4. convert page private->shared
>> 5. allocate shared page
>> 6. issue DMA operations against that shared page
>>
>> After step 3, VFIO still pins the page allocated in step 1. As a
>> result, DMA operations in step 6 hit the old mapping created in step 1,
>> so the device accesses stale data instead of the newly allocated shared
>> page.
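
The six-step sequence above can be reproduced in a toy model. This is only
an illustrative sketch, not QEMU or VFIO code: ToyIommu, run_sequence and
the notify_vfio flag are hypothetical names, and host pages are modelled
as plain integers. It shows that DMA lands on the old page unless VFIO is
told to unmap and remap on conversion.

```python
from itertools import count


class ToyIommu:
    """Toy IOMMU: maps an IOVA to the host page pinned at map time."""

    def __init__(self):
        self.mappings = {}

    def map(self, iova, host_page):
        self.mappings[iova] = host_page

    def unmap(self, iova):
        self.mappings.pop(iova, None)


def run_sequence(notify_vfio):
    """Replay steps 1-6; return (page hit by DMA, page the guest now uses)."""
    host_pages = count()             # hypothetical host page allocator
    iommu = ToyIommu()
    iova = 0x1000

    shared_page = next(host_pages)   # 1. allocate shared page
    iommu.map(iova, shared_page)     #    VFIO pins and maps it up front

    if notify_vfio:                  # 2. convert page shared->private
        iommu.unmap(iova)
    shared_page = None               # 3. discard shared page

    shared_page = next(host_pages)   # 4./5. convert back, allocate new page
    if notify_vfio:
        iommu.map(iova, shared_page)

    return iommu.mappings[iova], shared_page   # 6. DMA through the IOMMU


if __name__ == "__main__":
    for notify in (False, True):
        target, current = run_sequence(notify)
        print(f"notify_vfio={notify}: DMA hits page {target}, "
              f"guest uses page {current}")
```

Without notification the two returned pages differ (the stale mapping);
with notification they match.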
>>
>> Currently, commit 852f0048f3 ("RAMBlock: make guest_memfd require
>> uncoordinated discard") blocks device assignment with guest_memfd to
>> avoid this problem.
>>
>> Solution
>> ========
>> The key to enabling shared device assignment is solving the stale IOMMU
>> mappings problem.
>>
>> Given the constraints and assumptions above, here is a solution that
>> satisfies the use cases. RamDiscardManager, an existing interface
>> currently utilized by virtio-mem, offers a means to modify IOMMU
>> mappings in accordance with VM page assignment. Page conversion is
>> similar to hot-removing a page in one mode and adding it back in the
>> other.
>>
>> This series implements a RamDiscardManager for confidential VMs and
>> utilizes its infrastructure to notify VFIO of page conversions.
>>
>> Another possible attempt [2] was to not discard shared pages in step 3
>> above. This was an incomplete band-aid because guests would consume
>> twice the memory since shared pages wouldn't be freed even after they
>> were converted to private.
>>
>> Open
>> ====
>> Implementing a RamDiscardManager to notify VFIO of page conversions
>> changes its semantics: private memory is treated as discarded (or
>> hot-removed) memory. This isn't aligned with the expectations of current
>> RamDiscardManager users (e.g. VFIO or live migration), which expect
>> that discarded memory is hot-removed and can therefore be skipped when
>> they process guest memory. Treating private memory as discarded won't
>> work in the future if VFIO or live migration needs to handle private
>> memory. For example, VFIO may need to map private memory to support
>> Trusted IO, and live migration for confidential VMs needs to migrate
>> private memory.
>>
>> There are two possible ways to mitigate the semantic changes.
>> 1. Develop a new mechanism to notify users of page conversions between
>> private and shared. For example, utilize the notifier_list in QEMU: VFIO
>> registers its own handler and gets notified upon page conversions. This
>> is a clean approach which only touches the notifier workflow. A
>> challenge is that for device hotplug, existing shared memory must be
>> mapped in the IOMMU. This will need additional changes.
>>
>> 2. Extend the existing RamDiscardManager interface to manage not only
>> the discarded/populated status of guest memory but also the
>> shared/private status. RamDiscardManager users like VFIO will be
>> notified with one more argument indicating what change is happening and
>> can take action accordingly. This also has challenges. For example, QEMU
>> allows only one RamDiscardManager, so supporting virtio-mem for
>> confidential VMs would be a problem. In addition, some APIs exposed by
>> RamDiscardManager, such as .is_populated(), are meaningless for
>> shared/private memory and may need adjustments.
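
To make the hotplug challenge of option 1 concrete, here is a minimal
sketch of a conversion notifier list. It is not QEMU's notifier API:
ConversionNotifierList, ToyVfioListener and the replay flag are
hypothetical names, and pages are modelled as integer indices. The point
it illustrates is that a handler registered late (a hot-plugged device)
needs the already-shared pages replayed to it, otherwise it only sees
future conversions.

```python
class ConversionNotifierList:
    """Hypothetical notifier list tracking which pages are shared."""

    def __init__(self):
        self._handlers = []
        self._shared = set()   # indices of pages that are currently shared

    def register(self, handler, replay=True):
        """Add a handler; optionally replay pages that are already shared."""
        self._handlers.append(handler)
        if replay:
            for page in sorted(self._shared):
                handler(page, True)

    def convert(self, page, to_shared):
        """Record a conversion and notify every registered handler."""
        if to_shared:
            self._shared.add(page)
        else:
            self._shared.discard(page)
        for handler in self._handlers:
            handler(page, to_shared)


class ToyVfioListener:
    """Stand-in for VFIO: keeps IOMMU mappings for shared pages only."""

    def __init__(self):
        self.mapped = set()

    def __call__(self, page, to_shared):
        if to_shared:
            self.mapped.add(page)       # map newly shared page
        else:
            self.mapped.discard(page)   # unmap page that became private


if __name__ == "__main__":
    notifiers = ConversionNotifierList()
    notifiers.convert(3, True)
    notifiers.convert(5, True)
    notifiers.convert(3, False)

    hotplugged = ToyVfioListener()
    notifiers.register(hotplugged)      # replay maps page 5
    notifiers.convert(7, True)
    print(sorted(hotplugged.mapped))
```

Registering with replay=False leaves the hot-plugged listener without a
mapping for page 5, which is the gap the "additional changes" would have
to close.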
>>
>> Testing
>> =======
>> This patch series was tested on an internal TDX KVM/QEMU tree.
>>
>> To test shared device assignment with a NIC using the legacy type1
>> VFIO backend, use the QEMU command:
>>
>> qemu-system-x86_64 [...]
>>     -device vfio-pci,host=XX:XX.X
>>
>> The dma_entry_limit parameter of the vfio_iommu_type1 module needs to
>> be raised. For example, a 16GB guest needs a setting such as
>> vfio_iommu_type1.dma_entry_limit=4194304.
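
The value 4194304 matches the number of 4 KiB pages in a 16 GiB guest.
The short calculation below is a sketch under the assumption (not stated
above) that, in the worst case, each 4 KiB page ends up as its own DMA
mapping entry; worst_case_dma_entries is a hypothetical helper name.

```python
def worst_case_dma_entries(guest_bytes, page_size=4096):
    """Number of DMA entries if every page is mapped separately (assumed)."""
    # Round up so a partially filled last page still counts as one entry.
    return -(-guest_bytes // page_size)


if __name__ == "__main__":
    guest_bytes = 16 * 1024**3   # 16 GiB guest
    entries = worst_case_dma_entries(guest_bytes)
    print(f"vfio_iommu_type1.dma_entry_limit={entries}")
```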
>>
>> If using the iommufd-backed VFIO, use the QEMU command:
>>
>> qemu-system-x86_64 [...]
>>     -object iommufd,id=iommufd0 \
>>     -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
>>
>> In this case, no additional adjustment is required.
>>
>> Following the bootup of the TD guest, the guest's IP address becomes
>> visible, and iperf is able to successfully send and receive data.
>>
>> Related links
>> =============
>> [1] https://lore.kernel.org/all/d6acfbef-96a1-42bc-8866-c12a4de8c57c@xxxxxxxxxx/
>> [2] https://lore.kernel.org/all/20240320083945.991426-20-michael.roth@xxxxxxx/
>>
>> Chenyi Qiang (6):
>>   guest_memfd: Introduce an object to manage the guest-memfd with
>>     RamDiscardManager
>>   guest_memfd: Introduce a helper to notify the shared/private state
>>     change
>>   KVM: Notify the state change via RamDiscardManager helper during
>>     shared/private conversion
>>   memory: Register the RamDiscardManager instance upon guest_memfd
>>     creation
>>   guest-memfd: Default to discarded (private) in guest_memfd_manager
>>   RAMBlock: make guest_memfd require coordinate discard
>>
>>  accel/kvm/kvm-all.c                  |   7 +
>>  include/sysemu/guest-memfd-manager.h |  49 +++
>>  system/guest-memfd-manager.c         | 425 +++++++++++++++++++++++++++
>>  system/meson.build                   |   1 +
>>  system/physmem.c                     |  11 +-
>>  5 files changed, 492 insertions(+), 1 deletion(-)
>>  create mode 100644 include/sysemu/guest-memfd-manager.h
>>  create mode 100644 system/guest-memfd-manager.c
>>
>>
>> base-commit: 900536d3e97aed7fdd9cb4dadd3bf7023360e819




