On Tue, Oct 08, 2024 at 04:59:45PM +0800, Chenyi Qiang wrote:
> Hi Paolo,
>
> Kindly ping for this thread. The in-place page conversion was discussed
> at Linux Plumbers. Does it give some direction for the shared device
> assignment enabling work?
>

Hi everybody. Our NVIDIA GPUs currently support this
shared-memory/bounce-buffer method to provide AI acceleration within TEE
CVMs. We require passing through the GPU via VFIO stubbing, which means
that we are impacted by the absence of an API to inform VFIO about page
conversions.

The CSPs have enough kernel engineers to handle this process in their own
host kernels, but we have several enterprise customers who are eager to
begin using this solution from upstream. AMD has successfully ported
enough of the SEV-SNP support into 6.11, and our initial testing shows
successful operation, but only by disabling discard via these two QEMU
patches:

- https://github.com/AMDESE/qemu/commit/0c9ae28d3e199de9a40876a492e0f03a11c6f5d8
- https://github.com/AMDESE/qemu/commit/5256c41fb3055961ea7ac368acc0b86a6632d095

This "workaround" is a bit of a hack, as it effectively requires more than
double the amount of host memory to be allocated to the guest CVM. The
proposal here appears to be a promising workaround; are there other
solutions recommended for this use case?

This configuration is in GA right now, and NVIDIA is committed to
supporting and testing this bounce-buffer mailbox solution for many years
into the future, so we're highly invested in seeing a converged solution
upstream.

Thanks,
Rob

> Thanks
> Chenyi
>
> On 8/16/2024 11:02 AM, Chenyi Qiang wrote:
> > Hi Paolo,
> >
> > Hope to draw your attention. As TEE I/O would depend on shared device
> > assignment, we introduced this RDM solution in QEMU. Now, given the
> > in-place private/shared conversion option mentioned by David, do you
> > think we should continue to add pass-through support for this in-QEMU
> > page conversion method? Or wait for that discussion to see whether it
> > will change to in-kernel conversion.
> >
> > Thanks
> > Chenyi
> >
> > On 7/25/2024 3:21 PM, Chenyi Qiang wrote:
> >> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
> >> discard") effectively disables device assignment with guest_memfd.
> >> guest_memfd is required for confidential guests, so device assignment
> >> to confidential guests is disabled. A supporting assumption for
> >> disabling device assignment was that TEE I/O (SEV-TIO, TDX Connect,
> >> COVE-IO, etc.) solves the confidential-guest device-assignment
> >> problem [1]. That turns out not to be the case, because TEE I/O
> >> depends on being able to operate devices against "shared"/untrusted
> >> memory for device initialization and error recovery scenarios.
> >>
> >> This series utilizes an existing framework named RamDiscardManager to
> >> notify VFIO of page conversions. However, there is still one concern
> >> related to the semantics of RamDiscardManager, which is used to
> >> manage the memory plug/unplug state. This is a little different from
> >> the shared/private memory state in our requirement. See the "Open"
> >> section below for more details.
> >>
> >> Background
> >> ==========
> >> Confidential VMs have two classes of memory: shared and private
> >> memory. Shared memory is accessible from the host/VMM while private
> >> memory is not. Confidential VMs can decide which memory is
> >> shared/private and convert memory between shared/private at runtime.
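> >>
> >> As a rough illustration only (assuming a host kernel that exposes the
> >> KVM_SET_MEMORY_ATTRIBUTES ioctl; the helper name below is made up),
> >> the host-side part of such a conversion boils down to flipping a
> >> memory attribute on the VM fd:
> >>
> >>   #include <linux/kvm.h>
> >>   #include <sys/ioctl.h>
> >>   #include <stdbool.h>
> >>   #include <stdint.h>
> >>
> >>   /* Mark [gpa, gpa + size) private, or shared when to_private is
> >>    * false. Both gpa and size must be page aligned. */
> >>   static int demo_set_range_private(int vm_fd, uint64_t gpa,
> >>                                     uint64_t size, bool to_private)
> >>   {
> >>       struct kvm_memory_attributes attrs = {
> >>           .address    = gpa,
> >>           .size       = size,
> >>           .attributes = to_private ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
> >>       };
> >>
> >>       return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
> >>   }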
> >> > >> "guest_memfd" is a new kind of fd whose primary goal is to serve guest > >> private memory. The key differences between guest_memfd and normal memfd > >> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and > >> cannot be mapped, read or written by userspace. > >> > >> In QEMU's implementation, shared memory is allocated with normal methods > >> (e.g. mmap or fallocate) while private memory is allocated from > >> guest_memfd. When a VM performs memory conversions, QEMU frees pages via > >> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and > >> allocates new pages from the other side. > >> > >> Problem > >> ======= > >> Device assignment in QEMU is implemented via VFIO system. In the normal > >> VM, VM memory is pinned at the beginning of time by VFIO. In the > >> confidential VM, the VM can convert memory and when that happens > >> nothing currently tells VFIO that its mappings are stale. This means > >> that page conversion leaks memory and leaves stale IOMMU mappings. For > >> example, sequence like the following can result in stale IOMMU mappings: > >> > >> 1. allocate shared page > >> 2. convert page shared->private > >> 3. discard shared page > >> 4. convert page private->shared > >> 5. allocate shared page > >> 6. issue DMA operations against that shared page > >> > >> After step 3, VFIO is still pinning the page. However, DMA operations in > >> step 6 will hit the old mapping that was allocated in step 1, which > >> causes the device to access the invalid data. > >> > >> Currently, the commit 852f0048f3 ("RAMBlock: make guest_memfd require > >> uncoordinated discard") has blocked the device assignment with > >> guest_memfd to avoid this problem. > >> > >> Solution > >> ======== > >> The key to enable shared device assignment is to solve the stale IOMMU > >> mappings problem. > >> > >> Given the constraints and assumptions here is a solution that satisfied > >> the use cases. RamDiscardManager, an existing interface currently > >> utilized by virtio-mem, offers a means to modify IOMMU mappings in > >> accordance with VM page assignment. Page conversion is similar to > >> hot-removing a page in one mode and adding it back in the other. > >> > >> This series implements a RamDiscardManager for confidential VMs and > >> utilizes its infrastructure to notify VFIO of page conversions. > >> > >> Another possible attempt [2] was to not discard shared pages in step 3 > >> above. This was an incomplete band-aid because guests would consume > >> twice the memory since shared pages wouldn't be freed even after they > >> were converted to private. > >> > >> Open > >> ==== > >> Implementing a RamDiscardManager to notify VFIO of page conversions > >> causes changes in semantics: private memory is treated as discarded (or > >> hot-removed) memory. This isn't aligned with the expectation of current > >> RamDiscardManager users (e.g. VFIO or live migration) who really > >> expect that discarded memory is hot-removed and thus can be skipped when > >> the users are processing guest memory. Treating private memory as > >> discarded won't work in future if VFIO or live migration needs to handle > >> private memory. e.g. VFIO may need to map private memory to support > >> Trusted IO and live migration for confidential VMs need to migrate > >> private memory. > >> > >> There are two possible ways to mitigate the semantics changes. > >> 1. Develop a new mechanism to notify the page conversions between > >> private and shared. 
> >>
> >> There are two possible ways to mitigate the semantics changes.
> >>
> >> 1. Develop a new mechanism to notify of the page conversions between
> >> private and shared. For example, utilize the notifier_list in QEMU.
> >> VFIO registers its own handler and gets notified upon page
> >> conversions. This is a clean approach which only touches the notifier
> >> workflow. A challenge is that, for device hotplug, existing shared
> >> memory should be mapped in the IOMMU. This would need additional
> >> changes.
> >>
> >> 2. Extend the existing RamDiscardManager interface to manage not only
> >> the discarded/populated status of guest memory but also the
> >> shared/private status. RamDiscardManager users like VFIO would be
> >> notified with one more argument indicating what change is happening
> >> and could take action accordingly. This also has challenges, e.g.
> >> QEMU allows only one RamDiscardManager, so how to support virtio-mem
> >> for confidential VMs would be a problem. And some APIs exposed by
> >> RamDiscardManager, like .is_populated(), are meaningless for
> >> shared/private memory, so they may need some adjustment.
> >>
> >> Testing
> >> =======
> >> This patch series is tested based on the internal TDX KVM/QEMU tree.
> >>
> >> To facilitate shared device assignment with the NIC, employ the
> >> legacy type1 VFIO with the QEMU command:
> >>
> >> qemu-system-x86_64 [...]
> >>     -device vfio-pci,host=XX:XX.X
> >>
> >> The dma_entry_limit parameter needs to be adjusted. For example, a
> >> 16GB guest needs to raise it to something like
> >> vfio_iommu_type1.dma_entry_limit=4194304.
> >>
> >> If using the iommufd-backed VFIO, use the QEMU command:
> >>
> >> qemu-system-x86_64 [...]
> >>     -object iommufd,id=iommufd0 \
> >>     -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
> >>
> >> No additional adjustment is required.
> >>
> >> Following the bootup of the TD guest, the guest's IP address becomes
> >> visible, and iperf is able to successfully send and receive data.
> >>
> >> Related link
> >> ============
> >> [1] https://lore.kernel.org/all/d6acfbef-96a1-42bc-8866-c12a4de8c57c@xxxxxxxxxx/
> >> [2] https://lore.kernel.org/all/20240320083945.991426-20-michael.roth@xxxxxxx/
> >>
> >> Chenyi Qiang (6):
> >>   guest_memfd: Introduce an object to manage the guest-memfd with
> >>     RamDiscardManager
> >>   guest_memfd: Introduce a helper to notify the shared/private state
> >>     change
> >>   KVM: Notify the state change via RamDiscardManager helper during
> >>     shared/private conversion
> >>   memory: Register the RamDiscardManager instance upon guest_memfd
> >>     creation
> >>   guest-memfd: Default to discarded (private) in guest_memfd_manager
> >>   RAMBlock: make guest_memfd require coordinate discard
> >>
> >>  accel/kvm/kvm-all.c                  |   7 +
> >>  include/sysemu/guest-memfd-manager.h |  49 +++
> >>  system/guest-memfd-manager.c         | 425 +++++++++++++++++++++++++++
> >>  system/meson.build                   |   1 +
> >>  system/physmem.c                     |  11 +-
> >>  5 files changed, 492 insertions(+), 1 deletion(-)
> >>  create mode 100644 include/sysemu/guest-memfd-manager.h
> >>  create mode 100644 system/guest-memfd-manager.c
> >>
> >>
> >> base-commit: 900536d3e97aed7fdd9cb4dadd3bf7023360e819
> >