Re: [RFC PATCH 0/6] Enable shared device assignment

Open
====
Implementing a RamDiscardManager to notify VFIO of page conversions
causes changes in semantics: private memory is treated as discarded (or
hot-removed) memory. This isn't aligned with the expectation of current
RamDiscardManager users (e.g. VFIO or live migration), who really
expect that discarded memory is hot-removed and thus can be skipped when
they are processing guest memory. Treating private memory as discarded
won't work in the future if VFIO or live migration needs to handle
private memory: e.g. VFIO may need to map private memory to support
Trusted IO, and live migration for confidential VMs needs to migrate
private memory.

"VFIO may need to map private memory to support Trusted IO"

I've been told that the way we handle shared memory won't be the way this is going to work with guest_memfd. KVM will coordinate directly with VFIO or $whatever and update the IOMMU tables itself right in the kernel; the pages are pinned/owned by guest_memfd, so that will just work. So I don't consider that a concern right now. guest_memfd private memory is not mapped into user page tables and, as it currently seems, it never will be.

Similarly: live migration. We cannot simply migrate that memory the traditional way. We even have to track the dirty state differently.

So IMHO, treating both kinds of memory as "discarded == don't touch it the usual way" might actually be a feature, not a bug ;)


There are two possible ways to mitigate the semantics changes.
1. Develop a new mechanism to notify about page conversions between
private and shared. For example, utilize the notifier_list in QEMU: VFIO
registers its own handler and gets notified upon page conversions. This
is a clean approach which only touches the notifier workflow. A
challenge is that for device hotplug, already-shared memory has to be
mapped in the IOMMU, which will need additional changes. (A rough
sketch of this option follows the list.)

2. Extend the existing RamDiscardManager interface to manage not only
the discarded/populated status of guest memory but also the
shared/private status. RamDiscardManager users like VFIO would be
notified with one more argument indicating what change is happening and
could take action accordingly. This also has challenges, e.g. QEMU
allows only one RamDiscardManager per memory region, so how to support
virtio-mem for confidential VMs would be a problem. And some APIs
exposed by RamDiscardManager, like .is_populated(), are meaningless for
shared/private memory, so they may need some adjustment. (See the
second sketch below.)
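
To make option 1 a bit more concrete, here is a minimal sketch of what
such a notification path could look like on top of QEMU's existing
NotifierList (include/qemu/notify.h). Only NotifierList and its helpers
are real; the event struct, the handler and the function names are
invented for illustration:

    #include "qemu/osdep.h"
    #include "qemu/notify.h"

    typedef struct ConversionEvent {
        uint64_t offset;   /* start of the converted range */
        uint64_t size;     /* length of the converted range */
        bool to_private;   /* shared -> private, or the reverse */
    } ConversionEvent;

    static NotifierList conversion_notifiers =
        NOTIFIER_LIST_INITIALIZER(conversion_notifiers);

    /* VFIO registers a handler like this and reacts to conversions. */
    static void vfio_conversion_handler(Notifier *n, void *data)
    {
        ConversionEvent *ev = data;

        if (ev->to_private) {
            /* unmap [offset, offset + size) from the IOMMU */
        } else {
            /* map [offset, offset + size) into the IOMMU */
        }
    }

    static Notifier vfio_conversion_notifier = {
        .notify = vfio_conversion_handler,
    };

    /* Called from the page-conversion path: */
    static void notify_conversion(uint64_t offset, uint64_t size,
                                  bool to_private)
    {
        ConversionEvent ev = { offset, size, to_private };
        notifier_list_notify(&conversion_notifiers, &ev);
    }

VFIO would call notifier_list_add(&conversion_notifiers,
&vfio_conversion_notifier) at realize time; the hotplug gap mentioned
above is that a newly registered handler additionally needs a replay of
the ranges that are already shared.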
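
And a similarly rough sketch of option 2. Upstream RamDiscardListener
(include/exec/memory.h) today only has notify_populate()/notify_discard()
callbacks; the enum, the extra parameter and the two helpers below are
made up to show the idea:

    typedef enum RamStateChange {
        RAM_STATE_POPULATED,   /* plugged and shared: map into the IOMMU */
        RAM_STATE_DISCARDED,   /* hot-removed: unmap, skip when processing */
        RAM_STATE_PRIVATE,     /* converted to private: unmap, but not
                                * hot-removed */
    } RamStateChange;

    /* VFIO's listener callback could then dispatch on the reason: */
    static int vfio_ram_state_change(RamDiscardListener *rdl,
                                     MemoryRegionSection *section,
                                     RamStateChange change)
    {
        switch (change) {
        case RAM_STATE_POPULATED:
            return vfio_dma_map_section(rdl, section);   /* hypothetical */
        case RAM_STATE_DISCARDED:
        case RAM_STATE_PRIVATE:
            return vfio_dma_unmap_section(rdl, section); /* hypothetical */
        }
        return 0;
    }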

Think of all of that in terms of "shared memory is populated, private memory is some inaccessible stuff that needs very special handling and other means for device assignment, live migration, etc.". Then it actually quite makes sense to use RamDiscardManager (AFAIKS :) ).


Testing
=======
This patch series was tested on top of the internal TDX KVM/QEMU tree.

To facilitate shared device assignment with a NIC, employ the legacy
VFIO type1 backend with the QEMU command:

qemu-system-x86_64 [...]
     -device vfio-pci,host=XX:XX.X
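
For context, since the series targets TDX guests: in that tree the
guest itself is started as a confidential VM, along the lines of the
following (the tdx-guest object syntax is an assumption from the TDX
series and may differ between tree versions):

qemu-system-x86_64 [...] \
     -object tdx-guest,id=tdx0 \
     -machine q35,confidential-guest-support=tdx0 \
     -device vfio-pci,host=XX:XX.X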

The vfio_iommu_type1 module parameter dma_entry_limit needs to be
raised. For example, a 16GB guest needs something like
vfio_iommu_type1.dma_entry_limit=4194304.
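
To spell out the arithmetic behind that value (assuming, in the worst
case, that the guest converts all of its memory to shared and every
mapping ends up at 4 KiB granularity):

    16 GiB / 4 KiB = 17179869184 / 4096 = 4194304 mappings

i.e. one type1 DMA mapping entry per 4 KiB page, far above the module's
default limit (U16_MAX, i.e. 65535).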

But here you note the biggest real issue I see (not related to
RamDiscardManager, but to the fact that we have to prepare for
conversion of each possible private page to shared and back): we need
an individual IOMMU mapping for each 4 KiB page.

Doesn't that mean that we limit shared memory to 4194304 * 4096 == 16
GiB? Does it even scale then?


There is the alternative of having in-place private/shared conversion,
where we also let guest_memfd manage some shared memory. It has plenty
of downsides, but for the problem at hand it would mean that we don't
discard on shared/private conversion.

But whenever we want to convert memory shared->private we would
similarly have to unmap it from the IOMMU page tables via VFIO. (The
in-place conversion will only be allowed if any additional references
on a page are gone -- when it is inaccessible to userspace/kernel.)
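
For reference, with the legacy type1 backend that unmap boils down to a
VFIO_IOMMU_UNMAP_DMA ioctl on the container fd. A minimal userspace
sketch; the uAPI struct and the ioctl are real (linux/vfio.h), while
the helper name and the trimmed error handling are mine:

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Unmap a range that is being converted shared -> private.
     * container_fd is an open, configured VFIO container; iova/size
     * describe the converted range and must match how it was mapped. */
    static int unmap_converted_range(int container_fd,
                                     unsigned long long iova,
                                     unsigned long long size)
    {
        struct vfio_iommu_type1_dma_unmap unmap;

        memset(&unmap, 0, sizeof(unmap));
        unmap.argsz = sizeof(unmap);
        unmap.iova = iova;
        unmap.size = size;

        return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
    }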

Again, if the IOMMU page tables were managed by KVM in the kernel,
without user space intervention/VFIO, this would work with device
assignment just fine. But I guess it will take a while until we
actually have that option.

--
Cheers,

David / dhildenb




