On 26.07.24 08:20, Chenyi Qiang wrote:
On 7/25/2024 10:04 PM, David Hildenbrand wrote:
Open
====
Implementing a RamDiscardManager to notify VFIO of page conversions
causes changes in semantics: private memory is treated as discarded (or
hot-removed) memory. This isn't aligned with the expectations of current
RamDiscardManager users (e.g. VFIO or live migration), who expect that
discarded memory is hot-removed and thus can be skipped when
the users are processing guest memory. Treating private memory as
discarded won't work in the future if VFIO or live migration needs to
handle private memory, e.g., VFIO may need to map private memory to
support Trusted IO, and live migration for confidential VMs needs to
migrate private memory.
"VFIO may need to map private memory to support Trusted IO"
I've been told that the way we handle shared memory won't be the way
this is going to work with guest_memfd. KVM will coordinate directly
with VFIO or $whatever and update the IOMMU tables itself right in the
kernel; the pages are pinned/owned by guest_memfd, so that will just
work. So I don't consider that currently a concern. guest_memfd private
memory is not mapped into user page tables and as it currently seems it
never will be.
That's correct. AFAIK, some TEE IO solutions like TDX Connect would let
the kernel coordinate and update private mappings in the IOMMU tables.
Here, it mentions that VFIO "may" need to map private memory. I want to
keep this more generic to account for potential future TEE IO solutions
that may require such functionality. :)
Careful not to over-engineer something that is not even real or
close-to-real yet, though. :) Nobody really knows how that will look,
besides that we know for Intel that we won't need it.
Similarly: live migration. We cannot simply migrate that memory the
traditional way. We even have to track the dirty state differently.
So IMHO, treating both kinds of memory as discarded == "don't touch it
the usual way" might actually be a feature, not a bug ;)
Do you mean treating the private memory in both VFIO and live migration
as discarded? That is what this patch series does. And as you mentioned,
these RDM users cannot follow the traditional RDM way. Because of this,
we also considered whether we should use RDM or a more generic mechanism
like notifier_list below.
Yes, the shared memory is logically discarded. At the same time we
*might* get private memory effectively populated. See my reply to Kevin
that there might be ways of having shared vs. private populate/discard
in the future, if required. Just some idea, though.
There are two possible ways to mitigate the change in semantics.
1. Develop a new mechanism to notify the page conversions between
private and shared. For example, utilize the notifier_list in QEMU. VFIO
registers its own handler and gets notified upon page conversions. This
is a clean approach that only touches the notifier workflow. A
challenge is that for device hotplug, existing shared memory has to be
mapped in the IOMMU. This will need additional changes.
2. Extend the existing RamDiscardManager interface to manage not only
the discarded/populated status of guest memory but also the
shared/private status. RamDiscardManager users like VFIO will be
notified with one more argument indicating what change is happening and
can take action accordingly. It also has challenges, e.g., QEMU allows
only one RamDiscardManager, so how to support virtio-mem for confidential
VMs would be a problem. And some APIs exposed by RamDiscardManager, like
.is_populated(), are meaningless for shared/private memory, so they may
need some adjustments.
Think of all of that in terms of "shared memory is populated, private
memory is some inaccessible stuff that needs very special handling and
other means for device assignment, live migration, etc.". Then it
actually makes quite a lot of sense to use RamDiscardManager (AFAIKS :) ).
Yes, such a notification mechanism is what we want. But for the users of
RDM, it would require additional changes accordingly. Current users just
skip inaccessible stuff, but in the private memory case, it can't simply
be skipped. Maybe renaming RamDiscardManager to RamStateManager would be
more accurate then. :)
Current users must skip it, yes. How private memory would have to be
handled, and who would handle it, is rather unclear.
Again, maybe we'd want separate RamDiscardManagers for private and shared
memory (after all, these are two separate memory backends).
Not sure that "RamStateManager" terminology would be reasonable in that
approach.
Testing
=======
This patch series is tested based on the internal TDX KVM/QEMU tree.
To facilitate shared device assignment with the NIC, employ the legacy
type1 VFIO with the QEMU command:
qemu-system-x86_64 [...]
-device vfio-pci,host=XX:XX.X
The vfio_iommu_type1 module parameter dma_entry_limit needs to be
adjusted. For example, a 16 GiB guest needs something like
vfio_iommu_type1.dma_entry_limit=4194304.
But here you note the biggest real issue I see (not related to
RAMDiscardManager, but that we have to prepare for conversion of each
possible private page to shared and back): we need a single IOMMU
mapping for each 4 KiB page.
Doesn't that mean that we limit shared memory to 4194304 * 4096 == 16 GiB?
Does it even scale then?
The entry limit needs to be increased as the guest memory size
increases. For this issue, are you concerned that having too many
entries might bring some performance issues? Maybe we could introduce
some PV mechanism to coordinate with the guest to convert memory only at
2M granularity. This may help mitigate the problem.
I've had this talk with Intel, because the 4K granularity is a pain. I
was told that ship has sailed ... and we have to cope with random 4K
conversions :(
The many mappings will likely add both memory and runtime overheads in
the kernel. But we only know once we measure.
Key point is that even 4194304 "only" allows for 16 GiB. Imagine 1 TiB
of shared memory :/
There is the alternative of having in-place private/shared conversion
when we also let guest_memfd manage some shared memory. It has plenty of
downsides, but for the problem at hand it would mean that we don't
discard on shared/private conversion.
But whenever we want to convert memory shared->private we would
similarly have to unmap it from the IOMMU page tables via VFIO. (The
in-place conversion will only be allowed if any additional references on
a page are gone -- when it is inaccessible by userspace/kernel.)
I'm not clear about this in-place private/shared conversion. Can you
elaborate a little bit? It seems this alternative changes the private
and shared management in the current guest_memfd?
Yes, there have been discussions about that, also in the context of
supporting huge pages while allowing for the guest to still convert
individual 4K chunks ...
A summary is here [1]. Likely more things will be covered at Linux Plumbers.
[1]
https://lore.kernel.org/kvm/20240712232937.2861788-1-ackerleytng@xxxxxxxxxx/
--
Cheers,
David / dhildenb