On 10/1/25 17:38, Chenyi Qiang wrote:
On 1/10/2025 8:58 AM, Alexey Kardashevskiy wrote:
On 9/1/25 15:29, Chenyi Qiang wrote:
On 1/9/2025 10:55 AM, Alexey Kardashevskiy wrote:
On 9/1/25 13:11, Chenyi Qiang wrote:
On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
On 8/1/25 21:56, Chenyi Qiang wrote:
On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
On 13/12/24 18:08, Chenyi Qiang wrote:
As the commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated discard") highlighted, some subsystems like VFIO might disable ram block discard. However, guest_memfd relies on the discard operation to perform page conversion between private and shared memory. This can lead to stale IOMMU mappings when assigning a hardware device to a confidential VM via shared memory (unprotected memory pages). Blocking shared page discard can solve this problem, but it could cause guests to consume twice the memory with VFIO, which is not acceptable in some cases. An alternative solution is to notify other systems like VFIO to refresh their outdated IOMMU mappings.
RamDiscardManager is an existing concept (used by virtio-mem) to adjust VFIO mappings in relation to VM page assignment. Effectively, page conversion is similar to hot-removing a page in one mode and adding it back in the other, so the same work that needs to happen in response to virtio-mem changes needs to happen for page conversion events. Introduce the RamDiscardManager to guest_memfd to achieve this.
However, guest_memfd is not an object, so it cannot directly implement the RamDiscardManager interface.
One solution is to implement the interface in HostMemoryBackend. Any
This sounds about right.
btw I am using this for ages:
https://github.com/aik/qemu/commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
but I am not sure if this ever saw the light of the day, did not it?
(ironically I am using it as a base for encrypted DMA :) )
Yeah, we are doing the same work. I saw a solution from Michael a long time ago (when there was still a dedicated hostmem-memfd-private backend for restrictedmem/gmem):
https://github.com/AMDESE/qemu/commit/3bf5255fc48d648724d66410485081ace41d8ee6
Your patch only implements the interface for HostMemoryBackendMemfd. Maybe it is more appropriate to implement it for the parent object HostMemoryBackend, because besides MEMORY_BACKEND_MEMFD, other backend types like MEMORY_BACKEND_RAM and MEMORY_BACKEND_FILE can also be guest_memfd-backed.
Thinking more about where to implement this interface, it is still uncertain to me. As I mentioned in another mail, a ram device memory region might be backed by guest_memfd if we support TEE-IO iommufd MMIO in the future. Then a specific object would be more appropriate. What's your opinion?
I do not know about this. Unlike RAM, MMIO can only do "in-place conversion", the interface to do so is not straightforward, and VFIO owns MMIO anyway, so the uAPI will be in iommufd; here is a gist of it:
https://github.com/aik/linux/commit/89e45c0404fa5006b2a4de33a4d582adf1ba9831
"guest request" is a communication channel from the VM to the secure FW (AMD's "PSP") to make MMIO allow encrypted access.
guest_memfd-backed host memory backend can register itself in the target MemoryRegion. However, this solution doesn't cover the scenario where a guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g. the virtual BIOS MemoryRegion.
What is this virtual BIOS MemoryRegion exactly? What does it look like in "info mtree -f"? Do we really want this memory to be DMAable?
virtual BIOS shows in a separate region:
Root memory region: system
0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
...
00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
Looks like a normal MR which can be backed by guest_memfd.
Yes, the virtual BIOS memory region is initialized by memory_region_init_ram_guest_memfd(), which will be backed by a guest_memfd.
The tricky thing is, for Intel TDX (not sure about AMD SEV), the virtual BIOS image will be loaded and then copied to the private region. After that, the loaded image will be discarded and this region becomes useless.
I'd think it is loaded as "struct Rom" and then copied to the MR-ram_guest_memfd(), which does not leave the MR useless - we still see "pc.bios" in the list, so it is not discarded. What piece of code are you referring to exactly?
Sorry for the confusion, maybe it is different between TDX and SEV-SNP for the vBIOS handling. In x86_bios_rom_init(), it initializes a guest_memfd-backed MR and loads the vBIOS image into the shared part of the guest_memfd MR. For TDX, it will copy the image to a private region (not the vBIOS guest_memfd MR private part) and discard the shared part. So, although the memory region still exists, it seems useless.
It is different for SEV-SNP, correct? Does SEV-SNP manage the vBIOS in vBIOS guest_memfd private memory?
This is what it looks like on my SNP VM (which, I suspect, is the same as yours, as hw/i386/pc.c does not distinguish Intel/AMD for this matter):
Yes, the memory region object is created on both TDX and SEV-SNP.
Root memory region: system
0000000000000000-00000000000bffff (prio 0, ram): ram1 KVM gmemfd=20
00000000000c0000-00000000000dffff (prio 1, ram): pc.rom KVM gmemfd=27
00000000000e0000-000000001fffffff (prio 0, ram): ram1 @00000000000e0000 KVM gmemfd=20
...
00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM gmemfd=26
So the pc.bios MR exists and is in use (hence its appearance in "info mtree -f").
I added the gmemfd dumping:

--- a/system/memory.c
+++ b/system/memory.c
@@ -3446,6 +3446,9 @@ static void mtree_print_flatview(gpointer key, gpointer value,
                 }
             }
         }
+        if (mr->ram_block && mr->ram_block->guest_memfd >= 0) {
+            qemu_printf(" gmemfd=%d", mr->ram_block->guest_memfd);
+        }
Then I think the virtual BIOS is another case not belonging to HostMemoryBackend, which convinces us to implement the interface in a specific object, no?
TBH I have no idea why pc.rom and pc.bios are separate memory regions, but in any case, why do these two areas need to be treated any differently than the rest of RAM? Thanks,
So I feel like this virtual BIOS should not be backed by guest_memfd?
From the above it sounds like the opposite, i.e. it should :)
0000000100000000-000000017fffffff (prio 0, ram): pc.ram @0000000080000000 KVM
Anyway, if there is no guest_memfd backing it and memory_region_has_ram_discard_manager() returns false, then the MR is just going to be mapped for VFIO as usual, which seems... alright, right?
Correct. As the vBIOS is backed by guest_memfd and we implement the RDM for guest_memfd_manager, the vBIOS MR won't be mapped by VFIO.
If we go with the HostMemoryBackend instead of guest_memfd_manager, this MR would be mapped by VFIO. Maybe we need to avoid such a vBIOS mapping, or just ignore it since the MR is useless (but that looks not so good).
Sorry I am missing necessary details here, let's figure out the above.
We also considered implementing the interface in HostMemoryBackend, but maybe implementing it with the guest_memfd region is more general. We don't know if any DMAable memory would belong to HostMemoryBackend, although at present it does.
If it is more appropriate to implement it with HostMemoryBackend, I can change to this way.
Seems cleaner imho.
I can go this way.
[...]
+
+static int guest_memfd_rdm_replay_populated(const RamDiscardManager *rdm,
+                                            MemoryRegionSection *section,
+                                            ReplayRamPopulate replay_fn,
+                                            void *opaque)
+{
+    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
+
+    g_assert(section->mr == gmm->mr);
+    return guest_memfd_for_each_populated_section(gmm, section, &data,
+                                                  guest_memfd_rdm_replay_populated_cb);
+}
+
+static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection *section, void *arg)
+{
+    struct GuestMemfdReplayData *data = arg;
+    ReplayRamDiscard replay_fn = data->fn;
+
+    replay_fn(section, data->opaque);
guest_memfd_rdm_replay_populated_cb() checks for errors though.
It follows the current definition of ReplayRamDiscard() and ReplayRamPopulate(), where replay_discard() doesn't return errors and replay_populate() does.
A trace would be appropriate imho. Thanks,
Sorry, I can't catch you. What kind of info should be traced? The errors returned by replay_populate()?
Yeah. imho these are useful as we expect this part to work in general
too, right? Thanks,
Something like?

diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
index 6b3e1ee9d6..4440ac9e59 100644
--- a/system/guest-memfd-manager.c
+++ b/system/guest-memfd-manager.c
@@ -185,8 +185,14 @@ static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, voi
 {
     struct GuestMemfdReplayData *data = arg;
     ReplayRamPopulate replay_fn = data->fn;
+    int ret;

-    return replay_fn(section, data->opaque);
+    ret = replay_fn(section, data->opaque);
+    if (ret) {
+        trace_guest_memfd_rdm_replay_populated_cb(ret);
+    }
+
+    return ret;
 }
How about just adding some error output in guest_memfd_for_each_populated_section()/guest_memfd_for_each_discarded_section() if the cb() (i.e. replay_populate()) returns an error?
this will do too, yes. Thanks,
--
Alexey