On 9/1/25 15:29, Chenyi Qiang wrote:
On 1/9/2025 10:55 AM, Alexey Kardashevskiy wrote:
On 9/1/25 13:11, Chenyi Qiang wrote:
On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
On 8/1/25 21:56, Chenyi Qiang wrote:
On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
On 13/12/24 18:08, Chenyi Qiang wrote:
As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
uncoordinated discard") highlighted, some subsystems like VFIO might
disable ram block discard. However, guest_memfd relies on the discard
operation to perform page conversion between private and shared
memory.
This can lead to stale IOMMU mapping issue when assigning a hardware
device to a confidential VM via shared memory (unprotected memory
pages). Blocking shared page discard can solve this problem, but it
could cause guests to consume twice the memory with VFIO, which is
not
acceptable in some cases. An alternative solution is to convey other
systems like VFIO to refresh its outdated IOMMU mappings.
RamDiscardManager is an existing concept (used by virtio-mem) to
adjust
VFIO mappings in relation to VM page assignment. Effectively page
conversion is similar to hot-removing a page in one mode and
adding it
back in the other, so the similar work that needs to happen in
response
to virtio-mem changes needs to happen for page conversion events.
Introduce the RamDiscardManager to guest_memfd to achieve it.
However, guest_memfd is not an object so it cannot directly implement
the RamDiscardManager interface.
One solution is to implement the interface in HostMemoryBackend. Any
This sounds about right.
btw I am using this for ages:
https://github.com/aik/qemu/commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
but I am not sure if this ever saw the light of the day, did not it?
(ironically I am using it as a base for encrypted DMA :) )
guest_memfd-backed host memory backend can register itself in the
target
MemoryRegion. However, this solution doesn't cover the scenario
where a
guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend,
e.g.
the virtual BIOS MemoryRegion.
What is this virtual BIOS MemoryRegion exactly? What does it look like
in "info mtree -f"? Do we really want this memory to be DMAable?
virtual BIOS shows in a separate region:
Root memory region: system
0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
...
00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
Looks like a normal MR which can be backed by guest_memfd.
Yes, virtual BIOS memory region is initialized by
memory_region_init_ram_guest_memfd() which will be backed by a
guest_memfd.
The tricky thing is, for Intel TDX (not sure about AMD SEV), the virtual
BIOS image will be loaded and then copied to private region.
After that,
the loaded image will be discarded and this region become useless.
I'd think it is loaded as "struct Rom" and then copied to the MR-
ram_guest_memfd() which does not leave MR useless - we still see
"pc.bios" in the list so it is not discarded. What piece of code are you
referring to exactly?
Sorry for confusion, maybe it is different between TDX and SEV-SNP for
the vBIOS handling.
In x86_bios_rom_init(), it initializes a guest_memfd-backed MR and loads
the vBIOS image to the shared part of the guest_memfd MR.
For TDX, it
will copy the image to private region (not the vBIOS guest_memfd MR
private part) and discard the shared part. So, although the memory
region still exists, it seems useless.
It is different for SEV-SNP, correct? Does SEV-SNP manage the vBIOS in
vBIOS guest_memfd private memory?
This is what it looks like on my SNP VM (which, I suspect, is the same
as yours as hw/i386/pc.c does not distinguish Intel/AMD for this matter):
Root memory region: system
0000000000000000-00000000000bffff (prio 0, ram): ram1 KVM gmemfd=20
00000000000c0000-00000000000dffff (prio 1, ram): pc.rom KVM gmemfd=27
00000000000e0000-000000001fffffff (prio 0, ram): ram1
@00000000000e0000 KVM gmemfd=20
...
00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM gmemfd=26
So the pc.bios MR exists and in use (hence its appearance in "info mtree
-f").
I added the gmemfd dumping:
--- a/system/memory.c
+++ b/system/memory.c
@@ -3446,6 +3446,9 @@ static void mtree_print_flatview(gpointer key,
gpointer value,
}
}
}
+ if (mr->ram_block && mr->ram_block->guest_memfd >= 0) {
+ qemu_printf(" gmemfd=%d", mr->ram_block->guest_memfd);
+ }
So I
feel like this virtual BIOS should not be backed by guest_memfd?
From the above it sounds like the opposite, i.e. it should :)
0000000100000000-000000017fffffff (prio 0, ram): pc.ram
@0000000080000000 KVM
Anyway if there is no guest_memfd backing it and
memory_region_has_ram_discard_manager() returns false, then the MR is
just going to be mapped for VFIO as usual which seems... alright, right?
Correct. As the vBIOS is backed by guest_memfd and we implement the RDM
for guest_memfd_manager, the vBIOS MR won't be mapped by VFIO.
If we go with the HostMemoryBackend instead of guest_memfd_manager, this
MR would be mapped by VFIO. Maybe need to avoid such vBIOS mapping, or
just ignore it since the MR is useless (but looks not so good).
Sorry I am missing necessary details here, let's figure out the above.
We also consider to implement the interface in HostMemoryBackend, but
maybe implement with guest_memfd region is more general. We don't know
if any DMAable memory would belong to HostMemoryBackend although at
present it is.
If it is more appropriate to implement it with HostMemoryBackend, I can
change to this way.
Seems cleaner imho.
I can go this way.
[...]
+
+static int guest_memfd_rdm_replay_populated(const RamDiscardManager
*rdm,
+ MemoryRegionSection
*section,
+ ReplayRamPopulate
replay_fn,
+ void *opaque)
+{
+ GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+ struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
opaque };
+
+ g_assert(section->mr == gmm->mr);
+ return guest_memfd_for_each_populated_section(gmm, section,
&data,
+
guest_memfd_rdm_replay_populated_cb);
+}
+
+static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection
*section, void *arg)
+{
+ struct GuestMemfdReplayData *data = arg;
+ ReplayRamDiscard replay_fn = data->fn;
+
+ replay_fn(section, data->opaque);
guest_memfd_rdm_replay_populated_cb() checks for errors though.
It follows current definiton of ReplayRamDiscard() and
ReplayRamPopulate() where replay_discard() doesn't return errors and
replay_populate() returns errors.
A trace would be appropriate imho. Thanks,
Sorry, can't catch you. What kind of info to be traced? The errors
returned by replay_populate()?
Yeah. imho these are useful as we expect this part to work in general
too, right? Thanks,
Something like?
diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
index 6b3e1ee9d6..4440ac9e59 100644
--- a/system/guest-memfd-manager.c
+++ b/system/guest-memfd-manager.c
@@ -185,8 +185,14 @@ static int
guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, voi
{
struct GuestMemfdReplayData *data = arg;
ReplayRamPopulate replay_fn = data->fn;
+ int ret;
- return replay_fn(section, data->opaque);
+ ret = replay_fn(section, data->opaque);
+ if (ret) {
+ trace_guest_memfd_rdm_replay_populated_cb(ret);
+ }
+
+ return ret;
}
How about just adding some error output in
guest_memfd_for_each_populated_section()/guest_memfd_for_each_discarded_section()
if the cb() (i.e. replay_populate()) returns error?
this will do too, yes. Thanks,
+
+ return 0;
+}
+
+static void guest_memfd_rdm_replay_discarded(const RamDiscardManager
*rdm,
+ MemoryRegionSection
*section,
+ ReplayRamDiscard
replay_fn,
+ void *opaque)
+{
+ GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+ struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
opaque };
+
+ g_assert(section->mr == gmm->mr);
+ guest_memfd_for_each_discarded_section(gmm, section, &data,
+
guest_memfd_rdm_replay_discarded_cb);
+}
+
+static void guest_memfd_manager_init(Object *obj)
+{
+ GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
+
+ QLIST_INIT(&gmm->rdl_list);
+}
+
+static void guest_memfd_manager_finalize(Object *obj)
+{
+ g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
bitmap is not allocated though. And 5/7 removes this anyway. Thanks,
Will remove it. Thanks.
+}
+
+static void guest_memfd_manager_class_init(ObjectClass *oc, void
*data)
+{
+ RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
+
+ rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
+ rdmc->register_listener = guest_memfd_rdm_register_listener;
+ rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
+ rdmc->is_populated = guest_memfd_rdm_is_populated;
+ rdmc->replay_populated = guest_memfd_rdm_replay_populated;
+ rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
+}
diff --git a/system/meson.build b/system/meson.build
index 4952f4b2c7..ed4e1137bd 100644
--- a/system/meson.build
+++ b/system/meson.build
@@ -15,6 +15,7 @@ system_ss.add(files(
'dirtylimit.c',
'dma-helpers.c',
'globals.c',
+ 'guest-memfd-manager.c',
'memory_mapping.c',
'qdev-monitor.c',
'qtest.c',
--
Alexey