On 1/9/2025 10:55 AM, Alexey Kardashevskiy wrote:
>
> On 9/1/25 13:11, Chenyi Qiang wrote:
>>
>> On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
>>>
>>> On 8/1/25 21:56, Chenyi Qiang wrote:
>>>>
>>>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>>>> operation to perform page conversion between private and shared
>>>>>> memory. This can lead to a stale IOMMU mapping issue when assigning
>>>>>> a hardware device to a confidential VM via shared memory (unprotected
>>>>>> memory pages). Blocking shared page discard can solve this problem,
>>>>>> but it could cause guests to consume twice the memory with VFIO,
>>>>>> which is not acceptable in some cases. An alternative solution is to
>>>>>> notify other systems like VFIO so they refresh their outdated IOMMU
>>>>>> mappings.
>>>>>>
>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>> adjust VFIO mappings in relation to VM page assignment. Effectively,
>>>>>> page conversion is similar to hot-removing a page in one mode and
>>>>>> adding it back in the other, so the same work that needs to happen
>>>>>> in response to virtio-mem changes needs to happen for page
>>>>>> conversion events. Introduce the RamDiscardManager to guest_memfd
>>>>>> to achieve it.
>>>>>>
>>>>>> However, guest_memfd is not an object, so it cannot directly
>>>>>> implement the RamDiscardManager interface.
>>>>>>
>>>>>> One solution is to implement the interface in HostMemoryBackend. Any
>>>>>
>>>>> This sounds about right.
>>>>>
>>>>>> guest_memfd-backed host memory backend can register itself in the
>>>>>> target MemoryRegion. However, this solution doesn't cover the
>>>>>> scenario where a guest_memfd MemoryRegion doesn't belong to the
>>>>>> HostMemoryBackend, e.g. the virtual BIOS MemoryRegion.
>>>>>
>>>>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>>>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>>>
>>>> The virtual BIOS shows up in a separate region:
>>>>
>>>>   Root memory region: system
>>>>    0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>>>    ...
>>>>    00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>>
>>> Looks like a normal MR which can be backed by guest_memfd.
>>
>> Yes, the virtual BIOS memory region is initialized by
>> memory_region_init_ram_guest_memfd(), which will be backed by a
>> guest_memfd.
>>
>> The tricky thing is, for Intel TDX (not sure about AMD SEV), the virtual
>> BIOS image will be loaded and then copied to a private region. After
>> that, the loaded image will be discarded and this region becomes
>> useless.
>
> I'd think it is loaded as "struct Rom" and then copied to the
> MR->ram_guest_memfd(), which does not leave the MR useless - we still
> see "pc.bios" in the list so it is not discarded. What piece of code
> are you referring to exactly?

Sorry for the confusion; maybe it is different between TDX and SEV-SNP
for the vBIOS handling.

In x86_bios_rom_init(), a guest_memfd-backed MR is initialized and the
vBIOS image is loaded into the shared part of the guest_memfd MR. For
TDX, the image is then copied to a private region (not the private part
of the vBIOS guest_memfd MR) and the shared part is discarded. So,
although the memory region still exists, it seems useless.
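For reference, the setup boils down to something like the following
simplified sketch - not the verbatim x86_bios_rom_init() code; the size
arithmetic, ISA alias, and exact arguments are paraphrased from memory:

    /* Simplified sketch of the vBIOS MR setup described above; details
     * are paraphrased from x86_bios_rom_init(), not copied verbatim. */
    MemoryRegion *bios = g_new(MemoryRegion, 1);

    memory_region_init_ram_guest_memfd(bios, NULL, "pc.bios",
                                       bios_size, &error_fatal);
    /* The ROM loader writes the image into the shared (still
     * discardable) part of the guest_memfd MR. */
    memory_region_add_subregion(rom_memory, (uint32_t)(-bios_size), bios);
    /* For TDX, the image is then copied into private memory elsewhere
     * and the shared pages are discarded, so the MR stays mapped in the
     * address space but is effectively unused afterwards. */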
It is different for SEV-SNP, correct? Does SEV-SNP manage the vBIOS in
the vBIOS guest_memfd private memory?

>
>> So I feel like this virtual BIOS should not be backed by guest_memfd?
>
> From the above it sounds like the opposite, i.e. it should :)
>
>>>
>>>> 0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>>>> @0000000080000000 KVM
>>>
>>> Anyway, if there is no guest_memfd backing it and
>>> memory_region_has_ram_discard_manager() returns false, then the MR is
>>> just going to be mapped for VFIO as usual, which seems... alright,
>>> right?
>>
>> Correct. As the vBIOS is backed by guest_memfd and we implement the RDM
>> for guest_memfd_manager, the vBIOS MR won't be mapped by VFIO.
>>
>> If we go with the HostMemoryBackend instead of guest_memfd_manager,
>> this MR would be mapped by VFIO. Maybe we need to avoid such a vBIOS
>> mapping, or just ignore it since the MR is useless (but that looks not
>> so good).
>
> Sorry, I am missing some necessary details here; let's figure out the
> above.
>
>>
>>>
>>>> We also considered implementing the interface in HostMemoryBackend,
>>>> but maybe implementing it with the guest_memfd region is more
>>>> general. We don't know if any DMAable memory would belong to
>>>> HostMemoryBackend, although at present it does.
>>>>
>>>> If it is more appropriate to implement it with HostMemoryBackend, I
>>>> can change to this way.
>>>
>>> Seems cleaner imho.
>>
>> I can go this way.

[...]

>>>>>> +
>>>>>> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager *rdm,
>>>>>> +                                            MemoryRegionSection *section,
>>>>>> +                                            ReplayRamPopulate replay_fn,
>>>>>> +                                            void *opaque)
>>>>>> +{
>>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
>>>>>> +
>>>>>> +    g_assert(section->mr == gmm->mr);
>>>>>> +    return guest_memfd_for_each_populated_section(gmm, section, &data,
>>>>>> +                                                  guest_memfd_rdm_replay_populated_cb);
>>>>>> +}
>>>>>> +
>>>>>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection *section, void *arg)
>>>>>> +{
>>>>>> +    struct GuestMemfdReplayData *data = arg;
>>>>>> +    ReplayRamDiscard replay_fn = data->fn;
>>>>>> +
>>>>>> +    replay_fn(section, data->opaque);
>>>>>
>>>>>
>>>>> guest_memfd_rdm_replay_populated_cb() checks for errors though.
>>>>
>>>> It follows the current definition of ReplayRamDiscard() and
>>>> ReplayRamPopulate(), where replay_discard() doesn't return errors and
>>>> replay_populate() returns errors.
>>>
>>> A trace would be appropriate imho. Thanks,
>>
>> Sorry, I can't follow you. What kind of info should be traced? The
>> errors returned by replay_populate()?
>
> Yeah. imho these are useful as we expect this part to work in general
> too, right? Thanks,

Something like?

diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
index 6b3e1ee9d6..4440ac9e59 100644
--- a/system/guest-memfd-manager.c
+++ b/system/guest-memfd-manager.c
@@ -185,8 +185,14 @@ static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, voi
 {
     struct GuestMemfdReplayData *data = arg;
     ReplayRamPopulate replay_fn = data->fn;
+    int ret;

-    return replay_fn(section, data->opaque);
+    ret = replay_fn(section, data->opaque);
+    if (ret) {
+        trace_guest_memfd_rdm_replay_populated_cb(ret);
+    }
+
+    return ret;
 }

How about just adding some error output in
guest_memfd_for_each_populated_section()/guest_memfd_for_each_discarded_section()
if the cb() (i.e. replay_populate()) returns an error?
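If we go the trace route, the new trace point would also need a matching
declaration in the trace-events file - presumably system/trace-events,
though the exact placement is my assumption - along the lines of:

    # in system/trace-events (assumed placement)
    guest_memfd_rdm_replay_populated_cb(int ret) "ret: %d"

plus the generated "trace.h" include in guest-memfd-manager.c so that
trace_guest_memfd_rdm_replay_populated_cb() resolves.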
>
>>
>>>
>>>>>
>>>>>> +
>>>>>> +    return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static void guest_memfd_rdm_replay_discarded(const RamDiscardManager *rdm,
>>>>>> +                                             MemoryRegionSection *section,
>>>>>> +                                             ReplayRamDiscard replay_fn,
>>>>>> +                                             void *opaque)
>>>>>> +{
>>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
>>>>>> +
>>>>>> +    g_assert(section->mr == gmm->mr);
>>>>>> +    guest_memfd_for_each_discarded_section(gmm, section, &data,
>>>>>> +                                           guest_memfd_rdm_replay_discarded_cb);
>>>>>> +}
>>>>>> +
>>>>>> +static void guest_memfd_manager_init(Object *obj)
>>>>>> +{
>>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
>>>>>> +
>>>>>> +    QLIST_INIT(&gmm->rdl_list);
>>>>>> +}
>>>>>> +
>>>>>> +static void guest_memfd_manager_finalize(Object *obj)
>>>>>> +{
>>>>>> +    g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
>>>>>
>>>>>
>>>>> bitmap is not allocated though. And 5/7 removes this anyway. Thanks,
>>>>
>>>> Will remove it. Thanks.
>>>>
>>>>>
>>>>>
>>>>>> +}
>>>>>> +
>>>>>> +static void guest_memfd_manager_class_init(ObjectClass *oc, void *data)
>>>>>> +{
>>>>>> +    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>>>>>> +
>>>>>> +    rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
>>>>>> +    rdmc->register_listener = guest_memfd_rdm_register_listener;
>>>>>> +    rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
>>>>>> +    rdmc->is_populated = guest_memfd_rdm_is_populated;
>>>>>> +    rdmc->replay_populated = guest_memfd_rdm_replay_populated;
>>>>>> +    rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
>>>>>> +}
>>>>>> diff --git a/system/meson.build b/system/meson.build
>>>>>> index 4952f4b2c7..ed4e1137bd 100644
>>>>>> --- a/system/meson.build
>>>>>> +++ b/system/meson.build
>>>>>> @@ -15,6 +15,7 @@ system_ss.add(files(
>>>>>>    'dirtylimit.c',
>>>>>>    'dma-helpers.c',
>>>>>>    'globals.c',
>>>>>> +  'guest-memfd-manager.c',
>>>>>>    'memory_mapping.c',
>>>>>>    'qdev-monitor.c',
>>>>>>    'qtest.c',
>>>>>
>>>>
>>>
>>
>
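For completeness, attaching such a manager to the guest_memfd MR would
presumably look something like the rough sketch below. TYPE_GUEST_MEMFD_MANAGER
and the gmm->mr field are taken from this series (their exact creation
path lives in other patches); memory_region_set_ram_discard_manager() is
the existing core API:

    /* Rough sketch, not code from this series: once the manager is
     * attached, memory_region_has_ram_discard_manager(mr) returns true
     * and VFIO maps only the populated (shared) parts via the
     * RamDiscardListener callbacks instead of mapping the whole MR. */
    GuestMemfdManager *gmm;

    gmm = GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));
    gmm->mr = mr;   /* the guest_memfd-backed MR, e.g. pc.bios */
    memory_region_set_ram_discard_manager(mr, RAM_DISCARD_MANAGER(gmm));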