Re: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager

On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
> 
> 
> On 8/1/25 21:56, Chenyi Qiang wrote:
>>
>>
>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>> operation to perform page conversion between private and shared memory.
>>>> This can lead to a stale IOMMU mapping issue when assigning a hardware
>>>> device to a confidential VM via shared memory (unprotected memory
>>>> pages). Blocking shared page discard can solve this problem, but it
>>>> could cause guests to consume twice the memory with VFIO, which is not
>>>> acceptable in some cases. An alternative solution is to notify other
>>>> systems like VFIO so they refresh their outdated IOMMU mappings.
>>>>
>>>> RamDiscardManager is an existing concept (used by virtio-mem) to adjust
>>>> VFIO mappings in relation to VM page assignment. Effectively, page
>>>> conversion is similar to hot-removing a page in one mode and adding it
>>>> back in the other, so the same work that happens in response to
>>>> virtio-mem changes needs to happen for page conversion events.
>>>> Introduce RamDiscardManager support for guest_memfd to achieve this.
>>>>
>>>> However, guest_memfd is not an object so it cannot directly implement
>>>> the RamDiscardManager interface.
>>>>
>>>> One solution is to implement the interface in HostMemoryBackend. Any
>>>
>>> This sounds about right.
>>>
>>>> guest_memfd-backed host memory backend can register itself in the
>>>> target
>>>> MemoryRegion. However, this solution doesn't cover the scenario where a
>>>> guest_memfd MemoryRegion doesn't belong to the HostMemoryBackend, e.g.
>>>> the virtual BIOS MemoryRegion.
>>>
>>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>
>> The virtual BIOS shows up in a separate region:
>>
>>   Root memory region: system
>>    0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>    ...
>>    00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
> 
> Looks like a normal MR which can be backed by guest_memfd.

Yes, the virtual BIOS memory region is initialized by
memory_region_init_ram_guest_memfd(), so it is backed by a guest_memfd.

The tricky thing is that, for Intel TDX (not sure about AMD SEV), the virtual
BIOS image is loaded and then copied into the private region. After that,
the loaded image is discarded and this region becomes useless. So I feel
like this virtual BIOS region should perhaps not be backed by guest_memfd?
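
For reference, today the backing comes from something roughly like this (a
simplified sketch; memory_region_init_ram_guest_memfd() is the real helper,
but the wrapper function and the require_guest_memfd flag below are just
illustrative):

#include "qemu/osdep.h"
#include "exec/memory.h"
#include "qapi/error.h"

/* Simplified sketch of how pc.bios gets its guest_memfd backing */
static void bios_region_init_sketch(MemoryRegion *bios, uint64_t bios_size,
                                    bool require_guest_memfd)
{
    if (require_guest_memfd) {
        /* Confidential guests: RAM region backed by a guest_memfd */
        memory_region_init_ram_guest_memfd(bios, NULL, "pc.bios",
                                           bios_size, &error_fatal);
    } else {
        /* Non-confidential guests: plain RAM as before */
        memory_region_init_ram(bios, NULL, "pc.bios", bios_size,
                               &error_fatal);
    }
}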

> 
>>    0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>> @0000000080000000 KVM
> 
> Anyway if there is no guest_memfd backing it and
> memory_region_has_ram_discard_manager() returns false, then the MR is
> just going to be mapped for VFIO as usual which seems... alright, right?

Correct. As the vBIOS is backed by guest_memfd and we implement the RDM
in guest_memfd_manager, the vBIOS MR won't be mapped by VFIO.

If we go with HostMemoryBackend instead of guest_memfd_manager, this
MR would be mapped by VFIO. Maybe we need to avoid such a vBIOS mapping, or
just ignore it since the MR is useless anyway (though that doesn't look great).
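
To illustrate, the relevant check on the VFIO side looks roughly like this
(simplified sketch of the region_add path, cf. vfio_listener_region_add();
the wrapper function name below is made up and the internal helper signatures
are abbreviated):

/* Sketch of the decision point in VFIO's MemoryListener region_add path */
static void vfio_region_add_sketch(VFIOContainer *container,
                                   MemoryRegionSection *section)
{
    if (memory_region_has_ram_discard_manager(section->mr)) {
        /*
         * The MR is managed by a RamDiscardManager: do not map it as a
         * whole. Register a RamDiscardListener instead and DMA-map/unmap
         * only the populated (shared) parts as notifications arrive.
         */
        vfio_register_ram_discard_listener(container, section);
        return;
    }

    /* No RDM: the section gets vfio_dma_map()'ed as a whole, as usual. */
}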

> 
> 
>> We also considered implementing the interface in HostMemoryBackend, but
>> implementing it per guest_memfd region may be more general. We don't know
>> whether all DMAable memory will always belong to a HostMemoryBackend,
>> although at present it does.
>>
>> If it is more appropriate to implement it in HostMemoryBackend, I can
>> change to that approach.
> 
> Seems cleaner imho.

I can go this way.
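
Just to confirm we mean the same thing, on the backend side it would be
something like this (sketch only; the hostmem_rdm_* callbacks are
placeholders, and TYPE_RAM_DISCARD_MANAGER would also need to be added to the
backend type's InterfaceInfo list):

static void host_memory_backend_class_init(ObjectClass *oc, void *data)
{
    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);

    /* existing HostMemoryBackend class setup stays as-is ... */

    /* wire up the RamDiscardManager interface on the backend class */
    rdmc->get_min_granularity = hostmem_rdm_get_min_granularity;
    rdmc->register_listener = hostmem_rdm_register_listener;
    rdmc->unregister_listener = hostmem_rdm_unregister_listener;
    rdmc->is_populated = hostmem_rdm_is_populated;
    rdmc->replay_populated = hostmem_rdm_replay_populated;
    rdmc->replay_discarded = hostmem_rdm_replay_discarded;
}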

> 
>>>
>>>
>>>> Thus, choose the second option, i.e. define an object type named
>>>> guest_memfd_manager with RamDiscardManager interface. Upon creation of
>>>> guest_memfd, a new guest_memfd_manager object can be instantiated and
>>>> registered to the managed guest_memfd MemoryRegion to handle the page
>>>> conversion events.
>>>>
>>>> In the context of guest_memfd, the discarded state signifies that the
>>>> page is private, while the populated state indicates that the page is
>>>> shared. The state of the memory is tracked at the granularity of the
>>>> host page size (i.e. block_size), as the minimum conversion size can be
>>>> one page per request.
>>>>
>>>> In addition, VFIO expects the DMA mapping for a specific iova to be
>>>> mapped and unmapped with the same granularity. However, confidential
>>>> VMs may do partial conversions, e.g. a conversion that happens on a
>>>> small region within a large region. To prevent such invalid cases, and
>>>> until a potential optimization comes along, all operations are
>>>> performed at 4K granularity.
>>>>
>>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@xxxxxxxxx>
>>>> ---
>>>>    include/sysemu/guest-memfd-manager.h |  46 +++++
>>>>    system/guest-memfd-manager.c         | 250 ++++++++++++++++++++++
>>>> +++++
>>>>    system/meson.build                   |   1 +
>>>>    3 files changed, 297 insertions(+)
>>>>    create mode 100644 include/sysemu/guest-memfd-manager.h
>>>>    create mode 100644 system/guest-memfd-manager.c
>>>>
>>>> diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/
>>>> guest-memfd-manager.h
>>>> new file mode 100644
>>>> index 0000000000..ba4a99b614
>>>> --- /dev/null
>>>> +++ b/include/sysemu/guest-memfd-manager.h
>>>> @@ -0,0 +1,46 @@
>>>> +/*
>>>> + * QEMU guest memfd manager
>>>> + *
>>>> + * Copyright Intel
>>>> + *
>>>> + * Author:
>>>> + *      Chenyi Qiang <chenyi.qiang@xxxxxxxxx>
>>>> + *
>>>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>>>> later.
>>>> + * See the COPYING file in the top-level directory
>>>> + *
>>>> + */
>>>> +
>>>> +#ifndef SYSEMU_GUEST_MEMFD_MANAGER_H
>>>> +#define SYSEMU_GUEST_MEMFD_MANAGER_H
>>>> +
>>>> +#include "sysemu/hostmem.h"
>>>> +
>>>> +#define TYPE_GUEST_MEMFD_MANAGER "guest-memfd-manager"
>>>> +
>>>> +OBJECT_DECLARE_TYPE(GuestMemfdManager, GuestMemfdManagerClass,
>>>> GUEST_MEMFD_MANAGER)
>>>> +
>>>> +struct GuestMemfdManager {
>>>> +    Object parent;
>>>> +
>>>> +    /* Managed memory region. */
>>>
>>> Do not need this comment. And the period.
>>
>> [...]
>>
>>>
>>>> +    MemoryRegion *mr;
>>>> +
>>>> +    /*
>>>> +     * 1-setting of the bit represents the memory is populated
>>>> (shared).
>>>> +     */
>>
>> Will fix it.
>>
>>>
>>> Could be 1 line comment.
>>>
>>>> +    int32_t bitmap_size;
>>>
>>> int or unsigned
>>>
>>>> +    unsigned long *bitmap;
>>>> +
>>>> +    /* block size and alignment */
>>>> +    uint64_t block_size;
>>>
>>> unsigned?
>>>
>>> (u)int(32|64)_t make sense for migrations which is not the case (yet?).
>>> Thanks,
>>
>> I think these fields would be helpful for future migration support.
>> Maybe defining them this way is more straightforward.
>>
>>>
>>>> +
>>>> +    /* listeners to notify on populate/discard activity. */
>>>
>>> Do not really need this comment either imho.
>>>
>>
>> I prefer to provide a comment for each field, as virtio-mem does. If it
>> is not necessary, I can remove the obvious ones.
> 
> [bikeshedding on] But the "RamDiscardListener" word says that already,
> why repeating? :) It should add information, not duplicate. Like the
> block_size comment which mentions "alignment" [bikeshedding off]

Got it. Thanks!

> 
>>>> +    QLIST_HEAD(, RamDiscardListener) rdl_list;
>>>> +};
>>>> +
>>>> +struct GuestMemfdManagerClass {
>>>> +    ObjectClass parent_class;
>>>> +};
>>>> +
>>>> +#endif
>>
>> [...]
>>
>>             void *arg,
>>>> +
>>>> guest_memfd_section_cb cb)
>>>> +{
>>>> +    unsigned long first_one_bit, last_one_bit;
>>>> +    uint64_t offset, size;
>>>> +    int ret = 0;
>>>> +
>>>> +    first_one_bit = section->offset_within_region / gmm->block_size;
>>>> +    first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>>> first_one_bit);
>>>> +
>>>> +    while (first_one_bit < gmm->bitmap_size) {
>>>> +        MemoryRegionSection tmp = *section;
>>>> +
>>>> +        offset = first_one_bit * gmm->block_size;
>>>> +        last_one_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
>>>> +                                          first_one_bit + 1) - 1;
>>>> +        size = (last_one_bit - first_one_bit + 1) * gmm->block_size;
>>>
>>> This tries calling cb() on bigger chunks even though we say from the
>>> beginning that only page size is supported?
>>>
>>> Maybe simplify this for now and extend it if/when VFIO learns to split
>>> mappings, or just drop it when we get in-place page state conversion
>>> (which will make this all irrelevant)?
>>
>> The cb() is called with big chunks, but it actually does the split at
>> block_size granularity inside the cb(). See
>> vfio_ram_discard_notify_populate(), which does the DMA_MAP at the
>> granularity size.
> 
> 
> Right, and this all happens inside QEMU - first the code finds bigger
> chunks and then it splits them anyway to call the VFIO driver. Seems
> pointless to bother about bigger chunks here.
> 
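
Fair enough. If we drop the coalescing, the loop could become something like
this (untested sketch, invoking cb() strictly per block instead of on bigger
chunks):

static int guest_memfd_for_each_populated_section(const GuestMemfdManager *gmm,
                                                  MemoryRegionSection *section,
                                                  void *arg,
                                                  guest_memfd_section_cb cb)
{
    unsigned long bit = section->offset_within_region / gmm->block_size;
    int ret = 0;

    bit = find_next_bit(gmm->bitmap, gmm->bitmap_size, bit);
    while (bit < gmm->bitmap_size) {
        MemoryRegionSection tmp = *section;

        /* one cb() invocation per populated block, no coalescing */
        if (!memory_region_section_intersect_range(&tmp, bit * gmm->block_size,
                                                   gmm->block_size)) {
            break;
        }

        ret = cb(&tmp, arg);
        if (ret) {
            break;
        }

        bit = find_next_bit(gmm->bitmap, gmm->bitmap_size, bit + 1);
    }

    return ret;
}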
>>
>>>
>>>
>>>> +
>>>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>>>> size)) {
>>>> +            break;
>>>> +        }
>>>> +
>>>> +        ret = cb(&tmp, arg);
>>>> +        if (ret) {
>>>> +            break;
>>>> +        }
>>>> +
>>>> +        first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>>> +                                      last_one_bit + 2);
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static int guest_memfd_for_each_discarded_section(const
>>>> GuestMemfdManager *gmm,
>>>> +                                                  MemoryRegionSection
>>>> *section,
>>>> +                                                  void *arg,
>>>> +
>>>> guest_memfd_section_cb cb)
>>>> +{
>>>> +    unsigned long first_zero_bit, last_zero_bit;
>>>> +    uint64_t offset, size;
>>>> +    int ret = 0;
>>>> +
>>>> +    first_zero_bit = section->offset_within_region / gmm->block_size;
>>>> +    first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
>>>> +                                        first_zero_bit);
>>>> +
>>>> +    while (first_zero_bit < gmm->bitmap_size) {
>>>> +        MemoryRegionSection tmp = *section;
>>>> +
>>>> +        offset = first_zero_bit * gmm->block_size;
>>>> +        last_zero_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
>>>> +                                      first_zero_bit + 1) - 1;
>>>> +        size = (last_zero_bit - first_zero_bit + 1) * gmm->block_size;
>>>> +
>>>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>>>> size)) {
>>>> +            break;
>>>> +        }
>>>> +
>>>> +        ret = cb(&tmp, arg);
>>>> +        if (ret) {
>>>> +            break;
>>>> +        }
>>>> +
>>>> +        first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
>>>> +                                            last_zero_bit + 2);
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static uint64_t guest_memfd_rdm_get_min_granularity(const
>>>> RamDiscardManager *rdm,
>>>> +                                                    const
>>>> MemoryRegion *mr)
>>>> +{
>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>> +
>>>> +    g_assert(mr == gmm->mr);
>>>> +    return gmm->block_size;
>>>> +}
>>>> +
>>>> +static void guest_memfd_rdm_register_listener(RamDiscardManager *rdm,
>>>> +                                              RamDiscardListener *rdl,
>>>> +                                              MemoryRegionSection
>>>> *section)
>>>> +{
>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>> +    int ret;
>>>> +
>>>> +    g_assert(section->mr == gmm->mr);
>>>> +    rdl->section = memory_region_section_new_copy(section);
>>>> +
>>>> +    QLIST_INSERT_HEAD(&gmm->rdl_list, rdl, next);
>>>> +
>>>> +    ret = guest_memfd_for_each_populated_section(gmm, section, rdl,
>>>> +
>>>> guest_memfd_notify_populate_cb);
>>>> +    if (ret) {
>>>> +        error_report("%s: Failed to register RAM discard listener:
>>>> %s", __func__,
>>>> +                     strerror(-ret));
>>>> +    }
>>>> +}
>>>> +
>>>> +static void guest_memfd_rdm_unregister_listener(RamDiscardManager
>>>> *rdm,
>>>> +                                                RamDiscardListener
>>>> *rdl)
>>>> +{
>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>> +    int ret;
>>>> +
>>>> +    g_assert(rdl->section);
>>>> +    g_assert(rdl->section->mr == gmm->mr);
>>>> +
>>>> +    ret = guest_memfd_for_each_populated_section(gmm, rdl->section,
>>>> rdl,
>>>> +
>>>> guest_memfd_notify_discard_cb);
>>>> +    if (ret) {
>>>> +        error_report("%s: Failed to unregister RAM discard listener:
>>>> %s", __func__,
>>>> +                     strerror(-ret));
>>>> +    }
>>>> +
>>>> +    memory_region_section_free_copy(rdl->section);
>>>> +    rdl->section = NULL;
>>>> +    QLIST_REMOVE(rdl, next);
>>>> +
>>>> +}
>>>> +
>>>> +typedef struct GuestMemfdReplayData {
>>>> +    void *fn;
>>>
>>> s/void */ReplayRamPopulate/
>>
>> [...]
>>
>>>
>>>> +    void *opaque;
>>>> +} GuestMemfdReplayData;
>>>> +
>>>> +static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection
>>>> *section, void *arg)
>>>> +{
>>>> +    struct GuestMemfdReplayData *data = arg;
>>>
>>> Drop "struct" here and below.
>>
>> Fixed. Thanks!
>>
>>>
>>>> +    ReplayRamPopulate replay_fn = data->fn;
>>>> +
>>>> +    return replay_fn(section, data->opaque);
>>>> +}
>>>> +
>>>> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager
>>>> *rdm,
>>>> +                                            MemoryRegionSection
>>>> *section,
>>>> +                                            ReplayRamPopulate
>>>> replay_fn,
>>>> +                                            void *opaque)
>>>> +{
>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>>> opaque };
>>>> +
>>>> +    g_assert(section->mr == gmm->mr);
>>>> +    return guest_memfd_for_each_populated_section(gmm, section, &data,
>>>> +
>>>> guest_memfd_rdm_replay_populated_cb);
>>>> +}
>>>> +
>>>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection
>>>> *section, void *arg)
>>>> +{
>>>> +    struct GuestMemfdReplayData *data = arg;
>>>> +    ReplayRamDiscard replay_fn = data->fn;
>>>> +
>>>> +    replay_fn(section, data->opaque);
>>>
>>>
>>> guest_memfd_rdm_replay_populated_cb() checks for errors though.
>>
>> It follows the current definitions of ReplayRamDiscard() and
>> ReplayRamPopulate(), where replay_discard() doesn't return errors and
>> replay_populate() does.
> 
> A trace would be appropriate imho. Thanks,

Sorry, I didn't quite catch that. What kind of info should be traced? The
errors returned by replay_populate()?
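
If you mean tracing the error from replay_populate(), I guess it would look
something like this (the trace event name below is made up, just to confirm
the idea; the .c file would also need to include "trace.h"):

/*
 * New entry in system/trace-events (hypothetical name):
 *   guest_memfd_rdm_replay_populated_cb(int ret) "replay_fn failed ret=%d"
 */
static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section,
                                               void *arg)
{
    GuestMemfdReplayData *data = arg;
    ReplayRamPopulate replay_fn = data->fn;
    int ret;

    ret = replay_fn(section, data->opaque);
    if (ret) {
        /* trace the failure instead of only propagating it silently */
        trace_guest_memfd_rdm_replay_populated_cb(ret);
    }

    return ret;
}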

> 
>>>
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static void guest_memfd_rdm_replay_discarded(const RamDiscardManager
>>>> *rdm,
>>>> +                                             MemoryRegionSection
>>>> *section,
>>>> +                                             ReplayRamDiscard
>>>> replay_fn,
>>>> +                                             void *opaque)
>>>> +{
>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque =
>>>> opaque };
>>>> +
>>>> +    g_assert(section->mr == gmm->mr);
>>>> +    guest_memfd_for_each_discarded_section(gmm, section, &data,
>>>> +
>>>> guest_memfd_rdm_replay_discarded_cb);
>>>> +}
>>>> +
>>>> +static void guest_memfd_manager_init(Object *obj)
>>>> +{
>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
>>>> +
>>>> +    QLIST_INIT(&gmm->rdl_list);
>>>> +}
>>>> +
>>>> +static void guest_memfd_manager_finalize(Object *obj)
>>>> +{
>>>> +    g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
>>>
>>>
>>> bitmap is not allocated though. And 5/7 removes this anyway. Thanks,
>>
>> Will remove it. Thanks.
>>
>>>
>>>
>>>> +}
>>>> +
>>>> +static void guest_memfd_manager_class_init(ObjectClass *oc, void
>>>> *data)
>>>> +{
>>>> +    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>>>> +
>>>> +    rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
>>>> +    rdmc->register_listener = guest_memfd_rdm_register_listener;
>>>> +    rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
>>>> +    rdmc->is_populated = guest_memfd_rdm_is_populated;
>>>> +    rdmc->replay_populated = guest_memfd_rdm_replay_populated;
>>>> +    rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
>>>> +}
>>>> diff --git a/system/meson.build b/system/meson.build
>>>> index 4952f4b2c7..ed4e1137bd 100644
>>>> --- a/system/meson.build
>>>> +++ b/system/meson.build
>>>> @@ -15,6 +15,7 @@ system_ss.add(files(
>>>>      'dirtylimit.c',
>>>>      'dma-helpers.c',
>>>>      'globals.c',
>>>> +  'guest-memfd-manager.c',
>>>>      'memory_mapping.c',
>>>>      'qdev-monitor.c',
>>>>      'qtest.c',
>>>
>>
> 




