On 1/9/2025 10:55 AM, Alexey Kardashevskiy wrote:
>
> On 9/1/25 13:11, Chenyi Qiang wrote:
>>
>> On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
>>>
>>> On 8/1/25 21:56, Chenyi Qiang wrote:
>>>>
>>>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>>>> operation to perform page conversion between private and shared
>>>>>> memory. This can lead to a stale IOMMU mapping issue when assigning
>>>>>> a hardware device to a confidential VM via shared memory (unprotected
>>>>>> memory pages). Blocking shared page discard can solve this problem,
>>>>>> but it could cause guests to consume twice the memory with VFIO,
>>>>>> which is not acceptable in some cases. An alternative solution is to
>>>>>> notify other systems like VFIO so they refresh their outdated IOMMU
>>>>>> mappings.
>>>>>>
>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>> adjust VFIO mappings in relation to VM page assignment. Effectively,
>>>>>> page conversion is similar to hot-removing a page in one mode and
>>>>>> adding it back in the other, so the same work that needs to happen
>>>>>> in response to virtio-mem changes needs to happen for page
>>>>>> conversion events. Introduce the RamDiscardManager to guest_memfd
>>>>>> to achieve it.
>>>>>>
>>>>>> However, guest_memfd is not an object, so it cannot directly
>>>>>> implement the RamDiscardManager interface.
>>>>>>
>>>>>> One solution is to implement the interface in HostMemoryBackend. Any
>>>>>
>>>>> This sounds about right.
>>>>>
>>>>>> guest_memfd-backed host memory backend can register itself in the
>>>>>> target MemoryRegion. However, this solution doesn't cover the
>>>>>> scenario where a guest_memfd MemoryRegion doesn't belong to the
>>>>>> HostMemoryBackend, e.g. the virtual BIOS MemoryRegion.
>>>>>
>>>>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>>>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>>>
>>>> The virtual BIOS shows up in a separate region:
>>>>
>>>>   Root memory region: system
>>>>    0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>>>    ...
>>>>    00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>>
>>> Looks like a normal MR which can be backed by guest_memfd.
>>
>> Yes, the virtual BIOS memory region is initialized by
>> memory_region_init_ram_guest_memfd(), which will be backed by a
>> guest_memfd.
>>
>> The tricky thing is, for Intel TDX (not sure about AMD SEV), the virtual
>> BIOS image will be loaded and then copied to a private region. After
>> that, the loaded image will be discarded and this region becomes
>> useless.
>
> I'd think it is loaded as "struct Rom" and then copied to the
> MR->ram_guest_memfd(), which does not leave the MR useless - we still
> see "pc.bios" in the list so it is not discarded. What piece of code
> are you referring to exactly?

Sorry for the confusion; maybe it is different between TDX and SEV-SNP
for the vBIOS handling.

In x86_bios_rom_init(), a guest_memfd-backed MR is initialized and the
vBIOS image is loaded into the shared part of the guest_memfd MR. For
TDX, the image is then copied to a private region (not the private part
of the vBIOS guest_memfd MR) and the shared part is discarded. So,
although the memory region still exists, it seems useless.
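For reference, the setup boils down to something like the following
simplified sketch - not the verbatim x86_bios_rom_init() code; the size
arithmetic, ISA alias, and exact arguments are paraphrased from memory:

    /* Simplified sketch of the vBIOS MR setup described above; details
     * are paraphrased from x86_bios_rom_init(), not copied verbatim. */
    MemoryRegion *bios = g_new(MemoryRegion, 1);

    memory_region_init_ram_guest_memfd(bios, NULL, "pc.bios",
                                       bios_size, &error_fatal);
    /* The ROM loader writes the image into the shared (still
     * discardable) part of the guest_memfd MR. */
    memory_region_add_subregion(rom_memory, (uint32_t)(-bios_size), bios);
    /* For TDX, the image is then copied into private memory elsewhere
     * and the shared pages are discarded, so the MR stays mapped in the
     * address space but is effectively unused afterwards. */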
It is different for SEV-SNP, correct? Does SEV-SNP manage the vBIOS in
the vBIOS guest_memfd private memory?

>
>> So I feel like this virtual BIOS should not be backed by guest_memfd?
>
> From the above it sounds like the opposite, i.e. it should :)
>
>>>
>>>> 0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>>>> @0000000080000000 KVM
>>>
>>> Anyway, if there is no guest_memfd backing it and
>>> memory_region_has_ram_discard_manager() returns false, then the MR is
>>> just going to be mapped for VFIO as usual, which seems... alright,
>>> right?
>>
>> Correct. As the vBIOS is backed by guest_memfd and we implement the RDM
>> for guest_memfd_manager, the vBIOS MR won't be mapped by VFIO.
>>
>> If we go with the HostMemoryBackend instead of guest_memfd_manager,
>> this MR would be mapped by VFIO. Maybe we need to avoid such a vBIOS
>> mapping, or just ignore it since the MR is useless (but that looks not
>> so good).
>
> Sorry, I am missing some necessary details here; let's figure out the
> above.
>
>>
>>>
>>>> We also considered implementing the interface in HostMemoryBackend,
>>>> but maybe implementing it with the guest_memfd region is more
>>>> general. We don't know if any DMAable memory would belong to
>>>> HostMemoryBackend, although at present it does.
>>>>
>>>> If it is more appropriate to implement it with HostMemoryBackend, I
>>>> can change to this way.
>>>
>>> Seems cleaner imho.
>>
>> I can go this way.

[...]

>>>>>> +
>>>>>> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager *rdm,
>>>>>> +                                            MemoryRegionSection *section,
>>>>>> +                                            ReplayRamPopulate replay_fn,
>>>>>> +                                            void *opaque)
>>>>>> +{
>>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
>>>>>> +
>>>>>> +    g_assert(section->mr == gmm->mr);
>>>>>> +    return guest_memfd_for_each_populated_section(gmm, section, &data,
>>>>>> +                                                  guest_memfd_rdm_replay_populated_cb);
>>>>>> +}
>>>>>> +
>>>>>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection *section, void *arg)
>>>>>> +{
>>>>>> +    struct GuestMemfdReplayData *data = arg;
>>>>>> +    ReplayRamDiscard replay_fn = data->fn;
>>>>>> +
>>>>>> +    replay_fn(section, data->opaque);
>>>>>
>>>>>
>>>>> guest_memfd_rdm_replay_populated_cb() checks for errors though.
>>>>
>>>> It follows the current definition of ReplayRamDiscard() and
>>>> ReplayRamPopulate(), where replay_discard() doesn't return errors and
>>>> replay_populate() returns errors.
>>>
>>> A trace would be appropriate imho. Thanks,
>>
>> Sorry, I can't follow you. What kind of info should be traced? The
>> errors returned by replay_populate()?
>
> Yeah. imho these are useful as we expect this part to work in general
> too, right? Thanks,

Something like?

diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
index 6b3e1ee9d6..4440ac9e59 100644
--- a/system/guest-memfd-manager.c
+++ b/system/guest-memfd-manager.c
@@ -185,8 +185,14 @@ static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, voi
 {
     struct GuestMemfdReplayData *data = arg;
     ReplayRamPopulate replay_fn = data->fn;
+    int ret;

-    return replay_fn(section, data->opaque);
+    ret = replay_fn(section, data->opaque);
+    if (ret) {
+        trace_guest_memfd_rdm_replay_populated_cb(ret);
+    }
+
+    return ret;
 }

How about just adding some error output in
guest_memfd_for_each_populated_section()/guest_memfd_for_each_discarded_section()
if the cb() (i.e. replay_populate()) returns an error?
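If we go the trace route, the new trace point would also need a matching
declaration in the trace-events file - presumably system/trace-events,
though the exact placement is my assumption - along the lines of:

    # in system/trace-events (assumed placement)
    guest_memfd_rdm_replay_populated_cb(int ret) "ret: %d"

plus the generated "trace.h" include in guest-memfd-manager.c so that
trace_guest_memfd_rdm_replay_populated_cb() resolves.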
>
>>
>>>
>>>>>
>>>>>> +
>>>>>> +    return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static void guest_memfd_rdm_replay_discarded(const RamDiscardManager *rdm,
>>>>>> +                                             MemoryRegionSection *section,
>>>>>> +                                             ReplayRamDiscard replay_fn,
>>>>>> +                                             void *opaque)
>>>>>> +{
>>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
>>>>>> +
>>>>>> +    g_assert(section->mr == gmm->mr);
>>>>>> +    guest_memfd_for_each_discarded_section(gmm, section, &data,
>>>>>> +                                           guest_memfd_rdm_replay_discarded_cb);
>>>>>> +}
>>>>>> +
>>>>>> +static void guest_memfd_manager_init(Object *obj)
>>>>>> +{
>>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
>>>>>> +
>>>>>> +    QLIST_INIT(&gmm->rdl_list);
>>>>>> +}
>>>>>> +
>>>>>> +static void guest_memfd_manager_finalize(Object *obj)
>>>>>> +{
>>>>>> +    g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
>>>>>
>>>>>
>>>>> bitmap is not allocated though. And 5/7 removes this anyway. Thanks,
>>>>
>>>> Will remove it. Thanks.
>>>>
>>>>>
>>>>>
>>>>>> +}
>>>>>> +
>>>>>> +static void guest_memfd_manager_class_init(ObjectClass *oc, void *data)
>>>>>> +{
>>>>>> +    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
>>>>>> +
>>>>>> +    rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
>>>>>> +    rdmc->register_listener = guest_memfd_rdm_register_listener;
>>>>>> +    rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
>>>>>> +    rdmc->is_populated = guest_memfd_rdm_is_populated;
>>>>>> +    rdmc->replay_populated = guest_memfd_rdm_replay_populated;
>>>>>> +    rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
>>>>>> +}
>>>>>> diff --git a/system/meson.build b/system/meson.build
>>>>>> index 4952f4b2c7..ed4e1137bd 100644
>>>>>> --- a/system/meson.build
>>>>>> +++ b/system/meson.build
>>>>>> @@ -15,6 +15,7 @@ system_ss.add(files(
>>>>>>    'dirtylimit.c',
>>>>>>    'dma-helpers.c',
>>>>>>    'globals.c',
>>>>>> +  'guest-memfd-manager.c',
>>>>>>    'memory_mapping.c',
>>>>>>    'qdev-monitor.c',
>>>>>>    'qtest.c',
>>>>>
>>>>
>>>
>>
>
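For completeness, attaching such a manager to the guest_memfd MR would
presumably look something like the rough sketch below. TYPE_GUEST_MEMFD_MANAGER
and the gmm->mr field are taken from this series (their exact creation
path lives in other patches); memory_region_set_ram_discard_manager() is
the existing core API:

    /* Rough sketch, not code from this series: once the manager is
     * attached, memory_region_has_ram_discard_manager(mr) returns true
     * and VFIO maps only the populated (shared) parts via the
     * RamDiscardListener callbacks instead of mapping the whole MR. */
    GuestMemfdManager *gmm;

    gmm = GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));
    gmm->mr = mr;   /* the guest_memfd-backed MR, e.g. pc.bios */
    memory_region_set_ram_discard_manager(mr, RAM_DISCARD_MANAGER(gmm));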