On 1/10/2025 8:58 AM, Alexey Kardashevskiy wrote:
>
>
> On 9/1/25 15:29, Chenyi Qiang wrote:
>>
>>
>> On 1/9/2025 10:55 AM, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 9/1/25 13:11, Chenyi Qiang wrote:
>>>>
>>>>
>>>> On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
>>>>>
>>>>>
>>>>> On 8/1/25 21:56, Chenyi Qiang wrote:
>>>>>>
>>>>>>
>>>>>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>>>>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>>>>>> operation to perform page conversion between private and shared
>>>>>>>> memory. This can lead to a stale IOMMU mapping issue when assigning a
>>>>>>>> hardware device to a confidential VM via shared memory (unprotected
>>>>>>>> memory pages). Blocking shared page discard can solve this problem,
>>>>>>>> but it could cause guests to consume twice the memory with VFIO,
>>>>>>>> which is not acceptable in some cases. An alternative solution is to
>>>>>>>> convince other systems like VFIO to refresh their outdated IOMMU
>>>>>>>> mappings.
>>>>>>>>
>>>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>>>> adjust VFIO mappings in relation to VM page assignment. Effectively,
>>>>>>>> page conversion is similar to hot-removing a page in one mode and
>>>>>>>> adding it back in the other, so the same work that needs to happen in
>>>>>>>> response to virtio-mem changes needs to happen for page conversion
>>>>>>>> events. Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>>>>>
>>>>>>>> However, guest_memfd is not an object, so it cannot directly
>>>>>>>> implement the RamDiscardManager interface.
>>>>>>>>
>>>>>>>> One solution is to implement the interface in HostMemoryBackend. Any
>>>>>>>
>>>>>>> This sounds about right.
>
> btw I am using this for ages:
>
> https://github.com/aik/qemu/commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
>
> but I am not sure if this ever saw the light of the day, did it?
> (ironically I am using it as a base for encrypted DMA :) )

Yeah, we are doing the same work. I saw a solution from Michael a long time
ago (when there was still a dedicated hostmem-memfd-private backend for
restrictedmem/gmem):
https://github.com/AMDESE/qemu/commit/3bf5255fc48d648724d66410485081ace41d8ee6

Your patch only implements the interface for HostMemoryBackendMemfd. Maybe it
is more appropriate to implement it for the parent object HostMemoryBackend,
because besides MEMORY_BACKEND_MEMFD, other backend types like
MEMORY_BACKEND_RAM and MEMORY_BACKEND_FILE can also be guest_memfd-backed.

Thinking more about where to implement this interface, it is still uncertain
to me. As I mentioned in another mail, a ram device memory region might also
be backed by guest_memfd if we support TEE IO iommufd MMIO in the future.
Then a specific object would be more appropriate. What's your opinion?
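To make the comparison concrete, below is a rough sketch of the dedicated-
object option: a helper that attaches the manager to any guest_memfd-backed
MR, whether or not the MR belongs to a HostMemoryBackend.
guest_memfd_manager_attach() is a made-up name for illustration;
GuestMemfdManager, TYPE_GUEST_MEMFD_MANAGER and gmm->mr follow this series,
the memory_region_*() helpers are the existing QEMU API, and the manager's
own initialization is omitted:

static void guest_memfd_manager_attach(MemoryRegion *mr)
{
    GuestMemfdManager *gmm;

    /* Only guest_memfd-backed MRs need page-conversion notifications. */
    if (!memory_region_has_guest_memfd(mr) ||
        memory_region_has_ram_discard_manager(mr)) {
        return;
    }

    gmm = GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));
    gmm->mr = mr;   /* back-pointer checked by the replay code below */
    /* block size / bitmap setup of the manager itself omitted */
    memory_region_set_ram_discard_manager(mr, RAM_DISCARD_MANAGER(gmm));
}

With something like this, the HostMemoryBackend case and the virtual BIOS
case discussed below would be handled the same way.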
>
>>>>>>>
>>>>>>>> guest_memfd-backed host memory backend can register itself in the
>>>>>>>> target MemoryRegion. However, this solution doesn't cover the
>>>>>>>> scenario where a guest_memfd MemoryRegion doesn't belong to the
>>>>>>>> HostMemoryBackend, e.g. the virtual BIOS MemoryRegion.
>>>>>>>
>>>>>>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>>>>>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>>>>>
>>>>>> The virtual BIOS shows up in a separate region:
>>>>>>
>>>>>> Root memory region: system
>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>>>>>   ...
>>>>>>   00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>>>>
>>>>> Looks like a normal MR which can be backed by guest_memfd.
>>>>
>>>> Yes, the virtual BIOS memory region is initialized by
>>>> memory_region_init_ram_guest_memfd(), so it is backed by a guest_memfd.
>>>>
>>>> The tricky thing is, for Intel TDX (not sure about AMD SEV), the virtual
>>>> BIOS image will be loaded and then copied to the private region. After
>>>> that, the loaded image will be discarded and this region becomes useless.
>>>
>>> I'd think it is loaded as "struct Rom" and then copied to the MR-
>>> ram_guest_memfd(), which does not leave the MR useless - we still see
>>> "pc.bios" in the list so it is not discarded. What piece of code are you
>>> referring to exactly?
>>
>> Sorry for the confusion, maybe the vBIOS handling is different between TDX
>> and SEV-SNP.
>>
>> In x86_bios_rom_init(), we initialize a guest_memfd-backed MR and load the
>> vBIOS image into the shared part of the guest_memfd MR. For TDX, the image
>> is then copied to a private region (not the private part of the vBIOS
>> guest_memfd MR) and the shared part is discarded. So, although the memory
>> region still exists, it seems useless.
>>
>> Is it different for SEV-SNP? Does SEV-SNP manage the vBIOS in the vBIOS
>> guest_memfd private memory?
>
> This is what it looks like on my SNP VM (which, I suspect, is the same
> as yours as hw/i386/pc.c does not distinguish Intel/AMD for this matter):

Yes, the memory region object is created on both TDX and SEV-SNP.

>
> Root memory region: system
>   0000000000000000-00000000000bffff (prio 0, ram): ram1 KVM gmemfd=20
>   00000000000c0000-00000000000dffff (prio 1, ram): pc.rom KVM gmemfd=27
>   00000000000e0000-000000001fffffff (prio 0, ram): ram1
>     @00000000000e0000 KVM gmemfd=20
>   ...
>   00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM gmemfd=26
>
> So the pc.bios MR exists and is in use (hence its appearance in
> "info mtree -f").
>
>
> I added the gmemfd dumping:
>
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -3446,6 +3446,9 @@ static void mtree_print_flatview(gpointer key, gpointer value,
>                  }
>              }
>          }
> +        if (mr->ram_block && mr->ram_block->guest_memfd >= 0) {
> +            qemu_printf(" gmemfd=%d", mr->ram_block->guest_memfd);
> +        }
>

Then I think the virtual BIOS is another case that does not belong to a
HostMemoryBackend, which convinces us to implement the interface in a
specific object, no?
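For reference, the pc.bios MR is created directly by the machine setup code,
roughly like below (paraphrased from x86_bios_rom_init(), so the exact
upstream code may differ). There is simply no HostMemoryBackend behind it
that could implement the interface:

    /* paraphrased sketch, not the exact upstream code */
    if (machine_require_guest_memfd(MACHINE(x86ms))) {
        memory_region_init_ram_guest_memfd(&x86ms->bios, NULL, "pc.bios",
                                           bios_size, &error_fatal);
    } else {
        memory_region_init_ram(&x86ms->bios, NULL, "pc.bios",
                               bios_size, &error_fatal);
    }
    /* ... loading of the BIOS image elided ... */
    memory_region_add_subregion(rom_memory,
                                (uint32_t)(-bios_size), &x86ms->bios);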
>
>>>
>>>
>>>> So I feel like this virtual BIOS should not be backed by guest_memfd?
>>>
>>> From the above it sounds like the opposite, i.e. it should :)
>>>
>>>>>
>>>>>> 0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>>>>>>   @0000000080000000 KVM
>>>>>
>>>>> Anyway if there is no guest_memfd backing it and
>>>>> memory_region_has_ram_discard_manager() returns false, then the MR is
>>>>> just going to be mapped for VFIO as usual, which seems... alright,
>>>>> right?
>>>>
>>>> Correct. As the vBIOS is backed by guest_memfd and we implement the RDM
>>>> for guest_memfd_manager, the vBIOS MR won't be mapped by VFIO.
>>>>
>>>> If we go with the HostMemoryBackend instead of guest_memfd_manager, this
>>>> MR would be mapped by VFIO. Maybe we need to avoid such a vBIOS mapping,
>>>> or just ignore it since the MR is useless (but that doesn't look so good).
>>>
>>> Sorry, I am missing necessary details here, let's figure out the above.
>>>
>>>>
>>>>>
>>>>>
>>>>>> We also considered implementing the interface in HostMemoryBackend,
>>>>>> but maybe implementing it with the guest_memfd region is more general.
>>>>>> We don't know if all DMAable memory will always belong to a
>>>>>> HostMemoryBackend, although at present it does.
>>>>>>
>>>>>> If it is more appropriate to implement it with HostMemoryBackend, I
>>>>>> can change to this way.
>>>>>
>>>>> Seems cleaner imho.
>>>>
>>>> I can go this way.
>>
>> [...]
>>
>>>>>>>> +
>>>>>>>> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager *rdm,
>>>>>>>> +                                            MemoryRegionSection *section,
>>>>>>>> +                                            ReplayRamPopulate replay_fn,
>>>>>>>> +                                            void *opaque)
>>>>>>>> +{
>>>>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>>>>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
>>>>>>>> +
>>>>>>>> +    g_assert(section->mr == gmm->mr);
>>>>>>>> +    return guest_memfd_for_each_populated_section(gmm, section, &data,
>>>>>>>> +                                                  guest_memfd_rdm_replay_populated_cb);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection *section, void *arg)
>>>>>>>> +{
>>>>>>>> +    struct GuestMemfdReplayData *data = arg;
>>>>>>>> +    ReplayRamDiscard replay_fn = data->fn;
>>>>>>>> +
>>>>>>>> +    replay_fn(section, data->opaque);
>>>>>>>
>>>>>>>
>>>>>>> guest_memfd_rdm_replay_populated_cb() checks for errors though.
>>>>>>
>>>>>> It follows the current definition of ReplayRamDiscard() and
>>>>>> ReplayRamPopulate(), where replay_discard() doesn't return errors and
>>>>>> replay_populate() returns errors.
>>>>>
>>>>> A trace would be appropriate imho. Thanks,
>>>>
>>>> Sorry, I can't follow you. What kind of info should be traced? The
>>>> errors returned by replay_populate()?
>>>
>>> Yeah. imho these are useful as we expect this part to work in general
>>> too, right? Thanks,
>>
>> Something like this?
>>
>> diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
>> index 6b3e1ee9d6..4440ac9e59 100644
>> --- a/system/guest-memfd-manager.c
>> +++ b/system/guest-memfd-manager.c
>> @@ -185,8 +185,14 @@ static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, void *arg)
>>  {
>>      struct GuestMemfdReplayData *data = arg;
>>      ReplayRamPopulate replay_fn = data->fn;
>> +    int ret;
>>
>> -    return replay_fn(section, data->opaque);
>> +    ret = replay_fn(section, data->opaque);
>> +    if (ret) {
>> +        trace_guest_memfd_rdm_replay_populated_cb(ret);
>> +    }
>> +
>> +    return ret;
>>  }
>>
>> How about just adding some error output in
>> guest_memfd_for_each_populated_section()/
>> guest_memfd_for_each_discarded_section() if the cb() (i.e.
>> replay_populate()) returns an error?
>
> this will do too, yes. Thanks,
>
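For the record, the error output I have in mind is something like the helper
below, which both guest_memfd_for_each_populated_section() and
guest_memfd_for_each_discarded_section() could call instead of invoking cb()
directly. This is only a sketch against the posted series: the wrapper name
and the message are made up, and the callback typedef just mirrors the
signature the series already uses.

#include "qemu/osdep.h"
#include "qemu/error-report.h"
#include "exec/memory.h"

typedef int (*guest_memfd_section_cb)(MemoryRegionSection *s, void *arg);

static int guest_memfd_call_section_cb(MemoryRegionSection *section,
                                       void *arg, guest_memfd_section_cb cb)
{
    int ret = cb(section, arg);

    if (ret) {
        /* make a failing replay_populate()/replay_discard() visible */
        error_report("%s: callback failed on %s (offset 0x%" HWADDR_PRIx
                     ", size 0x%" PRIx64 "): %d",
                     __func__, memory_region_name(section->mr),
                     section->offset_within_region,
                     int128_get64(section->size), ret);
    }

    return ret;
}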