On 1/10/2025 8:58 AM, Alexey Kardashevskiy wrote:
>
>
> On 9/1/25 15:29, Chenyi Qiang wrote:
>>
>>
>> On 1/9/2025 10:55 AM, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 9/1/25 13:11, Chenyi Qiang wrote:
>>>>
>>>>
>>>> On 1/8/2025 7:20 PM, Alexey Kardashevskiy wrote:
>>>>>
>>>>>
>>>>> On 8/1/25 21:56, Chenyi Qiang wrote:
>>>>>>
>>>>>>
>>>>>> On 1/8/2025 12:48 PM, Alexey Kardashevskiy wrote:
>>>>>>> On 13/12/24 18:08, Chenyi Qiang wrote:
>>>>>>>> As the commit 852f0048f3 ("RAMBlock: make guest_memfd require
>>>>>>>> uncoordinated discard") highlighted, some subsystems like VFIO might
>>>>>>>> disable ram block discard. However, guest_memfd relies on the discard
>>>>>>>> operation to perform page conversion between private and shared
>>>>>>>> memory. This can lead to a stale IOMMU mapping issue when assigning a
>>>>>>>> hardware device to a confidential VM via shared memory (unprotected
>>>>>>>> memory pages). Blocking shared page discard can solve this problem,
>>>>>>>> but it could cause guests to consume twice the memory with VFIO,
>>>>>>>> which is not acceptable in some cases. An alternative solution is to
>>>>>>>> convince other systems like VFIO to refresh their outdated IOMMU
>>>>>>>> mappings.
>>>>>>>>
>>>>>>>> RamDiscardManager is an existing concept (used by virtio-mem) to
>>>>>>>> adjust VFIO mappings in relation to VM page assignment. Effectively,
>>>>>>>> page conversion is similar to hot-removing a page in one mode and
>>>>>>>> adding it back in the other, so the same work that needs to happen in
>>>>>>>> response to virtio-mem changes needs to happen for page conversion
>>>>>>>> events. Introduce the RamDiscardManager to guest_memfd to achieve it.
>>>>>>>>
>>>>>>>> However, guest_memfd is not an object, so it cannot directly
>>>>>>>> implement the RamDiscardManager interface.
>>>>>>>>
>>>>>>>> One solution is to implement the interface in HostMemoryBackend. Any
>>>>>>>
>>>>>>> This sounds about right.
>
> btw I am using this for ages:
>
> https://github.com/aik/qemu/commit/3663f889883d4aebbeb0e4422f7be5e357e2ee46
>
> but I am not sure if this ever saw the light of the day, did it?
> (ironically I am using it as a base for encrypted DMA :) )

Yeah, we are doing the same work. I saw a solution from Michael a long time
ago (when there was still a dedicated hostmem-memfd-private backend for
restrictedmem/gmem):
https://github.com/AMDESE/qemu/commit/3bf5255fc48d648724d66410485081ace41d8ee6

Your patch only implements the interface for HostMemoryBackendMemfd. Maybe it
is more appropriate to implement it for the parent object HostMemoryBackend,
because besides MEMORY_BACKEND_MEMFD, other backend types like
MEMORY_BACKEND_RAM and MEMORY_BACKEND_FILE can also be guest_memfd-backed.

Thinking more about where to implement this interface, it is still uncertain
to me. As I mentioned in another mail, a ram device memory region might also
be backed by guest_memfd if we support TEE IO iommufd MMIO in the future.
Then a specific object would be more appropriate. What's your opinion?
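To make the comparison concrete, below is a rough sketch of the dedicated-
object option: a helper that attaches the manager to any guest_memfd-backed
MR, whether or not the MR belongs to a HostMemoryBackend.
guest_memfd_manager_attach() is a made-up name for illustration;
GuestMemfdManager, TYPE_GUEST_MEMFD_MANAGER and gmm->mr follow this series,
the memory_region_*() helpers are the existing QEMU API, and the manager's
own initialization is omitted:

static void guest_memfd_manager_attach(MemoryRegion *mr)
{
    GuestMemfdManager *gmm;

    /* Only guest_memfd-backed MRs need page-conversion notifications. */
    if (!memory_region_has_guest_memfd(mr) ||
        memory_region_has_ram_discard_manager(mr)) {
        return;
    }

    gmm = GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));
    gmm->mr = mr;   /* back-pointer checked by the replay code below */
    /* block size / bitmap setup of the manager itself omitted */
    memory_region_set_ram_discard_manager(mr, RAM_DISCARD_MANAGER(gmm));
}

With something like this, the HostMemoryBackend case and the virtual BIOS
case discussed below would be handled the same way.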
>
>>>>>>>
>>>>>>>> guest_memfd-backed host memory backend can register itself in the
>>>>>>>> target MemoryRegion. However, this solution doesn't cover the
>>>>>>>> scenario where a guest_memfd MemoryRegion doesn't belong to the
>>>>>>>> HostMemoryBackend, e.g. the virtual BIOS MemoryRegion.
>>>>>>>
>>>>>>> What is this virtual BIOS MemoryRegion exactly? What does it look like
>>>>>>> in "info mtree -f"? Do we really want this memory to be DMAable?
>>>>>>
>>>>>> The virtual BIOS shows up in a separate region:
>>>>>>
>>>>>> Root memory region: system
>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): pc.ram KVM
>>>>>>   ...
>>>>>>   00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM
>>>>>
>>>>> Looks like a normal MR which can be backed by guest_memfd.
>>>>
>>>> Yes, the virtual BIOS memory region is initialized by
>>>> memory_region_init_ram_guest_memfd(), so it is backed by a guest_memfd.
>>>>
>>>> The tricky thing is, for Intel TDX (not sure about AMD SEV), the virtual
>>>> BIOS image will be loaded and then copied to the private region. After
>>>> that, the loaded image will be discarded and this region becomes useless.
>>>
>>> I'd think it is loaded as "struct Rom" and then copied to the MR-
>>> ram_guest_memfd(), which does not leave the MR useless - we still see
>>> "pc.bios" in the list so it is not discarded. What piece of code are you
>>> referring to exactly?
>>
>> Sorry for the confusion, maybe the vBIOS handling is different between TDX
>> and SEV-SNP.
>>
>> In x86_bios_rom_init(), we initialize a guest_memfd-backed MR and load the
>> vBIOS image into the shared part of the guest_memfd MR. For TDX, the image
>> is then copied to a private region (not the private part of the vBIOS
>> guest_memfd MR) and the shared part is discarded. So, although the memory
>> region still exists, it seems useless.
>>
>> Is it different for SEV-SNP? Does SEV-SNP manage the vBIOS in the vBIOS
>> guest_memfd private memory?
>
> This is what it looks like on my SNP VM (which, I suspect, is the same
> as yours as hw/i386/pc.c does not distinguish Intel/AMD for this matter):

Yes, the memory region object is created on both TDX and SEV-SNP.

>
> Root memory region: system
>   0000000000000000-00000000000bffff (prio 0, ram): ram1 KVM gmemfd=20
>   00000000000c0000-00000000000dffff (prio 1, ram): pc.rom KVM gmemfd=27
>   00000000000e0000-000000001fffffff (prio 0, ram): ram1
>     @00000000000e0000 KVM gmemfd=20
>   ...
>   00000000ffc00000-00000000ffffffff (prio 0, ram): pc.bios KVM gmemfd=26
>
> So the pc.bios MR exists and is in use (hence its appearance in
> "info mtree -f").
>
>
> I added the gmemfd dumping:
>
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -3446,6 +3446,9 @@ static void mtree_print_flatview(gpointer key, gpointer value,
>                  }
>              }
>          }
> +        if (mr->ram_block && mr->ram_block->guest_memfd >= 0) {
> +            qemu_printf(" gmemfd=%d", mr->ram_block->guest_memfd);
> +        }
>

Then I think the virtual BIOS is another case that does not belong to a
HostMemoryBackend, which convinces us to implement the interface in a
specific object, no?
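For reference, the pc.bios MR is created directly by the machine setup code,
roughly like below (paraphrased from x86_bios_rom_init(), so the exact
upstream code may differ). There is simply no HostMemoryBackend behind it
that could implement the interface:

    /* paraphrased sketch, not the exact upstream code */
    if (machine_require_guest_memfd(MACHINE(x86ms))) {
        memory_region_init_ram_guest_memfd(&x86ms->bios, NULL, "pc.bios",
                                           bios_size, &error_fatal);
    } else {
        memory_region_init_ram(&x86ms->bios, NULL, "pc.bios",
                               bios_size, &error_fatal);
    }
    /* ... loading of the BIOS image elided ... */
    memory_region_add_subregion(rom_memory,
                                (uint32_t)(-bios_size), &x86ms->bios);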
>
>>>
>>>
>>>> So I feel like this virtual BIOS should not be backed by guest_memfd?
>>>
>>> From the above it sounds like the opposite, i.e. it should :)
>>>
>>>>>
>>>>>> 0000000100000000-000000017fffffff (prio 0, ram): pc.ram
>>>>>>   @0000000080000000 KVM
>>>>>
>>>>> Anyway if there is no guest_memfd backing it and
>>>>> memory_region_has_ram_discard_manager() returns false, then the MR is
>>>>> just going to be mapped for VFIO as usual, which seems... alright,
>>>>> right?
>>>>
>>>> Correct. As the vBIOS is backed by guest_memfd and we implement the RDM
>>>> for guest_memfd_manager, the vBIOS MR won't be mapped by VFIO.
>>>>
>>>> If we go with the HostMemoryBackend instead of guest_memfd_manager, this
>>>> MR would be mapped by VFIO. Maybe we need to avoid such a vBIOS mapping,
>>>> or just ignore it since the MR is useless (but that doesn't look so good).
>>>
>>> Sorry, I am missing necessary details here, let's figure out the above.
>>>
>>>>
>>>>>
>>>>>
>>>>>> We also considered implementing the interface in HostMemoryBackend,
>>>>>> but maybe implementing it with the guest_memfd region is more general.
>>>>>> We don't know if all DMAable memory will always belong to a
>>>>>> HostMemoryBackend, although at present it does.
>>>>>>
>>>>>> If it is more appropriate to implement it with HostMemoryBackend, I
>>>>>> can change to this way.
>>>>>
>>>>> Seems cleaner imho.
>>>>
>>>> I can go this way.
>>
>> [...]
>>
>>>>>>>> +
>>>>>>>> +static int guest_memfd_rdm_replay_populated(const RamDiscardManager *rdm,
>>>>>>>> +                                            MemoryRegionSection *section,
>>>>>>>> +                                            ReplayRamPopulate replay_fn,
>>>>>>>> +                                            void *opaque)
>>>>>>>> +{
>>>>>>>> +    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
>>>>>>>> +    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
>>>>>>>> +
>>>>>>>> +    g_assert(section->mr == gmm->mr);
>>>>>>>> +    return guest_memfd_for_each_populated_section(gmm, section, &data,
>>>>>>>> +                                                  guest_memfd_rdm_replay_populated_cb);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection *section, void *arg)
>>>>>>>> +{
>>>>>>>> +    struct GuestMemfdReplayData *data = arg;
>>>>>>>> +    ReplayRamDiscard replay_fn = data->fn;
>>>>>>>> +
>>>>>>>> +    replay_fn(section, data->opaque);
>>>>>>>
>>>>>>>
>>>>>>> guest_memfd_rdm_replay_populated_cb() checks for errors though.
>>>>>>
>>>>>> It follows the current definition of ReplayRamDiscard() and
>>>>>> ReplayRamPopulate(), where replay_discard() doesn't return errors and
>>>>>> replay_populate() returns errors.
>>>>>
>>>>> A trace would be appropriate imho. Thanks,
>>>>
>>>> Sorry, I can't follow you. What kind of info should be traced? The
>>>> errors returned by replay_populate()?
>>>
>>> Yeah. imho these are useful as we expect this part to work in general
>>> too, right? Thanks,
>>
>> Something like this?
>>
>> diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
>> index 6b3e1ee9d6..4440ac9e59 100644
>> --- a/system/guest-memfd-manager.c
>> +++ b/system/guest-memfd-manager.c
>> @@ -185,8 +185,14 @@ static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, void *arg)
>>  {
>>      struct GuestMemfdReplayData *data = arg;
>>      ReplayRamPopulate replay_fn = data->fn;
>> +    int ret;
>>
>> -    return replay_fn(section, data->opaque);
>> +    ret = replay_fn(section, data->opaque);
>> +    if (ret) {
>> +        trace_guest_memfd_rdm_replay_populated_cb(ret);
>> +    }
>> +
>> +    return ret;
>>  }
>>
>> How about just adding some error output in
>> guest_memfd_for_each_populated_section()/
>> guest_memfd_for_each_discarded_section() if the cb() (i.e.
>> replay_populate()) returns an error?
>
> this will do too, yes. Thanks,
>
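For the record, the error output I have in mind is something like the helper
below, which both guest_memfd_for_each_populated_section() and
guest_memfd_for_each_discarded_section() could call instead of invoking cb()
directly. This is only a sketch against the posted series: the wrapper name
and the message are made up, and the callback typedef just mirrors the
signature the series already uses.

#include "qemu/osdep.h"
#include "qemu/error-report.h"
#include "exec/memory.h"

typedef int (*guest_memfd_section_cb)(MemoryRegionSection *s, void *arg);

static int guest_memfd_call_section_cb(MemoryRegionSection *section,
                                       void *arg, guest_memfd_section_cb cb)
{
    int ret = cb(section, arg);

    if (ret) {
        /* make a failing replay_populate()/replay_discard() visible */
        error_report("%s: callback failed on %s (offset 0x%" HWADDR_PRIx
                     ", size 0x%" PRIx64 "): %d",
                     __func__, memory_region_name(section->mr),
                     section->offset_within_region,
                     int128_get64(section->size), ret);
    }

    return ret;
}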