On Thu, Jan 30, 2025 at 11:28:11AM -0500, Peter Xu wrote:
> On Sun, Jan 26, 2025 at 11:34:29AM +0800, Xu Yilun wrote:
> > > Definitely not suggesting to install an invalid pointer anywhere. The
> > > mapped pointer will still be valid for gmem for example, but the fault
> > > isn't. We need to differentiate two things: (1) the virtual address
> > > mapping, and (2) permission and accesses on the folios / pages of the
> > > mapping. Here I think it's okay if the host pointer is correctly
> > > mapped.
> > >
> > > For your private MMIO use case, my question is: if there's no host
> > > pointer to be mapped anyway, then what's the benefit of making the MR
> > > ram=on? Can we simply make it a normal IO memory region? The only
> > > benefit of a
> >
> > Guest access to a normal IO memory region would be emulated by QEMU,
> > while private assigned MMIO requires guest direct access via Secure EPT.
> >
> > It seems the existing code doesn't support guest direct access if
> > mr->ram == false:
>
> Ah, it's about this, OK.
>
> I am not sure what's the best approach, but IMHO it's still better we
> stick with the host pointer always being available when ram=on. OTOH,
> VFIO private regions may be able to provide a special mark somewhere,
> just like when romd_mode was done previously as below (qemu
> 235e8982ad39), so that KVM should still apply these MRs even if they're
> not RAM.

Also good to me. (One way such a mark could be consumed is sketched
below, after the quoted commit message.)

> >
> > static void kvm_set_phys_mem(KVMMemoryListener *kml,
> >                              MemoryRegionSection *section, bool add)
> > {
> >     [...]
> >
> >     if (!memory_region_is_ram(mr)) {
> >         if (writable || !kvm_readonly_mem_allowed) {
> >             return;
> >         } else if (!mr->romd_mode) {
> >             /* If the memory device is not in romd_mode, then we actually
> >              * want to remove the kvm memory slot so all accesses will
> >              * trap. */
> >             add = false;
> >         }
> >     }
> >
> >     [...]
> >
> >     /* register the new slot */
> >     do {
> >
> >         [...]
> >
> >         err = kvm_set_user_memory_region(kml, mem, true);
> >     }
> > }
>
> > > ram=on MR is, IMHO, being able to be accessed as RAM-like. If there's
> > > no host pointer at all, I don't yet understand how that helps private
> > > MMIO work.
> >
> > I expect private MMIO to be inaccessible from the host, but accessible
> > from the guest, so it has kvm_userspace_memory_region2 set. That means
> > resolving its PFN during an EPT fault cannot depend on the host
> > pointer.
> >
> > https://lore.kernel.org/all/20250107142719.179636-1-yilun.xu@xxxxxxxxxxxxxxx/
>
> I'll leave this to KVM experts, but I actually didn't follow exactly why
> the mmu notifier is an issue, as I thought that was per-mm anyway, and
> KVM should logically be able to skip all VFIO private MMIO regions if
> affected.

I think this creates a logical inconsistency. You build the private MMIO
EPT mapping at fault time based on the HVA<->HPA mapping, but then don't
follow HVA<->HPA mapping changes. Why should KVM trust the mapping at
fault time but not at mmu-notifier time?

> This is a comment on this part of your commit message:
>
>     Relying on the userspace mapping also means the private MMIO
>     mapping should follow userspace mapping changes via mmu_notifier.
>     This conflicts with the current design that mmu_notifier never
>     impacts private mappings. It also makes no sense to support
>     mmu_notifier just for private MMIO; the private MMIO mapping should
>     be fixed once the CoCo-VM accepts the private MMIO, and any later
>     mapping change without guest permission should be invalid.
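For concreteness, here is a minimal sketch of how such a "special mark"
might be consumed inside kvm_set_phys_mem(), in the same spirit as the
romd_mode special-casing quoted above. memory_region_is_private_mmio()
is hypothetical; it stands in for whatever mark VFIO would eventually
set on private MMIO regions:

    /* Sketch only: memory_region_is_private_mmio() does not exist in
     * QEMU today; it stands in for the proposed "special mark". */
    if (!memory_region_is_ram(mr)) {
        if (memory_region_is_private_mmio(mr)) {
            /* Keep the slot registered so KVM can map the region into
             * the Secure EPT, even though there is no host RAM behind
             * it. */
        } else if (writable || !kvm_readonly_mem_allowed) {
            return;
        } else if (!mr->romd_mode) {
            /* Not in romd_mode: remove the slot so all accesses trap. */
            add = false;
        }
    }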
>
> So I don't yet see a hard-no of reusing userspace mapping even if
> they're not faultable as of now - what if they can be faultable in the
> future? I am not sure..

The first commit of guest_memfd puts a lot of emphasis on the benefit of
decoupling the KVM mapping from the host mapping. My understanding is
that even if guest_memfd becomes faultable later, KVM should still work
without relying on the userspace mapping.

>
> OTOH, I also don't think we need KVM_SET_USER_MEMORY_REGION3 anyway..
> The _REGION2 API is already smart enough to leave some reserved fields:
>
> /* for KVM_SET_USER_MEMORY_REGION2 */
> struct kvm_userspace_memory_region2 {
>         __u32 slot;
>         __u32 flags;
>         __u64 guest_phys_addr;
>         __u64 memory_size;
>         __u64 userspace_addr;
>         __u64 guest_memfd_offset;
>         __u32 guest_memfd;
>         __u32 pad1;
>         __u64 pad2[14];
> };
>
> I think we _could_ reuse some pad*? Reusing the guest_memfd field
> sounds error-prone to me.

It truly is. I'm expecting some suggestions here. (One illustrative
possibility is sketched at the end of this mail.)

Thanks,
Yilun

>
> Not sure whether it would have been easier if it had been fd +
> fd_offset rather than guest_memfd* from the start. But I guess when
> introducing _REGION2 we didn't expect private MMIO regions to come so
> soon..
>
> Thanks,
>
> --
> Peter Xu
>
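For illustration, one way the "reuse some pad*" direction could look
from userspace. KVM_MEM_PRIVATE_MMIO and the pad2[] layout below are
invented purely as a sketch, not an existing or proposed uAPI, and
slot, gpa, size, private_mmio_fd, mmio_offset and vm_fd are assumed
from context:

    #include <linux/kvm.h>
    #include <stdio.h>
    #include <sys/ioctl.h>

    /* Sketch only: KVM_MEM_PRIVATE_MMIO and the pad2[] reuse are
     * hypothetical; the guest_memfd fields keep their current meaning,
     * so existing users are unaffected. */
    struct kvm_userspace_memory_region2 region = {
        .slot            = slot,
        .flags           = KVM_MEM_PRIVATE_MMIO,      /* hypothetical */
        .guest_phys_addr = gpa,
        .memory_size     = size,
        /* pad2[0]/pad2[1] repurposed as private-MMIO fd + offset. */
        .pad2            = { (__u64)private_mmio_fd, mmio_offset },
    };

    if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region) < 0) {
        perror("KVM_SET_USER_MEMORY_REGION2");
    }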