On Mon, Mar 3, 2025 at 1:29 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Mon, Mar 03, 2025 at 01:30:06PM +0000, Nikita Kalyazin wrote:
> > This series is built on top of the v3 write syscall support [1].
> >
> > With James's KVM userfault [2], it is possible to handle stage-2 faults
> > in guest_memfd in userspace. However, KVM itself also triggers faults
> > in guest_memfd in some cases, for example: PV interfaces like kvmclock,
> > PV EOI and page table walking code when fetching the MMIO instruction on
> > x86. It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
> > that KVM would be accessing those pages via userspace page tables. In
> > order for such faults to be handled in userspace, guest_memfd needs to
> > support userfaultfd.
> >
> > This series proposes limited support for userfaultfd in guest_memfd:
> > - userfaultfd support is conditional on `CONFIG_KVM_GMEM_SHARED_MEM`
> >   (as is fault support in general)
> > - Only the `page missing` event is currently supported
> > - Userspace is supposed to respond to the event with the `write`
> >   syscall followed by the `UFFDIO_CONTINUE` ioctl to unblock the
> >   faulting process. Note that we can't use `UFFDIO_COPY` here because
> >   the userfaultfd code does not know how to prepare guest_memfd pages,
> >   e.g. remove them from the direct map [4].
> >
> > Not included in this series:
> > - A proper interface for userfaultfd to recognise guest_memfd mappings
> > - Proper handling of truncation cases after locking the page
> >
> > Request for comments:
> > - Is it a sensible workflow for guest_memfd to resolve a userfault
> >   `page missing` event with `write` syscall + `UFFDIO_CONTINUE`? One
> >   of the alternatives is teaching `UFFDIO_COPY` how to deal with
> >   guest_memfd pages.
>
> Probably not.. I don't see what protects against a thread faulting the
> page concurrently while write() is happening and seeing partial data.
> Since you check the page cache it'll let it pass, but the partial page
> will be faulted in there.

+1 here.
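For reference, the handler-thread side of the proposed workflow looks
roughly like the sketch below. This is not the series' actual code:
uffd/gmem_fd setup is omitted and guest_base/src_page are illustrative
names. It is only meant to make the racy window visible.

```c
/*
 * Sketch of the handler thread in the proposed MISSING workflow (setup
 * of uffd, gmem_fd and guest_base omitted; names are illustrative).
 * The race: once write() has inserted the folio into the page cache, a
 * concurrently faulting thread no longer looks "missing" and can map
 * partially written data before write() completes.
 */
struct uffd_msg msg;
if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
	err(1, "read uffd");               /* wait for the MISSING event */

unsigned long addr = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);

/* Populate the guest_memfd page (UFFDIO_COPY can't prepare gmem pages). */
if (pwrite(gmem_fd, src_page, PAGE_SIZE, addr - guest_base) != PAGE_SIZE)
	err(1, "pwrite gmem");

/* Wake the faulting thread now that the page cache is populated. */
struct uffdio_continue cont = {
	.range = { .start = addr, .len = PAGE_SIZE },
};
if (ioctl(uffd, UFFDIO_CONTINUE, &cont))
	err(1, "UFFDIO_CONTINUE");
```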
I think the simplest way to make it work would be to also check
folio_test_uptodate() in the userfaultfd_missing() check [1]. It would
pair with kvm_gmem_mark_prepared() in the write() path [2]. I'm not sure
if that's the "right" way, but I think it would prevent threads from
reading data as it is being written.

[1]: https://lore.kernel.org/kvm/20250303133011.44095-3-kalyazin@xxxxxxxxxx/
[2]: https://lore.kernel.org/kvm/20250303130838.28812-2-kalyazin@xxxxxxxxxx/

> I think we may need to either go with full MISSING or full MINOR traps.

I agree, and just to clarify: you've basically implemented the MISSING
model, just using write() to resolve userfaults instead of UFFDIO_COPY.
The UFFDIO_CONTINUE implementation you have isn't really doing much;
when the page cache has a page, the fault handler will populate the PTE
for you.

I think it's probably simpler to implement the MINOR model, where
userspace can populate the page cache however it wants; write() is
perfectly fine/natural. UFFDIO_CONTINUE just needs to populate PTEs for
gmem, and the fault handler needs to check for the presence of PTEs. The
`struct vm_fault` you have should contain enough info.

> One thing to mention is we probably need MINOR sooner or later to support
> gmem huge pages. The thing is for huge folios in gmem we can't rely on
> missing in page cache, as we always need to allocate in hugetlb sizes.
>
> > - What is a way forward to make userfaultfd code aware of guest_memfd?
> >   I saw that Patrick hit a somewhat similar problem in [5] when trying
> >   to use direct map manipulation functions in KVM and was pointed by
> >   David at Elliot's guestmem library [6] that might include a shim for
> >   that. Would the library be the right place to expose required
> >   interfaces like `vma_is_gmem`?
>
> Not sure what's best to do, but IIUC the current way this series uses
> may not work as long as one tries to reference a kvm symbol from core
> mm..
>
> One trick I used so far is leveraging vm_ops and providing a hook
> function to report specialties when it's gmem. In general, I did not
> yet dare to overload vm_area_struct, but I'm thinking maybe vm_ops is
> more likely to be accepted. E.g. something like this:
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5e742738240c..b068bb79fdbc 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -653,8 +653,26 @@ struct vm_operations_struct {
>  	 */
>  	struct page *(*find_special_page)(struct vm_area_struct *vma,
>  					  unsigned long addr);
> +	/*
> +	 * When set, return the allowed orders bitmask in faults of mmap()
> +	 * ranges (e.g. for follow up huge_fault() processing). Drivers
> +	 * can use this to bypass THP setups for specific types of VMAs.
> +	 */
> +	unsigned long (*get_supported_orders)(struct vm_area_struct *vma);
>  };
>
> +static inline bool vma_has_supported_orders(struct vm_area_struct *vma)
> +{
> +	return vma->vm_ops && vma->vm_ops->get_supported_orders;
> +}
> +
> +static inline unsigned long vma_get_supported_orders(struct vm_area_struct *vma)
> +{
> +	if (!vma_has_supported_orders(vma))
> +		return 0;
> +	return vma->vm_ops->get_supported_orders(vma);
> +}
> +
>
> In my case I used that to allow gmem to report huge page support on
> faults.
>
> That said, the above only exists in my own tree so far, so I also don't
> know whether something like that could be accepted (even if it'll work
> for you).

I think it might be useful to implement an fs-generic MINOR mode. The
fault handler is already easy enough to do generically (though it would
become more difficult to determine whether the "MINOR" fault is actually
a MISSING fault, but at least for my userspace, the distinction isn't
important. :))

So the question becomes: what should UFFDIO_CONTINUE look like?
And I think it would be nice if UFFDIO_CONTINUE just called
vm_ops->fault() to get the page we want to map and then mapped it,
instead of having shmem-specific and hugetlb-specific versions (though
maybe we need to keep the hugetlb specialization...). That would avoid
putting kvm/gmem/etc. symbols in mm/userfaultfd code.

I've actually wanted to do this for a while but haven't had a good
reason to pursue it. I wonder if it can be done in a
backwards-compatible fashion...
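To make that concrete, a generic UFFDIO_CONTINUE helper might look
something like the sketch below. This is hand-waving over locking,
refcounting and the retry/error cases; the function name is hypothetical
and only mfill_atomic_install_pte() is an existing mm/userfaultfd.c
helper, so treat it as a shape of the idea rather than a patch:

```c
/*
 * Illustrative sketch of an fs-generic UFFDIO_CONTINUE: ask the VMA's
 * fault handler for the page instead of open-coding shmem/hugetlb
 * page-cache lookups. Details (locking, VM_FAULT_RETRY, folio refs)
 * are deliberately omitted.
 */
static int mfill_atomic_pte_continue_generic(pmd_t *dst_pmd,
					     struct vm_area_struct *dst_vma,
					     unsigned long dst_addr,
					     uffd_flags_t flags)
{
	struct vm_fault vmf = {
		.vma = dst_vma,
		.address = dst_addr,
		.pgoff = linear_page_index(dst_vma, dst_addr),
		.flags = FAULT_FLAG_USER,
	};
	vm_fault_t ret;

	/* Let the owning filesystem/driver find (not allocate) the page. */
	ret = dst_vma->vm_ops->fault(&vmf);
	if (ret & VM_FAULT_ERROR)
		return -EFAULT;

	/* Install a PTE for the page the fault handler returned. */
	return mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
					vmf.page, false, flags);
}
```

This keeps mm/userfaultfd free of any gmem knowledge: whatever
vm_ops->fault() a VMA provides is what UFFDIO_CONTINUE uses.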