On Mon, Mar 3, 2025 at 1:29 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Mon, Mar 03, 2025 at 01:30:06PM +0000, Nikita Kalyazin wrote:
> > This series is built on top of the v3 write syscall support [1].
> >
> > With James's KVM userfault [2], it is possible to handle stage-2 faults
> > in guest_memfd in userspace. However, KVM itself also triggers faults
> > in guest_memfd in some cases, for example: PV interfaces like kvmclock,
> > PV EOI and page table walking code when fetching the MMIO instruction on
> > x86. It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
> > that KVM would be accessing those pages via userspace page tables. In
> > order for such faults to be handled in userspace, guest_memfd needs to
> > support userfaultfd.
> >
> > This series proposes limited support for userfaultfd in guest_memfd:
> > - userfaultfd support is conditional on `CONFIG_KVM_GMEM_SHARED_MEM`
> >   (as is fault support in general)
> > - Only the `page missing` event is currently supported
> > - Userspace is supposed to respond to the event with the `write`
> >   syscall followed by the `UFFDIO_CONTINUE` ioctl to unblock the
> >   faulting process. Note that we can't use `UFFDIO_COPY` here because
> >   the userfaultfd code does not know how to prepare guest_memfd pages,
> >   e.g. remove them from the direct map [4].
> >
> > Not included in this series:
> > - A proper interface for userfaultfd to recognise guest_memfd mappings
> > - Proper handling of truncation cases after locking the page
> >
> > Request for comments:
> > - Is it a sensible workflow for guest_memfd to resolve a userfault
> >   `page missing` event with `write` syscall + `UFFDIO_CONTINUE`? One
> >   of the alternatives is teaching `UFFDIO_COPY` how to deal with
> >   guest_memfd pages.
>
> Probably not.. I don't see what protects against a thread faulting the
> page concurrently while write() is happening and seeing partial data.
> Since you check the page cache it'll let it pass, but the partial page
> will be faulted in there.

+1 here.
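For reference, the handler-thread side of the proposed workflow looks
roughly like the sketch below. This is not the series' actual code:
uffd/gmem_fd setup is omitted and guest_base/src_page are illustrative
names. It is only meant to make the racy window visible.

```c
/*
 * Sketch of the handler thread in the proposed MISSING workflow (setup
 * of uffd, gmem_fd and guest_base omitted; names are illustrative).
 * The race: once write() has inserted the folio into the page cache, a
 * concurrently faulting thread no longer looks "missing" and can map
 * partially written data before write() completes.
 */
struct uffd_msg msg;
if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
	err(1, "read uffd");               /* wait for the MISSING event */

unsigned long addr = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);

/* Populate the guest_memfd page (UFFDIO_COPY can't prepare gmem pages). */
if (pwrite(gmem_fd, src_page, PAGE_SIZE, addr - guest_base) != PAGE_SIZE)
	err(1, "pwrite gmem");

/* Wake the faulting thread now that the page cache is populated. */
struct uffdio_continue cont = {
	.range = { .start = addr, .len = PAGE_SIZE },
};
if (ioctl(uffd, UFFDIO_CONTINUE, &cont))
	err(1, "UFFDIO_CONTINUE");
```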
I think the simplest way to make it work would be to also check
folio_test_uptodate() in the userfaultfd_missing() check [1]. It would
pair with kvm_gmem_mark_prepared() in the write() path [2]. I'm not sure
if that's the "right" way, but I think it would prevent threads from
reading data as it is being written.

[1]: https://lore.kernel.org/kvm/20250303133011.44095-3-kalyazin@xxxxxxxxxx/
[2]: https://lore.kernel.org/kvm/20250303130838.28812-2-kalyazin@xxxxxxxxxx/

> I think we may need to either go with full MISSING or full MINOR traps.

I agree, and just to clarify: you've basically implemented the MISSING
model, just using write() to resolve userfaults instead of UFFDIO_COPY.
The UFFDIO_CONTINUE implementation you have isn't really doing much;
when the page cache has a page, the fault handler will populate the PTE
for you.

I think it's probably simpler to implement the MINOR model, where
userspace can populate the page cache however it wants; write() is
perfectly fine/natural. UFFDIO_CONTINUE just needs to populate PTEs for
gmem, and the fault handler needs to check for the presence of PTEs. The
`struct vm_fault` you have should contain enough info.

> One thing to mention is we probably need MINOR sooner or later to support
> gmem huge pages. The thing is for huge folios in gmem we can't rely on
> missing in page cache, as we always need to allocate in hugetlb sizes.
>
> > - What is a way forward to make userfaultfd code aware of guest_memfd?
> >   I saw that Patrick hit a somewhat similar problem in [5] when trying
> >   to use direct map manipulation functions in KVM and was pointed by
> >   David at Elliot's guestmem library [6] that might include a shim for
> >   that. Would the library be the right place to expose required
> >   interfaces like `vma_is_gmem`?
>
> Not sure what's best to do, but IIUC the current way this series uses
> may not work as long as one tries to reference a kvm symbol from core
> mm..
>
> One trick I used so far is leveraging vm_ops and providing a hook
> function to report specialties when it's gmem. In general, I did not
> yet dare to overload vm_area_struct, but I'm thinking maybe vm_ops is
> more likely to be accepted. E.g. something like this:
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5e742738240c..b068bb79fdbc 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -653,8 +653,26 @@ struct vm_operations_struct {
>  	 */
>  	struct page *(*find_special_page)(struct vm_area_struct *vma,
>  					  unsigned long addr);
> +	/*
> +	 * When set, return the allowed orders bitmask in faults of mmap()
> +	 * ranges (e.g. for follow up huge_fault() processing). Drivers
> +	 * can use this to bypass THP setups for specific types of VMAs.
> +	 */
> +	unsigned long (*get_supported_orders)(struct vm_area_struct *vma);
>  };
>
> +static inline bool vma_has_supported_orders(struct vm_area_struct *vma)
> +{
> +	return vma->vm_ops && vma->vm_ops->get_supported_orders;
> +}
> +
> +static inline unsigned long vma_get_supported_orders(struct vm_area_struct *vma)
> +{
> +	if (!vma_has_supported_orders(vma))
> +		return 0;
> +	return vma->vm_ops->get_supported_orders(vma);
> +}
> +
>
> In my case I used that to allow gmem to report huge page support on
> faults.
>
> That said, the above only exists in my own tree so far, so I also don't
> know whether something like that could be accepted (even if it'll work
> for you).

I think it might be useful to implement an fs-generic MINOR mode. The
fault handler is already easy enough to do generically (though it would
become more difficult to determine whether the "MINOR" fault is actually
a MISSING fault, but at least for my userspace, the distinction isn't
important. :))

So the question becomes: what should UFFDIO_CONTINUE look like?
And I think it would be nice if UFFDIO_CONTINUE just called
vm_ops->fault() to get the page we want to map and then mapped it,
instead of having shmem-specific and hugetlb-specific versions (though
maybe we need to keep the hugetlb specialization...). That would avoid
putting kvm/gmem/etc. symbols in mm/userfaultfd code.

I've actually wanted to do this for a while but haven't had a good
reason to pursue it. I wonder if it can be done in a
backwards-compatible fashion...
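To make that concrete, a generic UFFDIO_CONTINUE helper might look
something like the sketch below. This is hand-waving over locking,
refcounting and the retry/error cases; the function name is hypothetical
and only mfill_atomic_install_pte() is an existing mm/userfaultfd.c
helper, so treat it as a shape of the idea rather than a patch:

```c
/*
 * Illustrative sketch of an fs-generic UFFDIO_CONTINUE: ask the VMA's
 * fault handler for the page instead of open-coding shmem/hugetlb
 * page-cache lookups. Details (locking, VM_FAULT_RETRY, folio refs)
 * are deliberately omitted.
 */
static int mfill_atomic_pte_continue_generic(pmd_t *dst_pmd,
					     struct vm_area_struct *dst_vma,
					     unsigned long dst_addr,
					     uffd_flags_t flags)
{
	struct vm_fault vmf = {
		.vma = dst_vma,
		.address = dst_addr,
		.pgoff = linear_page_index(dst_vma, dst_addr),
		.flags = FAULT_FLAG_USER,
	};
	vm_fault_t ret;

	/* Let the owning filesystem/driver find (not allocate) the page. */
	ret = dst_vma->vm_ops->fault(&vmf);
	if (ret & VM_FAULT_ERROR)
		return -EFAULT;

	/* Install a PTE for the page the fault handler returned. */
	return mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
					vmf.page, false, flags);
}
```

This keeps mm/userfaultfd free of any gmem knowledge: whatever
vm_ops->fault() a VMA provides is what UFFDIO_CONTINUE uses.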