Re: [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Mar 10, 2025 at 06:12:22PM +0000, Nikita Kalyazin wrote:
> 
> 
> On 05/03/2025 20:29, Peter Xu wrote:
> > On Wed, Mar 05, 2025 at 11:35:27AM -0800, James Houghton wrote:
> > > I think it might be useful to implement an fs-generic MINOR mode. The
> > > fault handler is already easy enough to do generically (though it
> > > would become more difficult to determine if the "MINOR" fault is
> > > actually a MISSING fault, but at least for my userspace, the
> > > distinction isn't important. :)) So the question becomes: what should
> > > UFFDIO_CONTINUE look like?
> > > 
> > > And I think it would be nice if UFFDIO_CONTINUE just called
> > > vm_ops->fault() to get the page we want to map and then mapped it,
> > > instead of having shmem-specific and hugetlb-specific versions (though
> > > maybe we need to keep the hugetlb specialization...). That would avoid
> > > putting kvm/gmem/etc. symbols in mm/userfaultfd code.
> > > 
> > > I've actually wanted to do this for a while but haven't had a good
> > > reason to pursue it. I wonder if it can be done in a
> > > backwards-compatible fashion...
> > 
> > Yes I also thought about that. :)
> 
> Hi Peter, hi James.  Thanks for pointing at the race condition!
> 
> I did some experimentation and it indeed looks possible to call
> vm_ops->fault() from userfault_continue() to make it generic and decouple
> from KVM, at least for non-hugetlb cases.  One thing is we'd need to prevent
> a recursive handle_userfault() invocation, which I believe can be solved by
> adding a new VMF flag to ignore the userfault path when the fault handler is
> called from userfault_continue().  I'm open to a more elegant solution
> though.

It sounds working to me.  Adding fault flag can also be seen as part of
extension of vm_operations_struct ops.  So we could consider reusing
fault() API indeed.

> 
> Regarding usage of the MINOR notification, in what case do you recommend
> sending it?  If following the logic implemented in shmem and hugetlb, ie if
> the page is _present_ in the pagecache, I can't see how it is going to work

It could be confusing when reading that chunk of code, because it looks
like it notifies minor fault when cache hit. But the critical part here is
that we rely on the pgtable missing causing the fault() to trigger first.
So it's more like "cache hit && pgtable missing" for minor fault.

> with the write syscall, as we'd like to know when the page is _missing_ in
> order to respond with the population via the write.  If going against
> shmem/hugetlb logic, and sending the MINOR event when the page is missing
> from the pagecache, how would it solve the race condition problem?

Should be easier we stick with mmap() rather than write().  E.g. for shmem
case of current code base:

	if (folio && vma && userfaultfd_minor(vma)) {
		if (!xa_is_value(folio))
			folio_put(folio);
		*fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
		return 0;
	}

vma is only availble if vmf!=NULL, aka in fault context.  With that, in
write() to shmem inodes, nothing will generate a message, because minor
fault so far is only about pgtable missing.  It needs to be mmap()ed first,
and has nothing yet to do with write() syscalls.

> 
> Also, where would the check for the folio_test_uptodate() mentioned by James
> fit into here?  Would it only be used for fortifying the MINOR (present)
> against the race?
> 
> > When Axel added minor fault, it's not a major concern as it's the only fs
> > that will consume the feature anyway in the do_fault() path - hugetlbfs has
> > its own path to take care of.. even until now.
> > 
> > And there's some valid points too if someone would argue to put it there
> > especially on folio lock - do that in shmem.c can avoid taking folio lock
> > when generating minor fault message.  It might make some difference when
> > the faults are heavy and when folio lock is frequently taken elsewhere too.
> 
> Peter, could you expand on this?  Are you referring to the following
> (shmem_get_folio_gfp)?
> 
> 	if (folio) {
> 		folio_lock(folio);
> 
> 		/* Has the folio been truncated or swapped out? */
> 		if (unlikely(folio->mapping != inode->i_mapping)) {
> 			folio_unlock(folio);
> 			folio_put(folio);
> 			goto repeat;
> 		}
> 		if (sgp == SGP_WRITE)
> 			folio_mark_accessed(folio);
> 		if (folio_test_uptodate(folio))
> 			goto out;
> 		/* fallocated folio */
> 		if (sgp != SGP_READ)
> 			goto clear;
> 		folio_unlock(folio);
> 		folio_put(folio);
> 	}
> 
> Could you explain in what case the lock can be avoided?  AFAIC, the function
> is called by both the shmem fault handler and userfault_continue().

I think you meant the UFFDIO_CONTINUE side of things.  I agree with you, we
always need the folio lock.

What I was saying is the trapping side, where the minor fault message can
be generated without the folio lock now in case of shmem.  It's about
whether we could generalize the trapping side, so handle_mm_fault() can
generate the minor fault message instead of by shmem.c.

If the only concern is "referring to a module symbol from core mm", then
indeed the trapping side should be less of a concern anyway, because the
trapping side (when in the module codes) should always be able to reference
mm functions.

Actually.. if we have a fault() flag introduced above, maybe we can
generalize the trap side altogether without the folio lock overhead.  When
the flag set, if we can always return the folio unlocked (as long as
refcount held), then in UFFDIO_CONTINUE ioctl we can lock it.

> 
> > It might boil down to how many more FSes would support minor fault, and
> > whether we would care about such difference at last to shmem users. If gmem
> > is the only one after existing ones, IIUC there's still option we implement
> > it in gmem code.  After all, I expect the change should be very under
> > control (<20 LOCs?)..
> > 
> > --
> > Peter Xu
> > 
> 

-- 
Peter Xu





[Index of Archives]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Share Photos]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Samba]     [Device Mapper]

  Powered by Linux