Re: [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing

Nikita Kalyazin <kalyazin@xxxxxxxxxx> · Thu, 13 Mar 2025 15:25:16 +0000






On 12/03/2025 19:32, Peter Xu wrote:
On Wed, Mar 12, 2025 at 05:07:25PM +0000, Nikita Kalyazin wrote:
However if MISSING is not registered, the kernel will auto-populate with a
clear page, ie there is no way to inject custom content from userspace.  To
explain my use case a bit more, the population thread will be trying to copy
all guest memory proactively, but there will inevitably be cases where a
page is accessed through pgtables _before_ it gets populated.  It is not
desirable for such access to result in a clear page provided by the kernel.

IMHO populating with a zero page in the page cache is fine. It needs to
make sure all accesses will go via the pgtable, as discussed below in my
previous email [1], then nobody will be able to see the zero page, not
until someone updates the content then follow up with a CONTINUE to install
the pgtable entry.

If there is any way that the page can be accessed without the pgtable
installation, minor faults won't work indeed.

I think I see what you mean now.  I agree, it isn't the end of the world 
if the kernel clears the page and then userspace overwrites it.

The way I see it is:

@@ -400,20 +401,26 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
        if (WARN_ON_ONCE(folio_test_large(folio))) {
                ret = VM_FAULT_SIGBUS;
                goto out_folio;
        }

        if (!folio_test_uptodate(folio)) {
                clear_highpage(folio_page(folio, 0));
                kvm_gmem_mark_prepared(folio);
        }

+       if (userfaultfd_minor(vmf->vma)) {
+               folio_unlock(folio);
+               filemap_invalidate_unlock_shared(inode->i_mapping);
+               return handle_userfault(vmf, VM_UFFD_MISSING);
+       }
+
        vmf->page = folio_file_page(folio, vmf->pgoff);

 out_folio:
        if (ret != VM_FAULT_LOCKED) {
                folio_unlock(folio);
                folio_put(folio);
        }

On the first fault (cache miss), the kernel will allocate/add/clear the 
page (as there is no MISSING trap now), and once the page is in the 
cache, a MINOR event will be sent for userspace to copy its content. 
Please let me know if this is an acceptable semantics.

Since userspace is getting notified after KVM calls 
kvm_gmem_mark_prepared(), which removes the page from the direct map 
[1], userspace can't use write() to populate the content because write() 
relies on direct map [2].  However userspace can do a plain memcpy that 
would use user pagetables instead.  This forces userspace to respond to 
stage-2 and VMA faults in guest_memfd differently, via write() and 
memcpy respectively.  It doesn't seem like a significant problem though.

I believe, with this approach the original race condition is gone 
because UFFD messages are only sent on cache hit and it is up to 
userspace to serialise writes.  Please correct me if I'm wrong here.

[1] 
https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@xxxxxxxxxxxx/T/#mdf41fe2dc33332e9c500febd47e14ae91ad99724
[2] 
https://lore.kernel.org/kvm/20241129123929.64790-1-kalyazin@xxxxxxxxxx/T/#mf5d794aa31d753cbc73e193628f31e418051983d


as long as the content can only be accessed from the pgtable (either via
mmap() or GUP on top of it), then afaiu it could work similarly like
MISSING faults, because anything trying to access it will be trapped.

[1]

--
Peter Xu