Re: [PATCH RFC 4/4] mm: guest_memfd: Add ability for mmap'ing pages

On 09.08.24 00:26, Elliot Berman wrote:
On Thu, Aug 08, 2024 at 11:55:15PM +0200, David Hildenbrand wrote:
On 08.08.24 23:41, Elliot Berman wrote:
On Wed, Aug 07, 2024 at 06:12:00PM +0200, David Hildenbrand wrote:
On 06.08.24 19:14, Elliot Berman wrote:
On Tue, Aug 06, 2024 at 03:51:22PM +0200, David Hildenbrand wrote:
-	if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) {
+	if (!ops->accessible && (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)) {
     		r = guest_memfd_folio_private(folio);
     		if (r)
     			goto out_err;
@@ -107,6 +109,82 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags
     }
     EXPORT_SYMBOL_GPL(guest_memfd_grab_folio);
+int guest_memfd_make_inaccessible(struct file *file, struct folio *folio)
+{
+	unsigned long gmem_flags = (unsigned long)file->private_data;
+	unsigned long i;
+	int r;
+
+	unmap_mapping_folio(folio);
+
+	/**
+	 * We can't use the refcount. It might be elevated due to
+	 * guest/vcpu trying to access same folio as another vcpu
+	 * or because userspace is trying to access folio for same reason

As discussed, that's insufficient. We really have to drive the refcount to 1
-- the single reference we expect.

What is the exact problem you are running into here? Who can just grab a
reference and maybe do nasty things with it?


Right, I remember we had discussed it. The problem I faced was that if two
vcpus fault on the same page, they race to look up the folio in the
filemap, increment its refcount, and then try to lock the folio. One of
the vcpus wins the lock while the other waits; the vcpu that gets the
lock then sees the elevated refcount.
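
Roughly the interleaving, simplified (filemap_get_folio()/folio_lock()
here just stand in for the lookup and lock the fault path actually does):

	/*
	 *   vcpu A                            vcpu B
	 *
	 *   folio = filemap_get_folio(...);   folio = filemap_get_folio(...);
	 *       (takes a reference)               (takes a reference)
	 *   folio_lock(folio);   <- wins      folio_lock(folio);   <- sleeps
	 *
	 * A now sees folio_ref_count(folio) elevated by B's reference,
	 * even though B is only waiting for the lock.
	 */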

I was in the middle of writing an explanation of why I think this is the
best approach when I realized it should be possible to do the
shared->private conversion and actually end up with a single reference.
There would be some cost to walking through the allocated folios and
converting them to private before any vcpu runs. The approach I had gone
with was to do conversions as late as possible.

We certainly have to support conversion while the VCPUs are running.

The VCPUs might be able to avoid grabbing a folio reference for the
conversion and only do the folio_lock(): as long as we have a guarantee that
we will disallow freeing the folio in gmem, for example, by syncing against
FALLOC_FL_PUNCH_HOLE.

So if we can rely on the "gmem" reference to the folio not going away
while we do what we do, we should be fine.
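
Very rough sketch of what I mean -- assuming we can look the folio up
without taking a reference, and that PUNCH_HOLE takes the invalidate lock
exclusively; the conversion step itself is left out:

	filemap_invalidate_lock_shared(mapping);	/* keep PUNCH_HOLE out */
	folio = xa_load(&mapping->i_pages, index);	/* no extra reference taken */
	if (folio) {
		folio_lock(folio);
		/* do the shared -> private conversion; the gmem reference
		 * cannot go away while we hold the invalidate lock */
		folio_unlock(folio);
	}
	filemap_invalidate_unlock_shared(mapping);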

<random thought>

Meanwhile, I was wondering whether we would want to track the references
we hand out to "safe" users differently.

Safe references would only be references that would survive a
private<->shared conversion, like KVM MMU mappings maybe?

KVM would then have to be taught to return these gmem references
differently.

The idea would be to track these "safe" references differently
(page->private?) and only allow dropping *our* guest_memfd reference once
all these "safe" references are gone. That is, FALLOC_FL_PUNCH_HOLE would
also fail if there are any "safe" references remaining.
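
Something like this, maybe (completely made-up helpers; whether the
counter can really live in folio->private is an open question):

	/* taken only by "safe" users, e.g. when KVM maps the folio */
	static void gmem_get_safe_ref(struct folio *folio)
	{
		folio_get(folio);
		atomic_long_inc((atomic_long_t *)&folio->private);
	}

	/* FALLOC_FL_PUNCH_HOLE would fail as long as this returns true */
	static bool gmem_has_safe_refs(struct folio *folio)
	{
		return atomic_long_read((atomic_long_t *)&folio->private) != 0;
	}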

</random thought>


I didn't find a path in filemap where we can grab a folio without
increasing its refcount. I liked the idea of keeping track of a "safe"
refcount, but I believe there is a small window to race when comparing the
main folio refcount and the "safe" refcount.

There are various possible models. To detect unexpected references, we could
either use

folio_ref_count(folio) == gmem_folio_safe_ref_count(folio) + 1

[we increment both ref counters]

or

folio_ref_count(folio) == 1

[we only increment the safe refcount and let other magic handle it as
described]

A vcpu could have incremented the main folio refcount and be on the way
to incrementing the safe refcount. Before that happens, another thread
does the comparison and sees a mismatch.

Likely there won't be a way around coming up with code that is able to deal
with such temporary, "speculative" folio references.

In the simplest case, these references will be obtained from our gmem code
only, and we'll have to detect that it happened and retry (a seqcount would
be a naive solution).
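
Naive sketch of the seqcount idea (gmem_spec_seq and the helpers are
invented here, and serialization of the writer side is omitted):

	static seqcount_t gmem_spec_seq = SEQCNT_ZERO(gmem_spec_seq);

	/* gmem-internal code that briefly grabs a reference */
	static void gmem_peek_folio(struct folio *folio)
	{
		write_seqcount_begin(&gmem_spec_seq);
		folio_get(folio);
		/* ... short-lived access ... */
		folio_put(folio);
		write_seqcount_end(&gmem_spec_seq);
	}

	/* conversion side: only trust a refcount mismatch if no speculative
	 * reference from gmem raced with us; otherwise retry */
	static bool gmem_refcount_stable(struct folio *folio, int expected)
	{
		unsigned int seq;
		bool ok;

		do {
			seq = read_seqcount_begin(&gmem_spec_seq);
			ok = folio_ref_count(folio) == expected;
		} while (!ok && read_seqcount_retry(&gmem_spec_seq, seq));

		return ok;
	}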

In the complex case, these references are temporarily obtained from other
core-mm code -- using folio_try_get(). We can minimize some of them
(speculative references from GUP or the pagecache), and try optimizing
others (PFN walkers like page migration).

But likely we'll need some retry magic, at least initially.


I thought retry magic would not fly. I'll try this out.

Any details why? At least the "other gmem code is currently taking a speculative reference" case should be handleable: these speculative references all happen from gmem code, and that is under our control.

We can protect against some core-mm speculative references (GUP, page cache): after we allocate pages for gmem and an RCU grace period has passed, these can no longer happen from an old context that had these pages allocated before gmem did.

Other folio_try_get() users, like memory offlining or page migration, are more problematic. In general, the assumption is that they will give up quickly, for example when realizing that a folio is neither LRU nor "non-lru movable" -- which is the case for gmem-allocated pages.

Yes, avoiding retry magic would be very much preferred, but as soon as we want to map these folios to user space and have GUP work on them (IOW, we have to make the folio refcount usable), we cannot easily block all speculative references from core-mm, for example by freezing the refcount at 0. Long term, we might find ways to block these speculative references more efficiently / differently.
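
Just to spell out what freezing the refcount would look like
(folio_ref_freeze()/folio_ref_unfreeze() are existing helpers; this only
illustrates the mechanism):

	/* succeeds only if the refcount is exactly 1; while frozen at 0,
	 * folio_try_get() -- and thus any speculative reference -- fails */
	if (folio_ref_freeze(folio, 1)) {
		/* nobody can take a new reference here */
		folio_ref_unfreeze(folio, 1);
	}

That stops being an option once the folio is mapped to user space and GUP
has to be able to take references.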

--
Cheers,

David / dhildenb




