Re: [PATCH RFC 4/4] mm: guest_memfd: Add ability for mmap'ing pages


 



On 16.08.24 19:45, Ackerley Tng wrote:

David Hildenbrand <david@xxxxxxxxxx> writes:

On 15.08.24 09:24, Fuad Tabba wrote:
Hi David,

Hi!


On Tue, 6 Aug 2024 at 14:51, David Hildenbrand <david@xxxxxxxxxx> wrote:


-     if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) {
+     if (!ops->accessible && (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)) {
                r = guest_memfd_folio_private(folio);
                if (r)
                        goto out_err;
@@ -107,6 +109,82 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags
    }
    EXPORT_SYMBOL_GPL(guest_memfd_grab_folio);

+int guest_memfd_make_inaccessible(struct file *file, struct folio *folio)
+{
+     unsigned long gmem_flags = (unsigned long)file->private_data;
+     unsigned long i;
+     int r;
+
+     unmap_mapping_folio(folio);
+
+     /**
+      * We can't use the refcount. It might be elevated due to
+      * guest/vcpu trying to access same folio as another vcpu
+      * or because userspace is trying to access folio for same reason

As discussed, that's insufficient. We really have to drive the refcount
to 1 -- the single reference we expect.

What is the exact problem you are running into here? Who can just grab a
reference and maybe do nasty things with it?

I was wondering, why do we need to check the refcount? Isn't it enough
to check for page_mapped() || page_maybe_dma_pinned(), while holding
the folio lock?

Thank you Fuad for asking!


(folio_mapped() + folio_maybe_dma_pinned())

Not everything goes through FOLL_PIN. vmsplice() is an example, or just
some very simple read/write through /proc/pid/mem. Further, some
O_DIRECT implementations still don't use FOLL_PIN.

So if you see an additional folio reference, as soon as you have mapped
that thing to user space, you have to assume that it could be someone
reading/writing that memory in a possibly sane context. (vmsplice() should
be using FOLL_PIN|FOLL_LONGTERM, but that's a longer discussion)
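
The difference between the two checks can be sketched in a small userspace toy model. This is not kernel code: `struct toy_folio`, both helper functions, and the choice of 1 as the expected reference count are assumptions made purely for illustration. It only shows why a reference taken without FOLL_PIN slips past a mapped-or-pinned test but is caught by comparing against the expected refcount.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a folio; NOT the kernel's struct folio. */
struct toy_folio {
	int refcount;	/* all references, however they were taken */
	int mapcount;	/* number of user-space mappings */
	int pincount;	/* references taken via FOLL_PIN only */
};

/* Weaker check: only notices mappings and FOLL_PIN pins. */
static bool toy_mapped_or_pinned(const struct toy_folio *f)
{
	return f->mapcount > 0 || f->pincount > 0;
}

/*
 * Stricter check: the folio must be unmapped AND hold exactly the
 * references we expect (hypothetically 1, the owner's reference).
 */
static bool toy_can_make_inaccessible(const struct toy_folio *f,
				      int expected_refs)
{
	assert(expected_refs >= 1);
	return f->mapcount == 0 && f->refcount == expected_refs;
}
```

With `refcount == 2`, `mapcount == 0` and `pincount == 0` (say, a plain reference left behind by a /proc/pid/mem style access), the weaker check reports nothing suspicious while the stricter one refuses.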


Thanks David for the clarification, this example is very helpful!

IIUC folio_lock() isn't a prerequisite for taking a refcount on the
folio.

Right, to do folio_lock() you only have to guarantee that the folio cannot get freed concurrently. So you piggyback on another reference (one that you hold indirectly).


Even if we are able to figure out a "safe" refcount, and check that the
current refcount == "safe" refcount before removing from direct map,
what's stopping some other part of the kernel from taking a refcount
just after the check happens and causing trouble with the folio's
removal from direct map?

Once the page has been unmapped from user space, and there are no additional references (e.g., GUP, whatever), any new references can only be (or at least should be, unless BUG :) ) temporary speculative references, e.g., from page migration/compaction/offlining. These should not try accessing page content, and should back off if the folio is not deemed interesting or cannot be locked.

Of course, there are some corner cases (kgdb, hibernation, /proc/kcore), but most of these can be dealt with in one way or the other (make these back off and not read/write page content, similar to how we handled it for secretmem).

These (kgdb, /proc/kcore) might not even take a folio reference, they just "access stuff" and we only have to teach them to "not access that".
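
The "temporary speculative reference that backs off" pattern can be sketched as a single-threaded userspace toy. All names here (`struct toy_folio`, `toy_try_get()`, `toy_speculative_visit()`, the `inaccessible` flag) are hypothetical; they are only loosely modeled on the try-get, trylock, put sequence that such scanners follow, and the real code paths differ in detail.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a folio; NOT the kernel's struct folio. */
struct toy_folio {
	int refcount;
	bool locked;
	bool inaccessible;	/* hypothetical "removed from direct map" state */
	int handled;		/* how often a scanner actually worked on it */
};

/* Take a reference only if the folio is not already being freed. */
static bool toy_try_get(struct toy_folio *f)
{
	if (f->refcount == 0)
		return false;
	f->refcount++;
	return true;
}

static void toy_put(struct toy_folio *f)
{
	assert(f->refcount > 0);
	f->refcount--;
}

static bool toy_trylock(struct toy_folio *f)
{
	if (f->locked)
		return false;
	f->locked = true;
	return true;
}

static void toy_unlock(struct toy_folio *f)
{
	f->locked = false;
}

/*
 * A scanner taking a speculative reference. It never touches page
 * content; it backs off if the folio cannot be locked or is not
 * interesting, and always drops its temporary reference again.
 * Returns true only if the folio was actually handled.
 */
static bool toy_speculative_visit(struct toy_folio *f)
{
	bool handled = false;

	if (!toy_try_get(f))
		return false;
	if (toy_trylock(f)) {
		if (!f->inaccessible) {
			f->handled++;
			handled = true;
		}
		toy_unlock(f);
	}
	toy_put(f);
	return handled;
}
```

The point of the sketch is that the refcount is elevated only for the duration of the visit, so a racing "refcount == expected" check may fail transiently but will succeed on a later attempt.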


(noting that also folio_maybe_dma_pinned() can have false positives in
some cases due to speculative references or *many* references).

Are false positives (speculative references) okay since it's better to
be safe than remove from direct map prematurely?

folio_maybe_dma_pinned() is primarily used in fork context. Copying more (if the folio may be pinned and, therefore, must not get COW-shared with other processes and must instead create a private page copy) is the "better safe than sorry". So false positives (that happen rarely) are tolerable.

Regarding the direct map, it would -- just like with additional references -- detect that the page cannot currently be removed from the direct map. It's similarly "better safe than sorry", but here it means that we likely must retry if we cannot easily fall back to something else like for the fork+COW case.
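
Both points can be illustrated with a userspace toy. It assumes the small-folio scheme in which each FOLL_PIN adds a bias of 1024 to the refcount (to my knowledge this mirrors the kernel's GUP_PIN_COUNTING_BIAS, but treat the constant as an assumption), so that many plain references look like a pin. The `toy_*` names, the `in_direct_map` field, and the `-EAGAIN` return value are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumed bias added per pin for small folios. */
#define TOY_PIN_BIAS 1024u

/* Toy stand-in for a folio; NOT the kernel's struct folio. */
struct toy_folio {
	unsigned int refcount;
	bool in_direct_map;
};

static void toy_get(struct toy_folio *f) { f->refcount += 1; }
static void toy_pin(struct toy_folio *f) { f->refcount += TOY_PIN_BIAS; }

/* "Maybe" pinned: cannot tell one pin from TOY_PIN_BIAS plain references. */
static bool toy_maybe_dma_pinned(const struct toy_folio *f)
{
	return f->refcount >= TOY_PIN_BIAS;
}

/*
 * Better safe than sorry: refuse while any unexpected reference
 * exists and ask the caller to retry later (hypothetical contract).
 */
static int toy_make_inaccessible(struct toy_folio *f)
{
	assert(f->refcount >= 1);
	if (f->refcount != 1)
		return -EAGAIN;
	f->in_direct_map = false;
	return 0;
}
```

Unlike the fork+COW case, where a false positive merely costs an extra copy, a caller of something like `toy_make_inaccessible()` has no cheap fallback and would have to retry until the transient references are gone.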

--
Cheers,

David / dhildenb




