On 16.08.24 19:45, Ackerley Tng wrote:
David Hildenbrand <david@xxxxxxxxxx> writes:
On 15.08.24 09:24, Fuad Tabba wrote:
Hi David,
Hi!
On Tue, 6 Aug 2024 at 14:51, David Hildenbrand <david@xxxxxxxxxx> wrote:
- if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) {
+ if (!ops->accessible && (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)) {
r = guest_memfd_folio_private(folio);
if (r)
goto out_err;
@@ -107,6 +109,82 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags
}
EXPORT_SYMBOL_GPL(guest_memfd_grab_folio);
+int guest_memfd_make_inaccessible(struct file *file, struct folio *folio)
+{
+ unsigned long gmem_flags = (unsigned long)file->private_data;
+ unsigned long i;
+ int r;
+
+ unmap_mapping_folio(folio);
+
+ /**
+ * We can't use the refcount. It might be elevated due to
+ * guest/vcpu trying to access same folio as another vcpu
+ * or because userspace is trying to access folio for same reason
As discussed, that's insufficient. We really have to drive the refcount
to 1 -- the single reference we expect.
What is the exact problem you are running into here? Who can just grab a
reference and maybe do nasty things with it?
I was wondering, why do we need to check the refcount? Isn't it enough
to check for page_mapped() || page_maybe_dma_pinned(), while holding
the folio lock?
Thank you Fuad for asking!
(folio_mapped() + folio_maybe_dma_pinned())
Not everything goes through FOLL_PIN. vmsplice() is an example, or just
some very simple read/write through /proc/pid/mem. Further, some
O_DIRECT implementations still don't use FOLL_PIN.
So if you see an additional folio reference, as soon as you mapped that
thing to user space, you have to assume that it could be someone
reading/writing that memory in possibly sane context. (vmsplice() should
be using FOLL_PIN|FOLL_LONGTERM, but that's a longer discussion)
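
To make the difference between the two checks concrete, here is a small userspace model (not kernel code). All model_* names are made up for illustration. MODEL_PIN_BIAS mirrors the kernel's GUP_PIN_COUNTING_BIAS (1024) as used for small folios; large folios track pins separately, which this sketch ignores. A FOLL_PIN-style user adds a whole bias to the refcount, while a plain reference, as in the vmsplice() or /proc/pid/mem cases, adds only one.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; nothing here is kernel API. */
#define MODEL_PIN_BIAS 1024 /* mirrors GUP_PIN_COUNTING_BIAS for small folios */

struct model_folio {
	int refcount; /* all references, pinned or not */
	int mapcount; /* user-space page table mappings */
};

/* A FOLL_PIN-style user adds a whole bias to the refcount. */
static void model_pin(struct model_folio *f)
{
	f->refcount += MODEL_PIN_BIAS;
}

/* A plain reference (no FOLL_PIN) adds just one. */
static void model_get(struct model_folio *f)
{
	f->refcount += 1;
}

static bool model_mapped(const struct model_folio *f)
{
	return f->mapcount > 0;
}

static bool model_maybe_pinned(const struct model_folio *f)
{
	return f->refcount >= MODEL_PIN_BIAS;
}

/* The check from the question: mapped or maybe-pinned means "busy". */
static bool model_busy_by_map_or_pin(const struct model_folio *f)
{
	return model_mapped(f) || model_maybe_pinned(f);
}

/* The stricter check: any reference beyond the expected ones means "busy". */
static bool model_busy_by_refcount(const struct model_folio *f, int expected)
{
	assert(expected > 0);
	return f->refcount != expected;
}
```

Starting from an unmapped folio with refcount 1, a single model_get() leaves model_busy_by_map_or_pin() reporting "not busy" while model_busy_by_refcount(&f, 1) reports "busy". That is the gap described above: the extra reference holder may still read or write the memory.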
Thanks David for the clarification, this example is very helpful!
IIUC folio_lock() isn't a prerequisite for taking a refcount on the
folio.
Right, to do folio_lock() you only have to guarantee that the folio
cannot get freed concurrently. So you piggyback on another reference
(you hold indirectly).
Even if we are able to figure out a "safe" refcount, and check that the
current refcount == "safe" refcount before removing from direct map,
what's stopping some other part of the kernel from taking a refcount
just after the check happens and causing trouble with the folio's
removal from direct map?
Once the page was unmapped from user space, and there were no additional
references (e.g., GUP, whatever), any new references can only be
(should, unless BUG :) ) temporary speculative references that should
not try accessing page content, and that should back off if the folio is
not deemed interesting or cannot be locked. (e.g., page
migration/compaction/offlining).
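
On the "what stops a new reference right after the check" question, a userspace sketch with C11 atomics may help. It models the check and the claim as a single compare-and-swap from the expected count to zero, so a speculative getter that only succeeds on a non-zero refcount backs off afterwards. This resembles what folio_ref_freeze() does in the kernel (used, for example, in migration), but whether guest_memfd would rely on it is an assumption here and is not stated in this thread. All model_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model only; nothing here is kernel API. */
struct model_folio {
	atomic_int refcount;
};

/* Speculative reference: only succeeds while the refcount is non-zero. */
static bool model_try_get(struct model_folio *f)
{
	int old = atomic_load(&f->refcount);

	while (old != 0) {
		/* On failure, old is refreshed with the current value; retry. */
		if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
			return true;
	}
	return false;
}

static void model_put(struct model_folio *f)
{
	int old = atomic_fetch_sub(&f->refcount, 1);

	assert(old > 0);
	(void)old;
}

/* Check and claim in one atomic step: expected -> 0, or fail untouched. */
static bool model_freeze(struct model_folio *f, int expected)
{
	return atomic_compare_exchange_strong(&f->refcount, &expected, 0);
}

/* Release the claim by restoring the reference count. */
static void model_unfreeze(struct model_folio *f, int count)
{
	atomic_store(&f->refcount, count);
}
```

With refcount 1, model_freeze(&f, 1) succeeds and any later model_try_get() fails until model_unfreeze(). With an extra reference held, model_freeze(&f, 1) fails and leaves the count alone, so the caller has to retry later.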
Of course, there are some corner cases (kgdb, hibernation, /proc/kcore),
but most of these can be dealt with in one way or the other (make these
back off and not read/write page content, similar to how we handled it
for secretmem).
These (kgdb, /proc/kcore) might not even take a folio reference, they
just "access stuff" and we only have to teach them to "not access that".
(noting that also folio_maybe_dma_pinned() can have false positives in
some cases due to speculative references or *many* references).
Are false positives (speculative references) okay since it's better to
be safe than remove from direct map prematurely?
folio_maybe_dma_pinned() is primarily used in fork context. Copying more
(if the folio may be pinned and, therefore, must not get COW-shared with
other processes and must instead create a private page copy) is the
"better safe than sorry". So false positives (that happen rarely) are
tolerable.
Regarding the direct map, it would -- just like with additional references
-- detect that the page cannot currently be removed from the direct map.
It's similarly "better safe than sorry", but here means that we likely
must retry if we cannot easily fall back to something else like for the
fork+COW case.
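
The two "better safe than sorry" outcomes can be put side by side in a small userspace model. It is only a sketch: the model_* names and both enums are invented for illustration, and MODEL_PIN_BIAS mirrors the kernel's GUP_PIN_COUNTING_BIAS (1024) for small folios. A false positive in the fork case costs an extra page copy; in the direct map case it means the removal is not done now and has to be retried.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; nothing here is kernel API. */
#define MODEL_PIN_BIAS 1024u /* mirrors GUP_PIN_COUNTING_BIAS for small folios */

struct model_folio {
	unsigned int refcount;
};

enum model_fork_action { MODEL_FORK_SHARE, MODEL_FORK_COPY };
enum model_unmap_action { MODEL_UNMAP_NOW, MODEL_UNMAP_RETRY };

/* Many plain references look the same as a pin: a false positive. */
static bool model_maybe_pinned(const struct model_folio *f)
{
	return f->refcount >= MODEL_PIN_BIAS;
}

/* fork(): when in doubt, copy instead of COW-sharing. */
static enum model_fork_action model_fork_decide(const struct model_folio *f)
{
	return model_maybe_pinned(f) ? MODEL_FORK_COPY : MODEL_FORK_SHARE;
}

/* Direct map removal: when in doubt, do not remove; retry later. */
static enum model_unmap_action
model_unmap_decide(const struct model_folio *f, unsigned int expected)
{
	assert(expected > 0);
	return f->refcount == expected ? MODEL_UNMAP_NOW : MODEL_UNMAP_RETRY;
}
```

A folio with 1024 plain references and no pin makes model_fork_decide() return MODEL_FORK_COPY and model_unmap_decide(&f, 1) return MODEL_UNMAP_RETRY. Both results are conservative, but only the fork case has a cheap alternative to fall back on.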
--
Cheers,
David / dhildenb