I just tested it and write() works fine; AFAICT it uses uaccess as part of the
lib/iov_iter.c code:
generic_perform_write()
-> copy_folio_from_iter_atomic()
-> copy_page_from_iter_atomic()
-> __copy_from_iter()
-> copy_from_user_iter()
-> raw_copy_from_user()
-> copy_user_generic()
-> [uaccess asm]
Ah yes. O_DIRECT is the problematic bit, I suspect, since it will use GUP.
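For anyone following along, here's a minimal userspace sketch of the
distinction (file names and the aligned_alloc() stand-in buffer are made up; in
the real case the buffer would be the perf mapping itself): a buffered write()
copies out of the source buffer via the uaccess path above and simply faults
pages in as needed, whereas an O_DIRECT write has to pin the source pages via
GUP first.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void demo(const void *buf, size_t len)
{
	/* Buffered path: the kernel copies from "buf" with the uaccess
	 * machinery shown above (copy_from_user() and friends) - no GUP
	 * involved. */
	int fd = open("/tmp/buffered.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	(void)write(fd, buf, len);
	close(fd);

	/* O_DIRECT path: the block layer builds a bio from the user pages,
	 * which means pin_user_pages()/GUP on "buf" - this is the part a
	 * VM_PFNMAP mapping refuses. buf/len must be suitably aligned. */
	fd = open("/tmp/direct.out", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
	(void)write(fd, buf, len);
	close(fd);
}

int main(void)
{
	size_t len = 4096;
	/* Stand-in for the mapping under discussion. */
	void *buf = aligned_alloc(4096, len);

	if (!buf)
		return 1;
	memset(buf, 0, len);
	demo(buf, len);
	free(buf);
	return 0;
}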
Ptrace access and friends should also no longer work on these pages; likely
that's tolerable.
Yeah, Peter can interject if not, but I'd be _very_ surprised if anybody expects
to be able to ptrace perf counter mappings in another process (it'd be kind of
insane to do that anyway, since it's a ring buffer that needs special handling).
I think so as well. Disallowing GUP has some side effects, like these pages not
being included in a coredump, etc. ... at least I hope nobody uses O_DIRECT on
them.
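As a concrete illustration of the ptrace side effect, anything going through
access_process_vm() is affected - e.g. a /proc/<pid>/mem read of the remote
mapping. A hypothetical probe (pid and mapping address supplied by hand):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Read one byte of another task's mapping via /proc/<pid>/mem, which goes
 * through access_process_vm() and hence GUP. On a mapping that refuses GUP
 * (such as VM_PFNMAP) this read is expected to fail. */
int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <pid> <hex-addr>\n", argv[0]);
		return 1;
	}

	char path[64];
	snprintf(path, sizeof(path), "/proc/%s/mem", argv[1]);

	int fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	unsigned long addr = strtoul(argv[2], NULL, 16);
	char byte;

	if (pread(fd, &byte, 1, addr) != 1)
		perror("pread");	/* expected to fail once GUP is disallowed */
	else
		printf("read 0x%02x\n", (unsigned char)byte);

	close(fd);
	return 0;
}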
If we can't do pfnmap, and we definitely can't do mixedmap (because it's
essentially equivalent to just faulting in the pages as before and requires the
same hacks), then I will have to go back to the drawing board or somehow change
the faulting code.
This really sucks.
I'm not quite sure I even understand why we don't allow GUP used _just for
pinning_ on VM_PFNMAP when it is -in effect- already pinned, on the assumption
that whatever mapped it will maintain the lifetime.
What a mess...
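(For reference, the refusal is the long-standing check in check_vma_flags() in
mm/gup.c - quoting approximately from memory, so treat it as a paraphrase
rather than the exact current code:)

	/* mm/gup.c, check_vma_flags(): any VM_IO / VM_PFNMAP VMA is rejected
	 * outright, regardless of whether the caller only wants to pin. */
	if (vm_flags & (VM_IO | VM_PFNMAP))
		return -EFAULT;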
Because VM_PFNMAP is dangerous as hell. To get a feeling for that (and also for
why I raised my refcounting comment earlier), just read this recent commit:
commit 79a61cc3fc0466ad2b7b89618a6157785f0293b3
Author: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Wed Sep 11 17:11:23 2024 -0700
mm: avoid leaving partial pfn mappings around in error case
As Jann points out, PFN mappings are special, because unlike normal
memory mappings, there is no lifetime information associated with the
mapping - it is just a raw mapping of PFNs with no reference counting of
a 'struct page'.
I'm _very_ aware of this, having worked extensively on things around this kind
of issue recently (I was a little bit involved with the above one too :), and
this patch explicitly zaps on error so that no partial mapping is left behind.
I agree it's horrible, but we need to have a way of mapping this memory without
having to 'trick' the faulting mechanism into behaving correctly.
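To make the 'zap on error' point concrete, the pattern is roughly the following
- a sketch, not the actual patch, with made-up names (my_mmap, my_pages,
my_nr_pages):

#include <linux/mm.h>

/* Map kernel pages PFN-by-PFN and tear down whatever was already mapped if
 * any step fails, so no partial VM_PFNMAP mapping is left behind. */
static int my_mmap(struct vm_area_struct *vma, struct page **my_pages,
		   unsigned long my_nr_pages)
{
	unsigned long addr = vma->vm_start;
	unsigned long i;
	int err;

	for (i = 0; i < my_nr_pages; i++, addr += PAGE_SIZE) {
		err = remap_pfn_range(vma, addr, page_to_pfn(my_pages[i]),
				      PAGE_SIZE, vma->vm_page_prot);
		if (err) {
			/* Zap the partially established mapping. */
			zap_vma_ptes(vma, vma->vm_start, addr - vma->vm_start);
			return err;
		}
	}
	return 0;
}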
What's completely "surprising" to me is that, if there is no page_mkwrite but
the VMA is writable, we just map the PTE writable ...
I've had a number of surprises on this journey :)
To make sure I understand: do you mean the whole
do_wp_page()
-> wp_page_shared()
-> if (page_mkwrite) { ... } else { wp_page_reuse(); }
-> wp_page_reuse()
-> maybe_mkwrite() [hey VM_WRITE is set, so yes make writable!]
Path?
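(Where the last step is, roughly from memory - the exact form varies a bit
between kernel versions:)

	static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
	{
		/* No page_mkwrite involvement here: VM_WRITE alone decides. */
		if (likely(vma->vm_flags & VM_WRITE))
			pte = pte_mkwrite(pte, vma);
		return pte;
	}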
Yes. I can see how it might be relevant (mprotect(PROT_READ) +
mprotect(READ|WRITE)), but it's a bit counter-intuitive ... to default
to "writable".
[...]
So overall - given all the above and the fact that the alternatives are _even
worse_ (having to spuriously take the folio lock, if that would even work; a
totally unnecessary and semantically incorrect folio-fication; or a complete
rework of the mapping) - while you might be sickened by this use of VM_PFNMAP -
are you ok with the patch, all things considered? :)
It hurts my eyes, and it feels like a step in the wrong direction. But I don't
really see how to do it differently right now.
Can we add a comment/TODO in there noting that using remap_pfn_range() to map a
kernel allocation really is suboptimal, and that we should switch to something
better once we've figured out the details?
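Something along these lines would do (exact wording doesn't matter much):

	/*
	 * TODO: Using remap_pfn_range() to map a kernel allocation is
	 * suboptimal: these are perfectly normal pages, and we only go the
	 * VM_PFNMAP route to keep GUP and the CoW machinery away from them.
	 * Switch to a better-suited mechanism once the details are worked out.
	 */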
BTW, I think the whole reason vm_insert_pages() touches the mapcount is that we
didn't really have a way to identify during unmap that we must not adjust it
either. A page->type will be able to sort that out (while still allowing us to
refcount the pages for now).
Actually, the whole thing is on my todo list, because messing with the RMAP in
vm_insert_pages() doesn't make any sense (whereas the refcount might!). See the
TODO I added in __folio_rmap_sanity_checks().
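For anyone not familiar with it, the interface in question is used roughly like
this from a driver mmap handler - a sketch with made-up names (my_mmap, the
pages coming from file->private_data):

#include <linux/fs.h>
#include <linux/mm.h>

/* Each page inserted this way currently gets its refcount *and* mapcount
 * adjusted, which is the rmap fiddling discussed above. */
static int my_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct page **my_pages = file->private_data;	/* kernel-allocated pages */
	unsigned long my_nr = vma_pages(vma);

	/* On partial failure, my_nr is left holding the not-yet-inserted count. */
	return vm_insert_pages(vma, vma->vm_start, my_pages, &my_nr);
}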
--
Cheers,
David / dhildenb