On 29.11.24 14:38, Lorenzo Stoakes wrote:
On Fri, Nov 29, 2024 at 02:24:24PM +0100, David Hildenbrand wrote:
On 29.11.24 14:19, Lorenzo Stoakes wrote:
On Fri, Nov 29, 2024 at 02:12:23PM +0100, David Hildenbrand wrote:
On 29.11.24 14:02, Lorenzo Stoakes wrote:
On Fri, Nov 29, 2024 at 01:59:01PM +0100, David Hildenbrand wrote:
On 29.11.24 13:55, Lorenzo Stoakes wrote:
On Fri, Nov 29, 2024 at 01:45:42PM +0100, David Hildenbrand wrote:
On 29.11.24 13:26, Peter Zijlstra wrote:
On Fri, Nov 29, 2024 at 01:12:57PM +0100, David Hildenbrand wrote:
Well, I think we simply will want vm_insert_pages_prot() that stops treating
these things like folios :) . *likely* we'd want a distinct memdesc/type.
We could start that work right now by making some user (iouring,
ring_buffer) set a new page->_type, and checking that in
vm_insert_pages_prot() + vm_normal_page(). If set, don't touch the refcount
and the mapcount.
Because then, we can just make all the relevant drivers set the type, refuse
in vm_insert_pages_prot() anything that doesn't have the type set, and
refuse in vm_normal_page() any pages with this memdesc.
Maybe we'd have to teach CoW to copy from such pages, maybe not. GUP of
these things will stop working, I hope that is not a problem.
Well... perf-tool likes to call write() upon these pages in order to
write out the data from the mmap() into a file.
I'm confused about what you mean, write() using the fd should work fine, how
would they interact with the mmap? I mean be making a silly mistake here
write() to file from the mmap()'ed address range to *some* file.
Yeah sorry my brain melted down briefly, for some reason was thinking of read()
writing into the buffer...
This will GUP the pages you inserted.
GUP does not work on PFNMAP.
Well it _does_ if struct page **pages is set to NULL :)
Hm? :)
check_vma_flags() unconditionally refuses VM_PFNMAP.
Ha, funny with my name all over git blame there... ok yup missed this, the
vm_normal_page() == NULL stuff must but for mixed map (and those other weird
cases I think you can get0...
Well good. Where is write() invoking GUP? I'm kind of surprised it's not just
using uaccess?
One thing to note is I did run all the perf tests with no issues whatsoever. You
would _think_ this would have come up...
I'm editing some test code to explicitly write() from the buffer anyway to see.
I just tested it and write() works fine, it uses uaccess afaict as part of the
lib/iov_iter.c code:
generic_perform_write()
-> copy_folio_from_iter_atomic()
-> copy_page_from_iter_atomic()
-> __copy_from_iter()
-> copy_from_user_iter()
-> raw_copy_from_user()
-> copy_user_generic()
-> [uaccess asm]
Ah yes. O_DIRECT is the problematic bit I suspect, which will use GUP.
Ptrace access and friends should also no longer work on these pages,
likely that's tolerable.
If we can't do pfnmap, and we definitely can't do mixedmap (because it's
basically entirely equivalent in every way to just faulting in the pages as
before and requires the same hacks) then I will have to go back to the drawing
board or somehow change the faulting code.
This really sucks.
I'm not quite sure I even understand why we don't allow GUP used _just for
pinning_ on VM_PFNMAP when it is -in effect- already pinned on assumption
whatever mapped it will maintain the lifetime.
What a mess...
Because VM_PFNMAP is dangerous as hell. To get a feeling for that (and also why I
raised my refcounting comment earlier) just read recent:
commit 79a61cc3fc0466ad2b7b89618a6157785f0293b3
Author: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Wed Sep 11 17:11:23 2024 -0700
mm: avoid leaving partial pfn mappings around in error case
As Jann points out, PFN mappings are special, because unlike normal
memory mappings, there is no lifetime information associated with the
mapping - it is just a raw mapping of PFNs with no reference counting of
a 'struct page'.
I'm _very_ aware of this, having worked extensively on things around this kind
of issue recently (was a little bit involved with the above one too :), and
explicitly zap on error in this patch to ensure we leave no broken code paths.
I agree it's horrible, but we need to have a way of mapping this memory without
having to 'trick' the faulting mechanism to behave correctly.
What's completely "surprising" to me is, if there is no page_mkwrite,
but the VMA is writable, then just map the PTE writable ...
At least in perf's case, we're safe, because we ref count in open/close of VMA's.
This is a special case due to the R/W, R/O thing.
In reference to that - you said in another email about mapping one part as a
separate R/W VMA and another as a separate R/O VMA, problem is all of the perf
code is set up with its own reference counting mechanism and it's not allowed to
split/merge etc., so we'd have to totally rework all of that to make that work
and correctly refcount things.
It'd be a big task. I don't think that's a reasonable thing to put effort into
at this time...
Also who knows if there's somebody, somewhere who _relies_ somehow on this being
a single mapping...
The main issue here really is that we allow R/O pages in VM_WRITE VMAs,
and want to make write faults fail :(
What an absolute mess, yeah, without some more core changes
vm_insert_pages() cannot be used, unfortunately.
--
Cheers,
David / dhildenb