On Fri, Nov 29, 2024 at 03:09:49PM +0100, David Hildenbrand wrote: > > > > I just tested it and write() works fine, it uses uaccess afaict as part of the > > > > lib/iov_iter.c code: > > > > > > > > generic_perform_write() > > > > -> copy_folio_from_iter_atomic() > > > > -> copy_page_from_iter_atomic() > > > > -> __copy_from_iter() > > > > -> copy_from_user_iter() > > > > -> raw_copy_from_user() > > > > -> copy_user_generic() > > > > -> [uaccess asm] > > > > > > > > > > Ah yes. O_DIRECT is the problematic bit I suspect, which will use GUP. > > > > > > Ptrace access and friends should also no longer work on these pages, likely > > > that's tolerable. > > > > Yeah Peter can interject if not, but I'd be _very_ surprised if anybody expects > > to be able to ptrace perf counter mappings in another process (it'd be kind of > > totally insane to do that anyway since it's a ring buffer that needs special > > handing anyway). > > I think so as well. Disallowing GUP has some side effects, like not getting > these pages included in a coredump etc ... at least I hope nobody uses > O_DIRECT on them. We set VM_DONTDUMP anyway (set by remap_pfn_range() also) so this part won't be a problem, and I can't see anybody using O_DIRECT on them, sensibly. > > > > > > > > > > > > > > > > > > If we can't do pfnmap, and we definitely can't do mixedmap (because it's > > > > > > basically entirely equivalent in every way to just faulting in the pages as > > > > > > before and requires the same hacks) then I will have to go back to the drawing > > > > > > board or somehow change the faulting code. > > > > > > > > > > > > This really sucks. > > > > > > > > > > > > I'm not quite sure I even understand why we don't allow GUP used _just for > > > > > > pinning_ on VM_PFNMAP when it is -in effect- already pinned on assumption > > > > > > whatever mapped it will maintain the lifetime. > > > > > > > > > > > > What a mess... > > > > > > > > > > Because VM_PFNMAP is dangerous as hell. To get a feeling for that (and also why I > > > > > raised my refcounting comment earlier) just read recent: > > > > > > > > > > commit 79a61cc3fc0466ad2b7b89618a6157785f0293b3 > > > > > Author: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> > > > > > Date: Wed Sep 11 17:11:23 2024 -0700 > > > > > > > > > > mm: avoid leaving partial pfn mappings around in error case > > > > > As Jann points out, PFN mappings are special, because unlike normal > > > > > memory mappings, there is no lifetime information associated with the > > > > > mapping - it is just a raw mapping of PFNs with no reference counting of > > > > > a 'struct page'. > > > > > > > > > > > > > I'm _very_ aware of this, having worked extensively on things around this kind > > > > of issue recently (was a little bit involved with the above one too :), and > > > > explicitly zap on error in this patch to ensure we leave no broken code paths. > > > > > > > > I agree it's horrible, but we need to have a way of mapping this memory without > > > > having to 'trick' the faulting mechanism to behave correctly. > > > > > > What's completely "surprising" to me is, if there is no page_mkwrite, but > > > the VMA is writable, then just map the PTE writable ... > > > > I've had a number of surprises on this journey :) > > > > To make sure I understand what you mean do you mean the whole: > > > > do_wp_page() > > -> wp_page_shared() > > -> if (page_mkwrite) { ... } else { wp_page_reuse(); } > > -> wp_page_reuse() > > -> maybe_mkwrite() [hey VM_WRITE is set, so yes make writable!] > > > > Path? > > Yes. I can see how it might be relevant (mprotect(PROT_READ) + > mprotect(READ|WRITE)), but it's a bit counter-intuitive ... to default to > "writable". Yeah, no this whole thing very much resembles a labyrinth with a minotaur at the end, whose name is VM_PFNMAP :>) > > [...] > > > > > > > > So overall - given all the above and the fact that the alternatives are _even > > worse_ (having to spuriously folio lock if that'll even work, totally > > unnecessary and semantically incorrect folio-fication or a complete rework of > > mapping) - while you might be sicked by this use of the VM_PFNMAP - are you ok > > with the patch, all things considered? :) > > It hurts my eyes, and feels like a step into the wrong direction. But I > don't really see how to do it differently right now. Yeah I agree totally. > > Can we add a comment/TODO in there that using remap_pfn_range this to map a > kernel allocation really is suboptimal and we should switch to something > better once we figured the details out? Sure, let me do a v2 then to make Peter's life easier, can do the nommu fixup in there too. > > > BTW, I think the whole reason vm_insert_pages() touches the mapcount is > because we didn't really have a way to identify during unmap that we must > not adjust it either. A page->type will be able to sort that out (while > still allowing to refcount them for now). Yeah. I think not being able to differentiate between different cases is the problem here... > > Actually, the whole thing is on my todo list, because messing with the RMAP > with vm_insert_pages() doesn't make any sense (whereby the refcount might!). > See the TODO I added in __folio_rmap_sanity_checks(). How long is your TODO list I wonder? :)) I imagine it requires a huge page to map at this point.. > > -- > Cheers, > > David / dhildenb > Cool will respin and send out shortly with an appropriately 'forgive us father for we have sinned' comment. Thanks for taking the time to discuss this! Much appreciated.