On Tue, Nov 26, 2024 at 9:31 PM Jann Horn <jannh@xxxxxxxxxx> wrote: > On Wed, Nov 20, 2024 at 6:02 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote: > > On Wed, Nov 20, 2024 at 08:20:16AM -0800, Boqun Feng wrote: > > > On Wed, Nov 20, 2024 at 10:10:44AM +0100, Alice Ryhl wrote: > > > > On Wed, Nov 20, 2024 at 5:57 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote: > > > > > We don't have a fully formed destination yet, so I can't give you a > > > > > definite answer to a lot of questions. Obviously I don't want to hold > > > > > up the Rust project in any way, but I need to know that what we're trying > > > > > to do will be expressible in Rust. > > > > > > > > > > Can we avoid referring to a page's refcount? > > > > > > > > I don't think this patch needs the refcount at all, and the previous > > > > version did not expose it. This came out of the advice to use put_page > > > > over free_page. Does this mean that we should switch to put_page but > > > > not use get_page? > > > > Did I advise using put_page() over free_page()? I hope I didn't say > > that. I don't see a reason why binder needs to refcount its pages (nor > > use a mapcount on them), but I don't fully understand binder so maybe > > it does need a refcount. > > I think that was me, at > <https://lore.kernel.org/all/CAG48ez32zWt4mcfA+y2FnzzNmFe-0ns9XQgp=QYeFpRsdiCAnw@xxxxxxxxxxxxxx/>. > Looking at the C binder version, binder_install_single_page() installs > pages into userspace page tables in a VM_MIXEDMAP mapping using > vm_insert_page(), and when you do that with pages from the page > allocator, userspace can grab references to them through GUP-fast (and > I think also through GUP). (See how vm_insert_page() and > vm_get_page_prot() don't use pte_mkspecial(), which is pretty much the > only thing that can stop GUP-fast on most architectures.) > > My understanding is that the combination VM_IO|VM_MIXEDMAP would stop > normal GUP, but currently the only way to block GUP-fast is to use > VM_PFNMAP. (Which, as far as I understand, is also why GPU drivers use > VM_PFNMAP so much.) Maybe we should change that, so that VM_IO and/or > VM_MIXEDMAP blocks GUP in the region and causes installed PTEs to be > marked with pte_mkspecial()? > > I am not entirely sure about this stuff, but I was recently looking at > net/packet/af_packet.c, and I tested that vmsplice() can grab > references to the high-order compound pages that > alloc_one_pg_vec_page() allocates with __get_free_pages(GFP_KERNEL | > __GFP_COMP | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY, order), > packet_mmap() inserts with vm_insert_page(), and free_pg_vec() drops > with free_pages(). (But that all happens to actually work fine, > free_pages() actually handles refcounted compound pages properly.) And also, the C binder driver wants to free pages in its shrinker callback, but those pages might still be mapped into userspace. Binder tries to zap such userspace mappings, but it does that by absolute virtual address instead of going through the rmap (see binder_alloc_free_page()), so it will miss page mappings in VMAs that have been mremap()'d (though legitimate userspace never does that with binder VMAs) or are concurrently being torn down by munmap(); so currently the thing that keeps this from falling apart is that if page mappings are left over somewhere, the page refcount ensures that this userspace-mapped page doesn't get freed. (I think the C binder code does its job, but is not exactly a great model for how to write a clean driver that integrates nicely with the rest of the kernel.)