On Tue, Nov 12, 2024 at 1:12 PM Alice Ryhl <aliceryhl@xxxxxxxxxx> wrote:
> On Mon, Nov 11, 2024 at 5:51 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
> >
> > On Mon, Nov 11, 2024 at 12:41 PM Alice Ryhl <aliceryhl@xxxxxxxxxx> wrote:
> > > Thanks a lot for your review!
> > >
> > > Note that there is a v7 already:
> > > https://lore.kernel.org/all/20241014-vma-v7-0-01e32f861195@xxxxxxxxxx/
> >
> > Yeah, oops, somehow I only realized I was looking at an older version
> > of the series after sending my mail...
> >
> > > On Fri, Nov 8, 2024 at 8:02 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
> > > > On Thu, Oct 10, 2024 at 2:56 PM Alice Ryhl <aliceryhl@xxxxxxxxxx> wrote:
> > > >> These abstractions allow you to manipulate vmas. Rust Binder will
> > > >> use these in a few different ways.
> > > >>
> > > >> In the mmap implementation, a VmAreaNew will be provided to the
> > > >> mmap call which allows it to modify the vma in ways that are only
> > > >> okay during initial setup. This is the case where the most methods
> > > >> are available.
> > > >>
> > > >> However, Rust Binder needs to insert and remove pages from the vma
> > > >> as time passes. When incoming messages arrive, pages may need to be
> > > >> inserted if space is missing, and in this case that is done by
> > > >> using a stashed ARef<Mm> and calling mmget_not_zero followed by
> > > >> mmap_write_lock followed by vma_lookup followed by vm_insert_page.
> > > >> In this case, since mmap_write_lock is used, the VmAreaMut type
> > > >> will be in use.
> > > >
> > > > FYI, the way the C binder implementation uses vma_lookup() and
> > > > vm_insert_page() is not very idiomatic. The standard way of
> > > > implementing binder_alloc_free_page() would be to use something like
> > > > unmap_mapping_range() instead of using
> > > > vma_lookup()+zap_page_range_single(); though to set that up, you'd
> > > > have to create one inode per binder file, maybe with something like
> > > > the DRM subsystem's drm_fs_inode_new(). And instead of having
> > > > binder_install_single_page(), the standard way would be to let
> > > > binder_vm_fault() install PTEs lazily on fault. That way you'd never
> > > > have to take mmap locks or grab MM references yourself.
> > >
> > > Would that actually work? All writes happen from kernel space, so
> > > it's not clear to me that we can trigger the fault handler from
> > > there. We can't use copy_to_user because the write happens from a
> > > different process than the one that owns the vma.
> > >
> > > That said, one alternative is to let Binder store an array of
> > > `struct page` and just write directly to those pages when writing
> > > from kernel space. Then binder_vm_fault() can look up the relevant
> > > page from the array when userspace attempts to read the data.
> >
> > Right, that's what I was thinking of - keep allocating pages at the
> > same point in time that binder currently does, only defer mapping
> > them into userspace.
> >
> > > Though we could also just vm_insert_page() right before returning
> > > the address to userspace, since we know that userspace will read it
> > > right away after the syscall returns.
> >
> > I think that is basically what binder does now?
>
> Right now binder calls vm_insert_page() right after calling
> alloc_page(), which means that vm_insert_page() happens in the process
> sending the message. But we could delay the call so that it happens in
> the process that receives the message instead (which is also the
> process that owns the mapping).

Ah, I see. I don't think that would make much of a difference in terms
of the complexity of MM locking, except that you'd save an
mmget_not_zero()...
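Concretely, the lazy-fault variant I was describing would look roughly
like this (untested sketch; binder_alloc_page_at() is a made-up helper
standing in for binder's existing internal page tracking):

static vm_fault_t binder_vm_fault(struct vm_fault *vmf)
{
        struct binder_proc *proc = vmf->vma->vm_private_data;
        struct page *page;

        /*
         * binder_alloc_page_at() is hypothetical: it returns the page
         * that binder already allocated for this buffer offset, or
         * NULL if there is none.
         */
        page = binder_alloc_page_at(&proc->alloc, vmf->pgoff);
        if (!page)
                return VM_FAULT_SIGBUS;

        /* Requires that the VMA was marked VM_MIXEDMAP at mmap time. */
        return vmf_insert_page(vmf->vma, vmf->address, page);
}

With something like that, the receiving process faults the page in on
its first access, while kernel-side writes keep going through the
stashed `struct page` array.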

> > > > Let me know if you think it would be helpful to see a prototype of
> > > > that in C - I think I can cobble that together, but doing it nicely
> > > > will require some work to convert at least some of the binder_alloc
> > > > locking to mutexes, and some more work to switch the ->f_mapping of
> > > > the binder file or something like that. (I guess modeling that in
> > > > Rust might be a bit of a pain though...)
> > >
> > > It would be useful to hear about what the advantages of
> > > unmap_mapping_range() are. I don't need a full prototype, I should be
> > > able to understand given a rough description of what the required
> > > changes are.
> >
> > The nice part is that basically, if you have a pointer to the file or
> > the inode, you can just do something like the following to zap a PTE:
> >
> > unmap_mapping_range(file->f_mapping, index, 1, 1);
> >
> > You don't have to take any locks yourself to make that work, you
> > don't even have to hold a reference to the mm_struct or anything like
> > that, and you don't need any of that logic you currently have in the
> > C binder driver to look up the VMA by address and revalidate it is
> > still the VMA you expect. The MM subsystem will automatically iterate
> > through all VMAs that overlap the specified range of the file's
> > address_space, and it will zap PTEs in the specified range (and it
> > even works fine if the VMAs have been moved or split or exist in
> > multiple processes or stuff like that). It's a similar story on the
> > allocation path. The only locks you need to explicitly take are
> > whatever locks the driver internally requires.
> >
> > Going through the fault handler and/or the address_space for
> > installing/removing PTEs, instead of using vma_lookup(), is also
> > safer because it implicitly ensures you can only operate on your own
> > driver's VMAs. From a glance at the Rust binder RFC
> > (https://lore.kernel.org/all/20231101-rust-binder-v1-19-08ba9197f637@xxxxxxxxxx/),
> > it looks like use_page_slow() just looks up the VMA at the expected
> > address and calls insert_page() on it. I don't immediately see
> > anything in the Rust Binder RFC that would prevent calling
> > insert_page() on a newly created VMA of a different type that has
> > appeared at the same address - which could cause the page to
> > inadvertently become writable by userspace (if the new VMA is
> > writable), could cause refcounted pages to be installed in VM_PFNMAP
> > regions that are supposed to only contain non-refcounted entries,
> > could probably cause type confusion when trying to install 4K pages
> > in hugetlb regions that can't contain 4K pages, and so on.
>
> Right ... I guess I'm missing an equivalent to binder_vma_close to
> ensure that once the vma is closed, we don't try to insert pages.

Yeah, I think that would work. (I think there currently is no way to
shrink a VMA without first splitting it, so you should see ->open() and
->close() invocations when that happens.)
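For reference, the lifetime tracking C binder uses looks roughly like
this (sketch; binder_alloc_set_vma_dead() is a made-up name for what
the real binder_alloc_vma_close() does, i.e. record that the mapping is
gone so later message handling refuses to insert pages):

static void binder_vma_open(struct vm_area_struct *vma)
{
        /* C binder only does debug logging here. */
}

static void binder_vma_close(struct vm_area_struct *vma)
{
        struct binder_proc *proc = vma->vm_private_data;

        /*
         * Hypothetical helper: mark the mapping dead so that any later
         * attempt to vm_insert_page() into this range bails out.
         */
        binder_alloc_set_vma_dead(&proc->alloc);
}

static const struct vm_operations_struct binder_vm_ops = {
        .open  = binder_vma_open,
        .close = binder_vma_close,
};

Since a VMA can't shrink without splitting, pairing every split/unmap
with these callbacks is what makes the "is the mapping still alive"
flag trustworthy.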

> I gave a suggestion to enforce that vm_insert_page is only called on
> MIXEDMAP vmas in my previous email. I guess that would prevent the
> type confusion you mention, but it still seems dangerous ... you risk
> that some other driver is storing special data in the private data of
> pages in the new vma, and if you insert a random page there, there
> could maybe be type confusion on the private data in the `struct
> page`?

Hmm, yeah, maybe...

> > Though I just realized, you're only doing this in the shrinker
> > callback where you're not supposed to sleep, but
> > unmap_mapping_range() requires sleeping locks. So I guess directly
> > using unmap_mapping_range() wouldn't work so well. I guess one way
> > to address that could be to add a trylock version of
> > unmap_mapping_range().
> >
> > Another more fancy solution to that would be to stop using shrinkers
> > entirely, and instead make binder pages show up as clean file pages
> > on the kernel's page allocation LRU lists, and let the kernel take
> > care of removing page mappings - basically similar to how reclaim
> > works for normal file pages. I'm a bit fuzzy on how this area of the
> > kernel works exactly; one option might be to do something similar to
> > aio_private_file(), creating a new anonymous inode - but unlike in
> > AIO, you'd then install that anonymous inode's address_space as the
> > i_mapping of the existing file in binder_open(), rather than
> > creating a new file. Then you could pin the inode's pages into a
> > page pointer array in the kernel (sort of like in aio_setup_ring(),
> > except you'd only do it on demand in binder_install_buffer_pages()),
> > and then have a "release_folio" operation in your
> > address_space_operations that drops your page reference if the page
> > is currently unused. At that point, the driver wouldn't really be
> > participating in creating or removing userspace PTEs at all, the
> > kernel would mostly manage it for you, except that you'd have to
> > tell the kernel when it is okay to reclaim pages, or something like
> > that.
>
> Whether it's okay to reclaim a given page can flip-flop very quickly
> under some workflows. Changing that setting has to be a pretty fast
> operation.

I think one fast way to do that would be to track this internally in
binder (as is the case now), and then have address_space_operations
callbacks that the kernel invokes when it wants to know whether a page
can be discarded or not.

> > (I think in the Linux kernel, you might theoretically be able to
> > cause memory safety issues by deadlocking in particular ways - if
> > you are running on a CONFIG_PREEMPT kernel without panic_on_warn
> > set, and you're in a non-preemptible context because you're holding
> > a spinlock or such, and then you sleep because you try to wait on a
> > mutex or kmalloc() with GFP_KERNEL, the scheduler will print a
> > "scheduling while atomic" error message and then try to recover from
> > that situation by resetting the preempt_count and scheduling anyway.
> > I think that theoretically breaks the RCU read-side critical
> > sections normally implied by spinlocks once you return to the broken
> > context, though IIRC this situation will get fixed up at the next
> > syscall return or something along those lines. I haven't tried this
> > though.)
>
> Sleeping in atomic context is a known area of concern. We're working
> on it. For now, just assume that it's one of the allowed bad things.
> Eventually we would like to handle it properly with this tool:
> https://www.memorysafety.org/blog/gary-guo-klint-rust-tools/

Ah, thanks for the pointer.
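To sketch what such a callback might look like (very rough and
untested, glossing over the page-reference bookkeeping; it assumes the
binder state is stashed in the anon inode's i_private, and
binder_page_is_unused() is an invented stand-in for binder's existing
fast internal free/used tracking):

static bool binder_release_folio(struct folio *folio, gfp_t gfp)
{
        struct binder_alloc *alloc = folio->mapping->host->i_private;

        /*
         * Invented helper: consult binder's own tracking of which
         * buffer pages are currently unused.
         */
        if (!binder_page_is_unused(alloc, folio->index))
                return false;   /* still in use, don't reclaim */

        return true;            /* kernel may detach and free this page */
}

static const struct address_space_operations binder_aops = {
        .release_folio = binder_release_folio,
};

The fast-path flip-flopping then stays entirely inside binder's own
bookkeeping; the kernel only consults it when reclaim actually looks at
one of these pages.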
> > > >> + /// > > > >> + /// This operation does not take ownership of the page. > > > >> + #[inline] > > > >> + pub fn vm_insert_page(&self, address: usize, page: &Page) -> Result { > > > >> + // SAFETY: By the type invariants, the caller holds the mmap write lock, so this access is > > > >> + // not a data race. The page is guaranteed to be valid and of order 0. The range of > > > >> + // `address` is already checked by `vm_insert_page`. > > > >> + to_result(unsafe { bindings::vm_insert_page(self.as_ptr(), address as _, page.as_ptr()) }) > > > > > > > > vm_insert_page() has a kinda weird contract because there are two > > > > contexts from which you can call it cleanly: > > > > > > > > 1. You can call it on a VmAreaRef (just like zap_page_range_single(), > > > > you only need to hold an mmap read lock or VMA read lock, no write > > > > lock is required); however, you must ensure that the VMA is already > > > > marked VM_MIXEDMAP. This is the API contract under which you'd call > > > > this from a fault handler. > > > > > > > > 2. You can call it on a VmAreaNew (and it will take care of setting > > > > VM_MIXEDMAP for you). > > > > > > > > I think nothing would immediately crash if you called vm_insert_page() > > > > on a VMA that does not yet have VM_MIXEDMAP while holding the mmap > > > > lock in write mode; but that would permit weird scenarios where you > > > > could, for example, have a userfaultfd context associated with a VMA > > > > which becomes VM_MIXEDMAP, while normally you can't attach userfaultfd > > > > contexts to VM_MIXEDMAP VMAs. I don't think if that actually leads to > > > > anything bad, but it would be a weird state that probably shouldn't be > > > > permitted. > > > > > > > > There are also safety requirements for the page being installed, but I > > > > guess the checks that vm_insert_page() already does via > > > > validate_page_before_insert() might be enough to make this safe... > > > > > > One way to handle this is to make an VmAreaRef::check_mixedmap that > > > returns a VmAreaMixedMapRef after checking the flag. That type can then > > > have a vm_insert_page method. > > > > Sounds reasonable. > > > > > As for VmAreaNew, I'm not sure we should have it there. If we're not > > > careful, it would be a way to set VM_MIXEDMAP on something that already > > > has the VM_PFNMAP flag. We can probably just tell you to set VM_MIXEDMAP > > > directly and then go through the method on VmAreaRef. > > > > Makes sense. > > > > I guess one tricky part is that it might be bad if you could > > Seems like this sentence is incomplete? Whoops, guess I got distracted while writing this... I guess it could be bad if you could install page mappings before changing the VMA flags in a way that makes the already-installed page mappings invalid. But as long as you don't change the VM_READ/VM_WRITE/VM_EXEC flags, and you never clear VM_MIXEDMAP/VM_PFNMAP, I think that probably can't happen, so that should be fine...