On Tue, Nov 12, 2024 at 1:12 PM Alice Ryhl <aliceryhl@xxxxxxxxxx> wrote:
> On Mon, Nov 11, 2024 at 5:51 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
> >
> > On Mon, Nov 11, 2024 at 12:41 PM Alice Ryhl <aliceryhl@xxxxxxxxxx> wrote:
> > > Thanks a lot for your review!
> > >
> > > Note that there is a v7 already:
> > > https://lore.kernel.org/all/20241014-vma-v7-0-01e32f861195@xxxxxxxxxx/
> >
> > Yeah, oops, somehow I only realized I was looking at an older version
> > of the series after sending my mail...
> >
> > > On Fri, Nov 8, 2024 at 8:02 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
> > > > On Thu, Oct 10, 2024 at 2:56 PM Alice Ryhl <aliceryhl@xxxxxxxxxx> wrote:
> > > >> These abstractions allow you to manipulate vmas. Rust Binder will
> > > >> use these in a few different ways.
> > > >>
> > > >> In the mmap implementation, a VmAreaNew will be provided to the
> > > >> mmap call which allows it to modify the vma in ways that are only
> > > >> okay during initial setup. This is the case where the most methods
> > > >> are available.
> > > >>
> > > >> However, Rust Binder needs to insert and remove pages from the vma
> > > >> as time passes. When incoming messages arrive, pages may need to be
> > > >> inserted if space is missing, and in this case that is done by
> > > >> using a stashed ARef<Mm> and calling mmget_not_zero followed by
> > > >> mmap_write_lock followed by vma_lookup followed by vm_insert_page.
> > > >> In this case, since mmap_write_lock is used, the VmAreaMut type
> > > >> will be in use.
> > > >
> > > > FYI, the way the C binder implementation uses vma_lookup() and
> > > > vm_insert_page() is not very idiomatic. The standard way of
> > > > implementing binder_alloc_free_page() would be to use something like
> > > > unmap_mapping_range() instead of using
> > > > vma_lookup()+zap_page_range_single(); though to set that up, you'd
> > > > have to create one inode per binder file, maybe with something like
> > > > the DRM subsystem's drm_fs_inode_new(). And instead of having
> > > > binder_install_single_page(), the standard way would be to let
> > > > binder_vm_fault() install PTEs lazily on fault. That way you'd never
> > > > have to take mmap locks or grab MM references yourself.
> > >
> > > Would that actually work? All writes happen from kernel space, so
> > > it's not clear to me that we can trigger the fault handler from
> > > there. We can't use copy_to_user because the write happens from a
> > > different process than the one that owns the vma.
> > >
> > > That said, one alternative is to let Binder store an array of
> > > `struct page` and just write directly to those pages when writing
> > > from kernel space. Then binder_vm_fault() can look up the relevant
> > > page from the array when userspace attempts to read the data.
> >
> > Right, that's what I was thinking of - keep allocating pages at the
> > same point in time that binder currently does, only defer mapping
> > them into userspace.
> >
> > > Though we could also just vm_insert_page() right before returning
> > > the address to userspace, since we know that userspace will read it
> > > right away after the syscall returns.
> >
> > I think that is basically what binder does now?
>
> Right now binder calls vm_insert_page() right after calling
> alloc_page(), which means that vm_insert_page() happens in the process
> sending the message. But we could delay the call so that it happens in
> the process that receives the message instead (which is also the
> process that owns the mapping).

Ah, I see. I don't think that would make much of a difference in terms
of the complexity of MM locking, except that you'd save an
mmget_not_zero()...
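Concretely, the lazy-fault variant I was describing would look roughly
like this (untested sketch; binder_alloc_page_at() is a made-up helper
standing in for binder's existing internal page tracking):

static vm_fault_t binder_vm_fault(struct vm_fault *vmf)
{
        struct binder_proc *proc = vmf->vma->vm_private_data;
        struct page *page;

        /*
         * binder_alloc_page_at() is hypothetical: it returns the page
         * that binder already allocated for this buffer offset, or
         * NULL if there is none.
         */
        page = binder_alloc_page_at(&proc->alloc, vmf->pgoff);
        if (!page)
                return VM_FAULT_SIGBUS;

        /* Requires that the VMA was marked VM_MIXEDMAP at mmap time. */
        return vmf_insert_page(vmf->vma, vmf->address, page);
}

With something like that, the receiving process faults the page in on
its first access, while kernel-side writes keep going through the
stashed `struct page` array.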

> > > > Let me know if you think it would be helpful to see a prototype of
> > > > that in C - I think I can cobble that together, but doing it nicely
> > > > will require some work to convert at least some of the binder_alloc
> > > > locking to mutexes, and some more work to switch the ->f_mapping of
> > > > the binder file or something like that. (I guess modeling that in
> > > > Rust might be a bit of a pain though...)
> > >
> > > It would be useful to hear about what the advantages of
> > > unmap_mapping_range() are. I don't need a full prototype, I should be
> > > able to understand given a rough description of what the required
> > > changes are.
> >
> > The nice part is that basically, if you have a pointer to the file or
> > the inode, you can just do something like the following to zap a PTE:
> >
> > unmap_mapping_range(file->f_mapping, index, 1, 1);
> >
> > You don't have to take any locks yourself to make that work, you
> > don't even have to hold a reference to the mm_struct or anything like
> > that, and you don't need any of that logic you currently have in the
> > C binder driver to look up the VMA by address and revalidate it is
> > still the VMA you expect. The MM subsystem will automatically iterate
> > through all VMAs that overlap the specified range of the file's
> > address_space, and it will zap PTEs in the specified range (and it
> > even works fine if the VMAs have been moved or split or exist in
> > multiple processes or stuff like that). It's a similar story on the
> > allocation path. The only locks you need to explicitly take are
> > whatever locks the driver internally requires.
> >
> > Going through the fault handler and/or the address_space for
> > installing/removing PTEs, instead of using vma_lookup(), is also
> > safer because it implicitly ensures you can only operate on your own
> > driver's VMAs. From a glance at the Rust binder RFC
> > (https://lore.kernel.org/all/20231101-rust-binder-v1-19-08ba9197f637@xxxxxxxxxx/),
> > it looks like use_page_slow() just looks up the VMA at the expected
> > address and calls insert_page() on it. I don't immediately see
> > anything in the Rust Binder RFC that would prevent calling
> > insert_page() on a newly created VMA of a different type that has
> > appeared at the same address - which could cause the page to
> > inadvertently become writable by userspace (if the new VMA is
> > writable), could cause refcounted pages to be installed in VM_PFNMAP
> > regions that are supposed to only contain non-refcounted entries,
> > could probably cause type confusion when trying to install 4K pages
> > in hugetlb regions that can't contain 4K pages, and so on.
>
> Right ... I guess I'm missing an equivalent to binder_vma_close to
> ensure that once the vma is closed, we don't try to insert pages.

Yeah, I think that would work. (I think there currently is no way to
shrink a VMA without first splitting it, so you should see ->open() and
->close() invocations when that happens.)
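For reference, the lifetime tracking C binder uses looks roughly like
this (sketch; binder_alloc_set_vma_dead() is a made-up name for what
the real binder_alloc_vma_close() does, i.e. record that the mapping is
gone so later message handling refuses to insert pages):

static void binder_vma_open(struct vm_area_struct *vma)
{
        /* C binder only does debug logging here. */
}

static void binder_vma_close(struct vm_area_struct *vma)
{
        struct binder_proc *proc = vma->vm_private_data;

        /*
         * Hypothetical helper: mark the mapping dead so that any later
         * attempt to vm_insert_page() into this range bails out.
         */
        binder_alloc_set_vma_dead(&proc->alloc);
}

static const struct vm_operations_struct binder_vm_ops = {
        .open  = binder_vma_open,
        .close = binder_vma_close,
};

Since a VMA can't shrink without splitting, pairing every split/unmap
with these callbacks is what makes the "is the mapping still alive"
flag trustworthy.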

> I gave a suggestion to enforce that vm_insert_page is only called on
> MIXEDMAP vmas in my previous email. I guess that would prevent the
> type confusion you mention, but it still seems dangerous ... you risk
> that some other driver is storing special data in the private data of
> pages in the new vma, and if you insert a random page there, there
> could maybe be type confusion on the private data in the `struct
> page`?

Hmm, yeah, maybe...

> > Though I just realized, you're only doing this in the shrinker
> > callback where you're not supposed to sleep, but
> > unmap_mapping_range() requires sleeping locks. So I guess directly
> > using unmap_mapping_range() wouldn't work so well. I guess one way
> > to address that could be to add a trylock version of
> > unmap_mapping_range().
> >
> > Another more fancy solution to that would be to stop using shrinkers
> > entirely, and instead make binder pages show up as clean file pages
> > on the kernel's page allocation LRU lists, and let the kernel take
> > care of removing page mappings - basically similar to how reclaim
> > works for normal file pages. I'm a bit fuzzy on how this area of the
> > kernel works exactly; one option might be to do something similar to
> > aio_private_file(), creating a new anonymous inode - but unlike in
> > AIO, you'd then install that anonymous inode's address_space as the
> > i_mapping of the existing file in binder_open(), rather than
> > creating a new file. Then you could pin the inode's pages into a
> > page pointer array in the kernel (sort of like in aio_setup_ring(),
> > except you'd only do it on demand in binder_install_buffer_pages()),
> > and then have a "release_folio" operation in your
> > address_space_operations that drops your page reference if the page
> > is currently unused. At that point, the driver wouldn't really be
> > participating in creating or removing userspace PTEs at all, the
> > kernel would mostly manage it for you, except that you'd have to
> > tell the kernel when it is okay to reclaim pages, or something like
> > that.
>
> Whether it's okay to reclaim a given page can flip-flop very quickly
> under some workflows. Changing that setting has to be a pretty fast
> operation.

I think one fast way to do that would be to track this internally in
binder (as is the case now), and then have address_space_operations
callbacks that the kernel invokes when it wants to know whether a page
can be discarded or not.

> > (I think in the Linux kernel, you might theoretically be able to
> > cause memory safety issues by deadlocking in particular ways - if
> > you are running on a CONFIG_PREEMPT kernel without panic_on_warn
> > set, and you're in a non-preemptible context because you're holding
> > a spinlock or such, and then you sleep because you try to wait on a
> > mutex or kmalloc() with GFP_KERNEL, the scheduler will print a
> > "scheduling while atomic" error message and then try to recover from
> > that situation by resetting the preempt_count and scheduling anyway.
> > I think that theoretically breaks the RCU read-side critical
> > sections normally implied by spinlocks once you return to the broken
> > context, though IIRC this situation will get fixed up at the next
> > syscall return or something along those lines. I haven't tried this
> > though.)
>
> Sleeping in atomic context is a known area of concern. We're working
> on it. For now, just assume that it's one of the allowed bad things.
> Eventually we would like to handle it properly with this tool:
> https://www.memorysafety.org/blog/gary-guo-klint-rust-tools/

Ah, thanks for the pointer.
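To sketch what such a callback might look like (very rough and
untested, glossing over the page-reference bookkeeping; it assumes the
binder state is stashed in the anon inode's i_private, and
binder_page_is_unused() is an invented stand-in for binder's existing
fast internal free/used tracking):

static bool binder_release_folio(struct folio *folio, gfp_t gfp)
{
        struct binder_alloc *alloc = folio->mapping->host->i_private;

        /*
         * Invented helper: consult binder's own tracking of which
         * buffer pages are currently unused.
         */
        if (!binder_page_is_unused(alloc, folio->index))
                return false;   /* still in use, don't reclaim */

        return true;            /* kernel may detach and free this page */
}

static const struct address_space_operations binder_aops = {
        .release_folio = binder_release_folio,
};

The fast-path flip-flopping then stays entirely inside binder's own
bookkeeping; the kernel only consults it when reclaim actually looks at
one of these pages.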
> > > >> + /// > > > >> + /// This operation does not take ownership of the page. > > > >> + #[inline] > > > >> + pub fn vm_insert_page(&self, address: usize, page: &Page) -> Result { > > > >> + // SAFETY: By the type invariants, the caller holds the mmap write lock, so this access is > > > >> + // not a data race. The page is guaranteed to be valid and of order 0. The range of > > > >> + // `address` is already checked by `vm_insert_page`. > > > >> + to_result(unsafe { bindings::vm_insert_page(self.as_ptr(), address as _, page.as_ptr()) }) > > > > > > > > vm_insert_page() has a kinda weird contract because there are two > > > > contexts from which you can call it cleanly: > > > > > > > > 1. You can call it on a VmAreaRef (just like zap_page_range_single(), > > > > you only need to hold an mmap read lock or VMA read lock, no write > > > > lock is required); however, you must ensure that the VMA is already > > > > marked VM_MIXEDMAP. This is the API contract under which you'd call > > > > this from a fault handler. > > > > > > > > 2. You can call it on a VmAreaNew (and it will take care of setting > > > > VM_MIXEDMAP for you). > > > > > > > > I think nothing would immediately crash if you called vm_insert_page() > > > > on a VMA that does not yet have VM_MIXEDMAP while holding the mmap > > > > lock in write mode; but that would permit weird scenarios where you > > > > could, for example, have a userfaultfd context associated with a VMA > > > > which becomes VM_MIXEDMAP, while normally you can't attach userfaultfd > > > > contexts to VM_MIXEDMAP VMAs. I don't think if that actually leads to > > > > anything bad, but it would be a weird state that probably shouldn't be > > > > permitted. > > > > > > > > There are also safety requirements for the page being installed, but I > > > > guess the checks that vm_insert_page() already does via > > > > validate_page_before_insert() might be enough to make this safe... > > > > > > One way to handle this is to make an VmAreaRef::check_mixedmap that > > > returns a VmAreaMixedMapRef after checking the flag. That type can then > > > have a vm_insert_page method. > > > > Sounds reasonable. > > > > > As for VmAreaNew, I'm not sure we should have it there. If we're not > > > careful, it would be a way to set VM_MIXEDMAP on something that already > > > has the VM_PFNMAP flag. We can probably just tell you to set VM_MIXEDMAP > > > directly and then go through the method on VmAreaRef. > > > > Makes sense. > > > > I guess one tricky part is that it might be bad if you could > > Seems like this sentence is incomplete? Whoops, guess I got distracted while writing this... I guess it could be bad if you could install page mappings before changing the VMA flags in a way that makes the already-installed page mappings invalid. But as long as you don't change the VM_READ/VM_WRITE/VM_EXEC flags, and you never clear VM_MIXEDMAP/VM_PFNMAP, I think that probably can't happen, so that should be fine...