On Wed, Nov 27, 2024 at 4:46 PM Alice Ryhl <aliceryhl@xxxxxxxxxx> wrote: > > On Wed, Nov 27, 2024 at 4:41 PM Jann Horn <jannh@xxxxxxxxxx> wrote: > > > > On Wed, Nov 27, 2024 at 1:01 PM Alice Ryhl <aliceryhl@xxxxxxxxxx> wrote: > > > On Tue, Nov 26, 2024 at 11:10 PM Jann Horn <jannh@xxxxxxxxxx> wrote: > > > > > > > > On Fri, Nov 22, 2024 at 4:41 PM Alice Ryhl <aliceryhl@xxxxxxxxxx> wrote: > > > > > This adds a type called VmAreaRef which is used when referencing a vma > > > > > that you have read access to. Here, read access means that you hold > > > > > either the mmap read lock or the vma read lock (or stronger). > > > > > > > > > > Additionally, a vma_lookup method is added to the mmap read guard, which > > > > > enables you to obtain a &VmAreaRef in safe Rust code. > > > > > > > > > > This patch only provides a way to lock the mmap read lock, but a > > > > > follow-up patch also provides a way to just lock the vma read lock. > > > > > > > > > > Acked-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> (for mm bits) > > > > > Signed-off-by: Alice Ryhl <aliceryhl@xxxxxxxxxx> > > > > > > > > Reviewed-by: Jann Horn <jannh@xxxxxxxxxx> > > > > > > Thanks! > > > > > > > with one comment: > > > > > > > > > + /// Zap pages in the given page range. > > > > > + /// > > > > > + /// This clears page table mappings for the range at the leaf level, leaving all other page > > > > > + /// tables intact, and freeing any memory referenced by the VMA in this range. That is, > > > > > + /// anonymous memory is completely freed, file-backed memory has its reference count on page > > > > > + /// cache folio's dropped, any dirty data will still be written back to disk as usual. > > > > > + #[inline] > > > > > + pub fn zap_page_range_single(&self, address: usize, size: usize) { > > > > > + // SAFETY: By the type invariants, the caller has read access to this VMA, which is > > > > > + // sufficient for this method call. This method has no requirements on the vma flags. Any > > > > > + // value of `address` and `size` is allowed. > > > > > > > > If we really want to allow any address and size, we might want to add > > > > an early bailout in zap_page_range_single(). The comment on top of > > > > zap_page_range_single() currently says "The range must fit into one > > > > VMA", and it looks like by the point we reach a bailout, we could have > > > > gone through an interval tree walk via > > > > mmu_notifier_invalidate_range_start()->__mmu_notifier_invalidate_range_start()->mn_itree_invalidate() > > > > for a range that ends before it starts; I don't know how safe that is. > > > > > > I could change the comment on zap_page_range_single() to say: > > > > > > "The range should be contained within a single VMA. Otherwise an error > > > is returned." > > > > > > And then I can add an overflow check at the top of > > > zap_page_range_single(). Sounds ok? > > > > Yes, I think changing the comment like that and adding a check for > > whether address+size wraps around there addresses this. > > Can there be a page at the top of the address space? If so, I have to > be a bit careful in the wrap-around check, because it should only fail > if the addition wraps around *and* the sum is non-zero. Uh, good question... I can't think of any architectures that have userspace mappings right at the top of the address space, and having objects allocated right at the end of the address space would violate the C standard (because C guarantees that pointers pointing directly behind an array behave reasonably, and this would not work if a pointer pointing directly behind an array could wrap around to NULL). And unless the userspace allocator takes responsibility for enforcing this edge case, the kernel has to do it by preventing the last page of the virtual address space from being mapped. (This is the reason why a 32-bit process on an arm64 kernel is normally only allowed to map addresses up to 0xfffff000, see <https://git.kernel.org/linus/d263119387de>.) Allowing userspace to map the top of the address space would also be a bad idea because then ERR_PTR() could return valid userspace pointers. Looking at the current implementation of zap_page_range_single(), in the case you're describing, unmap_single_vma() would get called with end_addr==0, and then we'd bail out on the "if (end <= vma->vm_start)" check. So calling zap_page_range_single() on such a range would already fail.