On Tue, Jul 30, 2024 at 11:10 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Tue, Jul 30, 2024 at 6:11 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Sat, Jul 27, 2024 at 04:45:53AM +0100, Matthew Wilcox wrote:
> > >
> > > Hum. What if we added SLAB_TYPESAFE_BY_RCU to files_cachep? That way
> > > we could do:
> > >
> > >         inode = NULL;
> > >         rcu_read_lock();
> > >         vma = find_vma(mm, address);
> > >         if (!vma)
> > >                 goto unlock;
> > >         file = READ_ONCE(vma->vm_file);
> > >         if (!file)
> > >                 goto unlock;
> > >         inode = file->f_inode;
> > >         if (file != READ_ONCE(vma->vm_file))
> > >                 inode = NULL;
> >
> > remove_vma() does not clear vm_file, nor do I think we ever re-assign
> > this field after it is set on creation.
>
> Quite correct, and even if we cleared vm_file in remove_vma() and/or
> reset it on creation, I don't think that would be enough. IIUC, the
> warning about SLAB_TYPESAFE_BY_RCU here:
> https://elixir.bootlin.com/linux/v6.10.2/source/include/linux/slab.h#L98
> means that the vma object can be reused within the same RCU grace
> period.
>
> > That is, I'm struggling to see what this would do. AFAICT this can
> > still happen:
> >
> >         rcu_read_lock();
> >         vma = find_vma();
> >                                 remove_vma()
> >                                   fput(vma->vm_file);
> >                                                         dup_fd()
> >                                                           newf = kmem_cache_alloc(...)
> >                                                           newf->f_inode = blah
>
> Imagine that the vma got freed and reused at this point. Then
> vma->vm_file might be pointing to a valid but completely different
> file.
>
> >         file = READ_ONCE(vma->vm_file);
> >         inode = file->f_inode;                  // blah
> >         if (file != READ_ONCE(vma->vm_file))    // still match
>
> I think we should also check that the VMA represents the same area
> after we've obtained the inode.
>
> > > unlock:
> > >         rcu_read_unlock();
> > >
> > >         if (inode)
> > >                 return inode;
> > >         mmap_read_lock();
> > >         vma = find_vma(mm, address);
> > >         ...
> > >
> > > I think this would be safe because 'vma' will not be reused while we
> > > hold the read lock, and while 'file' might be reused, whatever
> > > f_inode points to won't be used if vm_file is no longer what it once
> > > was.
>
> Also, we need vaddr_to_offset(), which needs additional serialization
> against vma_lock.

Is there any reason why the approach below won't work? I basically
open-coded the logic of find_active_uprobe(): do find_vma() under the
RCU lock, then fetch vma->vm_file->f_inode. If at any point any check
fails, we fall back to the mmap_read_lock-protected logic; if not, we
just validate that vma->vm_lock_seq didn't change in between all of
this.

Note that we don't need to grab the inode refcount: we already keep
references to inodes in uprobes_tree, so if we find a match (using the
find_uprobe() call), then we are good. As long as that
vma->vm_file->f_inode chain didn't change, of course.

Is vma->vm_lock_seq updated on any change to a VMA? This seems to work
fine in practice, but it would be good to know for sure.
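For intuition, the validation step is in the spirit of a seqcount:
speculatively copy the fields you need, then re-read the sequence
number and discard the copies if it moved. Below is a rough,
self-contained user-space analogue of that "speculate, then validate"
pattern (C11 atomics; struct obj, obj_read() and obj_write() are
made-up names for illustration, not kernel API). The actual patch uses
RCU plus vma->vm_lock_seq rather than a dedicated sequence counter:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct obj {
            atomic_uint seq;          /* even: stable, odd: write in progress */
            unsigned long start, end; /* fields readers speculate on */
    };

    static void obj_write(struct obj *o, unsigned long start, unsigned long end)
    {
            atomic_fetch_add_explicit(&o->seq, 1, memory_order_relaxed); /* -> odd */
            atomic_thread_fence(memory_order_release);
            o->start = start;
            o->end = end;
            atomic_thread_fence(memory_order_release);
            atomic_fetch_add_explicit(&o->seq, 1, memory_order_relaxed); /* -> even */
    }

    static bool obj_read(struct obj *o, unsigned long *start, unsigned long *end)
    {
            unsigned int seq = atomic_load_explicit(&o->seq, memory_order_acquire);

            if (seq & 1)
                    return false;     /* writer active; caller falls back */
            /*
             * Plain loads of concurrently-written fields are formally a
             * data race in C11; the kernel side uses READ_ONCE() for this.
             */
            *start = o->start;
            *end = o->end;
            atomic_thread_fence(memory_order_acquire);
            /* validate: if seq moved, our copies may be torn - discard */
            return seq == atomic_load_explicit(&o->seq, memory_order_relaxed);
    }

    int main(void)
    {
            static struct obj o = { .seq = 0, .start = 0x1000, .end = 0x2000 };
            unsigned long s, e;

            if (obj_read(&o, &s, &e))
                    printf("speculated: [%#lx, %#lx)\n", s, e);
            return 0;
    }

The same shape shows up in the patch: vm_lock_seq is sampled first, all
VMA fields are fetched with READ_ONCE(), and the sample is compared
again only after find_uprobe_rcu() has succeeded.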
Author: Andrii Nakryiko <andrii@xxxxxxxxxx>
Date:   Fri Aug 2 22:16:40 2024 -0700

    uprobes: add speculative lockless VMA to inode resolution

    Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8be9e34e786a..e21b68a39f13 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2251,6 +2251,52 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
 	struct uprobe *uprobe = NULL;
 	struct vm_area_struct *vma;
 
+#ifdef CONFIG_PER_VMA_LOCK
+	vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE, vm_flags;
+	struct file *vm_file;
+	struct inode *vm_inode;
+	unsigned long vm_pgoff, vm_start, vm_end;
+	int vm_lock_seq;
+	loff_t offset;
+
+	rcu_read_lock();
+
+	vma = vma_lookup(mm, bp_vaddr);
+	if (!vma)
+		goto retry_with_lock;
+
+	vm_lock_seq = READ_ONCE(vma->vm_lock_seq);
+
+	vm_file = READ_ONCE(vma->vm_file);
+	vm_flags = READ_ONCE(vma->vm_flags);
+	if (!vm_file || (vm_flags & flags) != VM_MAYEXEC)
+		goto retry_with_lock;
+
+	vm_inode = READ_ONCE(vm_file->f_inode);
+	vm_pgoff = READ_ONCE(vma->vm_pgoff);
+	vm_start = READ_ONCE(vma->vm_start);
+	vm_end = READ_ONCE(vma->vm_end);
+	if (bp_vaddr < vm_start || bp_vaddr >= vm_end)
+		goto retry_with_lock;
+
+	offset = ((loff_t)vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
+	uprobe = find_uprobe_rcu(vm_inode, offset);
+	if (!uprobe)
+		goto retry_with_lock;
+
+	/* now double check that nothing about VMA changed */
+	if (vm_lock_seq != READ_ONCE(vma->vm_lock_seq))
+		goto retry_with_lock;
+
+	/* happy case, we speculated successfully */
+	rcu_read_unlock();
+	return uprobe;
+
+retry_with_lock:
+	rcu_read_unlock();
+	uprobe = NULL;
+#endif
+
 	mmap_read_lock(mm);
 	vma = vma_lookup(mm, bp_vaddr);
 	if (vma) {
diff --git a/kernel/fork.c b/kernel/fork.c
index cc760491f201..211a84ee92b4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -3160,7 +3160,7 @@ void __init proc_caches_init(void)
 			NULL);
 	files_cachep = kmem_cache_create("files_cache",
 			sizeof(struct files_struct), 0,
-			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
+			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT|SLAB_TYPESAFE_BY_RCU,
 			NULL);
 	fs_cachep = kmem_cache_create("fs_cache",
 			sizeof(struct fs_struct), 0,
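One detail worth calling out in the offset calculation above: vm_pgoff
has to be widened to loff_t *before* the shift, otherwise on 32-bit
kernels the shift is done in unsigned long and overflows for file
offsets at or above 4GiB (this mirrors what vaddr_to_offset() does in
uprobes.c). A tiny self-contained demonstration, using uint32_t to
stand in for a 32-bit unsigned long and assuming PAGE_SHIFT == 12:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint32_t vm_pgoff = 0x100000; /* page offset of a 4GiB file position */

            int64_t bad  = (int64_t)(vm_pgoff << 12); /* shifts in 32 bits: wraps to 0 */
            int64_t good = (int64_t)vm_pgoff << 12;   /* widen first: 0x100000000 */

            printf("bad=%#llx good=%#llx\n",
                   (unsigned long long)bad, (unsigned long long)good);
            return 0;
    }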