On Tue, Jul 30, 2024 at 11:10 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Tue, Jul 30, 2024 at 6:11 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Sat, Jul 27, 2024 at 04:45:53AM +0100, Matthew Wilcox wrote:
> > >
> > > Hum. What if we added SLAB_TYPESAFE_BY_RCU to files_cachep? That way
> > > we could do:
> > >
> > >         inode = NULL;
> > >         rcu_read_lock();
> > >         vma = find_vma(mm, address);
> > >         if (!vma)
> > >                 goto unlock;
> > >         file = READ_ONCE(vma->vm_file);
> > >         if (!file)
> > >                 goto unlock;
> > >         inode = file->f_inode;
> > >         if (file != READ_ONCE(vma->vm_file))
> > >                 inode = NULL;
> >
> > remove_vma() does not clear vm_file, nor do I think we ever re-assign
> > this field after it is set on creation.
>
> Quite correct, and even if we cleared vm_file in remove_vma() and/or
> reset it on creation, I don't think that would be enough. IIUC, the
> warning about SLAB_TYPESAFE_BY_RCU here:
> https://elixir.bootlin.com/linux/v6.10.2/source/include/linux/slab.h#L98
> means that the vma object can be reused within the same RCU grace
> period.
>
> > That is, I'm struggling to see what this would do. AFAICT this can
> > still happen:
> >
> >         rcu_read_lock();
> >         vma = find_vma();
> >                                 remove_vma()
> >                                   fput(vma->vm_file);
> >                                                         dup_fd()
> >                                                           newf = kmem_cache_alloc(...)
> >                                                           newf->f_inode = blah
>
> Imagine that the vma got freed and reused at this point. Then
> vma->vm_file might be pointing to a valid but completely different
> file.
>
> >         file = READ_ONCE(vma->vm_file);
> >         inode = file->f_inode;                  // blah
> >         if (file != READ_ONCE(vma->vm_file))    // still match
>
> I think we should also check that the VMA represents the same area
> after we've obtained the inode.
>
> > > unlock:
> > >         rcu_read_unlock();
> > >
> > >         if (inode)
> > >                 return inode;
> > >         mmap_read_lock();
> > >         vma = find_vma(mm, address);
> > >         ...
> > >
> > > I think this would be safe because 'vma' will not be reused while we
> > > hold the read lock, and while 'file' might be reused, whatever
> > > f_inode points to won't be used if vm_file is no longer what it once
> > > was.
>
> Also, we need vaddr_to_offset(), which needs additional serialization
> against vma_lock.

Is there any reason why the approach below won't work? I basically
open-coded the logic of find_active_uprobe(): do find_vma() under the
RCU lock, then fetch vma->vm_file->f_inode. If at any point any check
fails, we fall back to the mmap_read_lock-protected logic; if not, we
just validate that vma->vm_lock_seq didn't change in between all of
this.

Note that we don't need to grab the inode refcount: we already keep
references to inodes in uprobes_tree, so if we find a match (using the
find_uprobe() call), then we are good. As long as that
vma->vm_file->f_inode chain didn't change, of course.

Is vma->vm_lock_seq updated on any change to a VMA? This seems to work
fine in practice, but it would be good to know for sure.
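For intuition, the validation step is in the spirit of a seqcount:
speculatively copy the fields you need, then re-read the sequence
number and discard the copies if it moved. Below is a rough,
self-contained user-space analogue of that "speculate, then validate"
pattern (C11 atomics; struct obj, obj_read() and obj_write() are
made-up names for illustration, not kernel API). The actual patch uses
RCU plus vma->vm_lock_seq rather than a dedicated sequence counter:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct obj {
            atomic_uint seq;          /* even: stable, odd: write in progress */
            unsigned long start, end; /* fields readers speculate on */
    };

    static void obj_write(struct obj *o, unsigned long start, unsigned long end)
    {
            atomic_fetch_add_explicit(&o->seq, 1, memory_order_relaxed); /* -> odd */
            atomic_thread_fence(memory_order_release);
            o->start = start;
            o->end = end;
            atomic_thread_fence(memory_order_release);
            atomic_fetch_add_explicit(&o->seq, 1, memory_order_relaxed); /* -> even */
    }

    static bool obj_read(struct obj *o, unsigned long *start, unsigned long *end)
    {
            unsigned int seq = atomic_load_explicit(&o->seq, memory_order_acquire);

            if (seq & 1)
                    return false;     /* writer active; caller falls back */
            /*
             * Plain loads of concurrently-written fields are formally a
             * data race in C11; the kernel side uses READ_ONCE() for this.
             */
            *start = o->start;
            *end = o->end;
            atomic_thread_fence(memory_order_acquire);
            /* validate: if seq moved, our copies may be torn - discard */
            return seq == atomic_load_explicit(&o->seq, memory_order_relaxed);
    }

    int main(void)
    {
            static struct obj o = { .seq = 0, .start = 0x1000, .end = 0x2000 };
            unsigned long s, e;

            if (obj_read(&o, &s, &e))
                    printf("speculated: [%#lx, %#lx)\n", s, e);
            return 0;
    }

The same shape shows up in the patch: vm_lock_seq is sampled first, all
VMA fields are fetched with READ_ONCE(), and the sample is compared
again only after find_uprobe_rcu() has succeeded.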
Author: Andrii Nakryiko <andrii@xxxxxxxxxx>
Date:   Fri Aug 2 22:16:40 2024 -0700

    uprobes: add speculative lockless VMA to inode resolution

    Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8be9e34e786a..e21b68a39f13 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2251,6 +2251,52 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
 	struct uprobe *uprobe = NULL;
 	struct vm_area_struct *vma;
 
+#ifdef CONFIG_PER_VMA_LOCK
+	vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE, vm_flags;
+	struct file *vm_file;
+	struct inode *vm_inode;
+	unsigned long vm_pgoff, vm_start, vm_end;
+	int vm_lock_seq;
+	loff_t offset;
+
+	rcu_read_lock();
+
+	vma = vma_lookup(mm, bp_vaddr);
+	if (!vma)
+		goto retry_with_lock;
+
+	vm_lock_seq = READ_ONCE(vma->vm_lock_seq);
+
+	vm_file = READ_ONCE(vma->vm_file);
+	vm_flags = READ_ONCE(vma->vm_flags);
+	if (!vm_file || (vm_flags & flags) != VM_MAYEXEC)
+		goto retry_with_lock;
+
+	vm_inode = READ_ONCE(vm_file->f_inode);
+	vm_pgoff = READ_ONCE(vma->vm_pgoff);
+	vm_start = READ_ONCE(vma->vm_start);
+	vm_end = READ_ONCE(vma->vm_end);
+	if (bp_vaddr < vm_start || bp_vaddr >= vm_end)
+		goto retry_with_lock;
+
+	offset = ((loff_t)vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
+	uprobe = find_uprobe_rcu(vm_inode, offset);
+	if (!uprobe)
+		goto retry_with_lock;
+
+	/* now double check that nothing about VMA changed */
+	if (vm_lock_seq != READ_ONCE(vma->vm_lock_seq))
+		goto retry_with_lock;
+
+	/* happy case, we speculated successfully */
+	rcu_read_unlock();
+	return uprobe;
+
+retry_with_lock:
+	rcu_read_unlock();
+	uprobe = NULL;
+#endif
+
 	mmap_read_lock(mm);
 	vma = vma_lookup(mm, bp_vaddr);
 	if (vma) {
diff --git a/kernel/fork.c b/kernel/fork.c
index cc760491f201..211a84ee92b4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -3160,7 +3160,7 @@ void __init proc_caches_init(void)
 			NULL);
 	files_cachep = kmem_cache_create("files_cache",
 			sizeof(struct files_struct), 0,
-			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
+			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT|SLAB_TYPESAFE_BY_RCU,
 			NULL);
 	fs_cachep = kmem_cache_create("fs_cache",
 			sizeof(struct fs_struct), 0,
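One detail worth calling out in the offset calculation above: vm_pgoff
has to be widened to loff_t *before* the shift, otherwise on 32-bit
kernels the shift is done in unsigned long and overflows for file
offsets at or above 4GiB (this mirrors what vaddr_to_offset() does in
uprobes.c). A tiny self-contained demonstration, using uint32_t to
stand in for a 32-bit unsigned long and assuming PAGE_SHIFT == 12:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint32_t vm_pgoff = 0x100000; /* page offset of a 4GiB file position */

            int64_t bad  = (int64_t)(vm_pgoff << 12); /* shifts in 32 bits: wraps to 0 */
            int64_t good = (int64_t)vm_pgoff << 12;   /* widen first: 0x100000000 */

            printf("bad=%#llx good=%#llx\n",
                   (unsigned long long)bad, (unsigned long long)good);
            return 0;
    }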