On Thu, Aug 15, 2024 at 9:47 AM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Thu, Aug 15, 2024 at 6:44 AM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> >
> > On Tue, Aug 13, 2024 at 08:36:03AM -0700, Suren Baghdasaryan wrote:
> > > On Mon, Aug 12, 2024 at 11:18 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> > > >
> > > > On Mon, Aug 12, 2024 at 09:29:17PM -0700, Andrii Nakryiko wrote:
> > > > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > > > > vma->vm_file->f_inode locklessly, but only under rcu_read_lock()
> > > > > protection, attempting the uprobe lookup speculatively.
> > > > >
> > > > > We rely on the newly added mmap_lock_speculation_{start,end}()
> > > > > helpers to validate that the mm_struct stays intact for the entire
> > > > > duration of this speculation. If not, we fall back to the
> > > > > mmap_lock-protected lookup.
> > > > >
> > > > > This allows us to avoid contention on mmap_lock in the absolute
> > > > > majority of cases, nicely improving uprobe/uretprobe scalability.
> > > > >
> > > >
> > > > Here I have to admit to being mostly ignorant about the mm, so bear
> > > > with me. :>
> > > >
> > > > I note the result of find_active_uprobe_speculative is immediately
> > > > stale in the face of modifications.
> > > >
> > > > The thing I'm after is that the mmap_lock_speculation business adds
> > > > overhead on archs where a release fence is not a de facto nop, and I
> > > > don't believe the commit message justifies it. Definitely a bummer
> > > > to add it merely for uprobes. If there are bigger plans concerning
> > > > it, that's a different story of course.
> > > >
> > > > With this in mind I have to ask if instead you could perhaps get
> > > > away with the already present per-vma sequence counter?
> > >
> > > The per-vma sequence counter does not implement acquire/release
> > > logic; it relies on vma->vm_lock for synchronization. So if we want
> > > to use it, we would have to add additional memory barriers here. This
> > > is likely possible, but as I mentioned before, we would need to
> > > ensure the pagefault path does not regress. OTOH mm->mm_lock_seq is
> > > already halfway there (it implements acquire/release logic); we just
> > > had to ensure that mmap_write_lock() increments mm->mm_lock_seq.
> > >
> > > So, from the release fence overhead POV, I think whether we use
> > > mm->mm_lock_seq or vma->vm_lock, we would still need a proper fence
> > > here.
> > >
> >
> > Per my previous e-mail I'm not particularly familiar with mm internals,
> > so I'm going to handwave a little bit with my $0.03 concerning
> > multicore in general, and if you disagree with it that's your business.
> > For the time being I have no interest in digging into any of this.
> >
> > Before I do, to prevent this thread from being a total waste, here are
> > some remarks concerning the patch, with the assumption that the core
> > idea lands.
> >
> > From the commit message:
> > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > > vma->vm_file->f_inode locklessly, but only under rcu_read_lock()
> > > protection, attempting the uprobe lookup speculatively.
> >
> > Just in case, I'll note a nit that this paragraph will need to be
> > removed since the patch adding the flag is getting dropped.
>
> Yep, of course, I'll update all that for the next revision (I'll wait
> for non-RFC patches to land first before reposting).
>
> >
> > A non-nit which may or may not end up mattering is that the flag
> > (which *is* set on the filep slab cache) makes things more difficult
> > to validate. Normal RCU usage guarantees that the object itself won't
> > be freed as long as you follow the rules. However, the
> > SLAB_TYPESAFE_BY_RCU flag weakens this significantly -- the thing at
> > hand will always be a 'struct file', but it may get reallocated to
> > *another* file from under you. Whether this aspect plays a role here I
> > don't know.
>
> Yes, that's ok and is accounted for. We care about that memory not
> going away from under us (I'm not even sure it matters that it is
> still a struct file, tbh; I think that shouldn't matter, as we are
> prepared to deal with completely garbage values read from a struct
> file).

Correct, with SLAB_TYPESAFE_BY_RCU we do need an additional check that
vma->vm_file has not been freed and reused. That's where
mmap_lock_speculation_{start|end} helps us. For vma->vm_file to change
from under us, one would have to take mmap_lock for write. If that
happens, mmap_lock_speculation_{start|end} should detect it and
terminate our speculation.
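To spell out the contract: memory from a SLAB_TYPESAFE_BY_RCU cache is
guaranteed to stay a struct file for as long as we are inside the RCU
read-side section, but not to stay the *same* file. Something along
these lines (an illustrative sketch only; peek_mapped_inode() is a
made-up helper, not part of the patch):

/* Caller must hold rcu_read_lock(). */
static struct inode *peek_mapped_inode(struct vm_area_struct *vma)
{
	struct file *f;

	/*
	 * vma->vm_file may be changed, and even recycled into a different
	 * struct file concurrently, but because files come from a
	 * SLAB_TYPESAFE_BY_RCU cache, the pointed-to memory stays laid
	 * out as a struct file while we are under rcu_read_lock().
	 */
	f = READ_ONCE(vma->vm_file);
	/*
	 * f_inode may therefore belong to an unrelated file, or be NULL.
	 * The value is only good as a lookup key; it must never be
	 * dereferenced, and the whole lookup must be revalidated
	 * afterwards (we do that via mmap_lock_speculation_{start|end}).
	 */
	return f ? READ_ONCE(f->f_inode) : NULL;
}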
> >
> > > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > > +{
> > > +	const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> > > +	struct mm_struct *mm = current->mm;
> > > +	struct uprobe *uprobe;
> > > +	struct vm_area_struct *vma;
> > > +	struct file *vm_file;
> > > +	struct inode *vm_inode;
> > > +	unsigned long vm_pgoff, vm_start;
> > > +	int seq;
> > > +	loff_t offset;
> > > +
> > > +	if (!mmap_lock_speculation_start(mm, &seq))
> > > +		return NULL;
> > > +
> > > +	rcu_read_lock();
> > > +
> >
> > I don't think there is a correctness problem here, but entering rcu
> > *after* deciding to speculatively do the lookup feels backwards.
>
> RCU should protect the VMA and file; mm itself won't go anywhere, so
> this seems ok.
>
> >
> > > +	vma = vma_lookup(mm, bp_vaddr);
> > > +	if (!vma)
> > > +		goto bail;
> > > +
> > > +	vm_file = data_race(vma->vm_file);
> > > +	if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> > > +		goto bail;
> > > +
> >
> > If vma teardown is allowed to progress and the file got fput'ed...
> >
> > > +	vm_inode = data_race(vm_file->f_inode);
> >
> > ... the inode can be NULL; I don't know if that's handled.
> >
>
> Yep, the inode pointer value is part of the RB-tree key, so if it's
> NULL, we just won't find a matching uprobe. Same for any other
> "garbage" f_inode value. Importantly, we should never dereference such
> inode pointers, at least until we find a valid uprobe (in which case
> we keep an inode reference to it).
>
> > More importantly though, per my previous description of
> > SLAB_TYPESAFE_BY_RCU, by now the file could have been reallocated and
> > the inode you did find is completely unrelated.
> >
> > I understand the intent is to backpedal from everything should the mm
> > seqc change, but the above may happen to matter.
>
> Yes, I think we took that into account. All that we care about is
> memory "type safety", i.e., even if struct file's memory is reused,
> it's ok: we'll eventually detect the change and discard the wrong
> uprobe that we might have looked up by accident (though probably in
> most cases we just won't find a uprobe at all).
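And the detection itself is just the usual sequence count pattern.
Roughly (a simplified sketch assuming an odd/even seqcount protocol;
the real helpers may differ in detail):

static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
{
	/* Pairs with the release store on the mmap_lock write side. */
	*seq = smp_load_acquire(&mm->mm_lock_seq);
	/* An odd count means a writer is active; don't speculate. */
	return !(*seq & 1);
}

static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
{
	/* Order all the speculative reads before the re-check below. */
	smp_rmb();
	/* Any write lock taken in between has changed the count. */
	return seq == READ_ONCE(mm->mm_lock_seq);
}

If mmap_write_lock() bumps the count anywhere between the two calls,
the speculated lookup result is discarded and we fall back to taking
mmap_lock.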
> >
> > > +	vm_pgoff = data_race(vma->vm_pgoff);
> > > +	vm_start = data_race(vma->vm_start);
> > > +
> > > +	offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> > > +	uprobe = find_uprobe_rcu(vm_inode, offset);
> > > +	if (!uprobe)
> > > +		goto bail;
> > > +
> > > +	/* now double check that nothing about MM changed */
> > > +	if (!mmap_lock_speculation_end(mm, seq))
> > > +		goto bail;
> >
> > This leaks the reference obtained by find_uprobe_rcu().
>
> find_uprobe_rcu() doesn't obtain a reference; the uprobe is
> RCU-protected, and if the caller needs a refcount bump it will have to
> use try_get_uprobe() (which might fail).
>
> >
> > > +
> > > +	rcu_read_unlock();
> > > +
> > > +	/* happy case, we speculated successfully */
> > > +	return uprobe;
> > > +bail:
> > > +	rcu_read_unlock();
> > > +	return NULL;
> > > +}
> >
> > Now to some handwaving, here it is:
> >
> > The core of my concern is that adding more work to down_write on the
> > mmap semaphore comes with certain side effects, and plausibly more
> > than a sufficient speedup can be achieved without doing it.

AFAIK writers of mmap_lock are not considered a fast path. In a sense,
yes, we made every writer a bit heavier, but OTOH we also made
mm->mm_lock_seq a proper sequence count, which allows us to locklessly
check whether mmap_lock is write-locked. I think you asked whether
there will be other uses for mmap_lock_speculation_{start|end}, and
yes: for example, I am planning to use them for printing
/proc/{pid}/maps without taking mmap_lock (when it's uncontended). If
we had VMA seq counter-based detection, that would be even better (see
below).

> >
> > An mm-wide mechanism is just incredibly coarse-grained, and it may
> > happen to perform poorly when faced with a program which likes to mess
> > with its address space -- the fast path is going to keep failing and
> > only induce *more* overhead as the code decides to down_read the mmap
> > semaphore.
> >
> > Furthermore, there may be work currently synchronized with down_write
> > which perhaps can transition to "merely" down_read, but by the time
> > that happens, this and possibly other consumers will expect a change
> > in the sequence counter, messing with it.
> >
> > To my understanding the kernel supports parallel faults with per-vma
> > locking. I would find it surprising if the same machinery could not be
> > used to sort out uprobe handling above.
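As for the "(see below)" from earlier: I agree that a per-VMA scope
would be preferable. If vma->vm_lock_seq grew the same acquire/release
semantics, the speculation could be confined to the VMA at hand.
Hypothetically (sketch only; these helpers do not exist today):

static bool vma_speculation_start(struct vm_area_struct *vma, int *seq)
{
	/* Would require a release store on the vma write-lock side. */
	*seq = smp_load_acquire(&vma->vm_lock_seq);
	return !(*seq & 1);
}

static bool vma_speculation_end(struct vm_area_struct *vma, int seq)
{
	smp_rmb();
	return seq == READ_ONCE(vma->vm_lock_seq);
}

That would make the fast path immune to unrelated address space churn,
but as noted above, it means adding barriers that the page fault path
currently does not pay for.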