Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

* Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> [240507 15:01]:
> On Tue, May 7, 2024 at 11:06 AM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
...
> > >
> > > As for the mmap_read_lock_killable() (is that what we are talking
> > > about?), I'm happy to use anything else that's available; please
> > > give me a pointer. But I suspect that, given how fast and small this
> > > new API is, holding mmap_read_lock_killable() in it is not comparable
> > > to holding it while producing the full /proc/<pid>/maps contents.
> >
> > Yes, mmap_read_lock_killable() is the mmap lock (formerly known as the
> > mmap sem).
> >
> > You can see examples of avoiding the mmap lock by using RCU in
> > mm/memory.c lock_vma_under_rcu(), which is used in the fault path.
> > userfaultfd has an example as well. But again, remember that not all
> > archs have this functionality, so you'd need to fall back to full mmap
> > locking.
> 
> Thanks for the pointer (I didn't see this email when replying on the other thread).
> 
> I looked at lock_vma_under_rcu() quickly, and it seems to be designed
> to find the VMA that covers a given address, but not the next closest
> one. So it's a bit problematic for the API I'm adding, as
> PROCFS_PROCMAP_EXACT_OR_NEXT_VMA (which I can rename to
> COVERING_OR_NEXT_VMA, if necessary) is quite important for the use
> cases we have. But maybe some variation of lock_vma_under_rcu() can be
> added that would fit this case?

Yes, as long as we have the rcu read lock, we can use the same
vma_next() calls you use today.  We will have to be careful not to use
the vma while it's being altered, but per-vma locking should provide
that functionality for you.
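
As a rough, untested sketch of what such a variation might look like
(find_vma_or_next_under_rcu() is a made-up name here; the relevant
headers such as linux/mm.h and linux/maple_tree.h are assumed, and this
assumes CONFIG_PER_VMA_LOCK, with the fallback to the full mmap lock
left to the caller for archs without it):

/*
 * Hedged sketch, not tested: find the VMA covering @addr, or the next
 * VMA after it, without taking the mmap lock.  Loosely modeled on
 * lock_vma_under_rcu() in mm/memory.c; a real version would recheck
 * the VMA's range after locking it, as lock_vma_under_rcu() does.
 */
static struct vm_area_struct *find_vma_or_next_under_rcu(struct mm_struct *mm,
							  unsigned long addr)
{
	MA_STATE(mas, &mm->mm_mt, addr, addr);
	struct vm_area_struct *vma;

	rcu_read_lock();
	/* mas_find() returns the entry at @addr or the first one after it */
	vma = mas_find(&mas, ULONG_MAX);
	if (vma && !vma_start_read(vma))
		vma = NULL;	/* being modified; caller falls back to mmap_read_lock() */
	rcu_read_unlock();

	/* on success the VMA is read-locked; caller ends with vma_end_read() */
	return vma;
}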

> 
> >
> > Certainly a single lookup and copy will be faster than filling and
> > copying a 4k buffer, but you will be walking the tree O(n) times, where
> > n is the vma count.  This isn't as efficient as multiple lookups in a
> > row, since each call re-walks from the top of the tree.  You will also
> > need to contend with the fact that the chance of the vmas changing
> > between calls is much higher here too - if that's an issue.  Neither of
> > these issues goes away with use of the rcu locking instead of the mmap
> > lock, but we can be quite certain that we won't cause locking
> > contention.
> 
> You are right about O(n) times, but note that for the symbolization
> cases I'm describing, this n will generally be *much* smaller than the
> total number of VMAs within the process. It's a huge speed-up in
> practice. This is because we pre-sort addresses in user-space, and then
> we query the VMA covering the first address, but then we quickly skip
> all the other addresses that are already covered by this VMA, and so
> the next request will query a new VMA that covers another subset of
> addresses. This way we'll get the minimal number of VMAs that cover the
> captured addresses (which in the case of stack traces would be a few
> VMAs belonging to executable sections of the process' binary plus a
> bunch of shared libraries).

This also implies you won't have to worry about shifting addresses?  I'd
think that holding a reference to the mm means none of these are going
to be changing at the point of the calls (not exiting).

Given your use case, I'm surprised you're looking for the next vma at
all.
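
For illustration, the user-space loop described above might look roughly
like this (query_vma() and struct vma_info are hypothetical stand-ins
for the proposed ioctl's interface, not the actual API from the patch
set):

/*
 * Hedged user-space sketch of the sorted-address symbolization loop.
 * query_vma() is a hypothetical wrapper around the proposed
 * covering-or-next query; struct vma_info is likewise made up here.
 */
#include <stdlib.h>

struct vma_info {
	unsigned long start;
	unsigned long end;
	/* ... name, file offset, etc. ... */
};

/* hypothetical: fills *vma with the VMA covering @addr or the next one */
int query_vma(unsigned long addr, struct vma_info *vma);

static int cmp_addr(const void *a, const void *b)
{
	unsigned long x = *(const unsigned long *)a;
	unsigned long y = *(const unsigned long *)b;

	return x < y ? -1 : x > y;
}

static void symbolize(unsigned long *addrs, size_t n)
{
	struct vma_info vma;
	size_t i = 0;

	/* pre-sort so each VMA is queried at most once */
	qsort(addrs, n, sizeof(*addrs), cmp_addr);

	while (i < n) {
		/* one tree walk per distinct VMA, not per address */
		if (query_vma(addrs[i], &vma) < 0)
			break;
		/*
		 * Consume every captured address this VMA covers; addresses
		 * below vma.start fall in an unmapped gap and are skipped
		 * as unresolved.
		 */
		while (i < n && addrs[i] < vma.end)
			i++;
	}
}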

Thanks,
Liam




