Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Tue, 7 May 2024 12:00:43 -0700

On Tue, May 7, 2024 at 11:06 AM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
>
> * Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> [240507 12:28]:
> > On Tue, May 7, 2024 at 8:49 AM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
> > >
> > > .. Adding Suren & Willy to the Cc
> > >
> > > * Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> [240504 18:14]:
> > > > On Sat, May 4, 2024 at 8:32 AM Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > > > I also did an strace run of both cases. In text-based one the tool did
> > > > > > 68 read() syscalls, fetching up to 4KB of data in one go.
> > > > >
> > > > > Why not fetch more at once?
> > > > >
> > > >
> > > > I didn't expect to be interrogated so much on the performance of the
> > > > text parsing front, sorry. :) You can probably tune this, but where is
> > > > the reasonable limit? 64KB? 256KB? 1MB? See below for some more
> > > > production numbers.
> > >
> > > The reason the file reads are limited to 4KB is because this file is
> > > used for monitoring processes.  We have a significant number of
> > > organisations polling this file so frequently that the mmap lock
> > > contention becomes an issue. (reading a file is free, right?)  People
> > > also tend to try to figure out why a process is slow by reading this
> > > file - which amplifies the lock contention.
> > >
> > > What happens today is that the lock is yielded after 4KB to allow time
> > > for mmap writes to happen.  This also means your data may be
> > > inconsistent from one 4KB block to the next (the write may be around
> > > this boundary).
> > >
> > > This new interface also takes the lock in do_procmap_query() and does
> > > the 4kb blocks as well.  Extending this size means more time spent
> > > blocking mmap writes, but a more consistent view of the world (less
> > > "tearing" of the addresses).
> >
> > Hold on. There is no 4KB in the new ioctl-based API I'm adding. It
> > does a single VMA look up (presumably O(logN) operation) using a
> > single vma_iter_init(addr) + vma_next() call on vma_iterator.
>
> Sorry, I read this:
>
> +       if (usize > PAGE_SIZE)
> +               return -E2BIG;
>
> And thought you were going to return many vmas in that buffer.  I see
> now that you are doing one copy at a time.
>
> >
> > As for the mmap_read_lock_killable() (is that what we are talking
> > about?), I'm happy to use anything else available, please give me a
> > pointer. But I suspect given how fast and small this new API is,
> > mmap_read_lock_killable() in it is not comparable to holding it for
> > producing /proc/<pid>/maps contents.
>
> Yes, mmap_read_lock_killable() is the mmap lock (formally known as the
> mmap sem).
>
> You can see examples of avoiding the mmap lock by use of rcu in
> mm/memory.c lock_vma_under_rcu() which is used in the fault path.
> userfaultfd has an example as well. But again, remember that not all
> archs have this functionality, so you'd need to fall back to full mmap
> locking.

Thanks for the pointer (didn't see email when replying on the other thread).

I looked at lock_vma_under_rcu() quickly, and seems like it's designed
to find VMA that covers given address, but not the next closest one.
So it's a bit problematic for the API I'm adding, as
PROCFS_PROCMAP_EXACT_OR_NEXT_VMA (which I can rename to
COVERING_OR_NEXT_VMA, if necessary), is quite important for the use
cases we have. But maybe some variation of lock_vma_under_rcu() can be
added that would fit this case?

>
> Certainly a single lookup and copy will be faster than a 4k buffer
> filling copy, but you will be walking the tree O(n) times, where n is
> the vma count.  This isn't as efficient as multiple lookups in a row as
> we will re-walk from the top of the tree. You will also need to contend
> with the fact that the chance of the vmas changing between calls is much
> higher here too - if that's an issue. Neither of these issues go away
> with use of the rcu locking instead of the mmap lock, but we can be
> quite certain that we won't cause locking contention.

You are right about O(n) times, but note that for symbolization cases
I'm describing, this n will be, generally, *much* smaller than a total
number of VMAs within the process. It's a huge speed up in practice.
This is because we pre-sort addresses in user-space, and then we query
VMA for the first address, but then we quickly skip all the other
addresses that are already covered by this VMA, and so the next
request will query a new VMA that covers another subset of addresses.
This way we'll get the minimal number of VMAs that cover captured
addresses (which in the case of stack traces would be a few VMAs
belonging to executable sections of process' binary plus a bunch of
shared libraries).

>
> Thanks,
> Liam
>