Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

"Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx> · Tue, 7 May 2024 14:06:10 -0400

* Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> [240507 12:28]:
> On Tue, May 7, 2024 at 8:49 AM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
> >
> > .. Adding Suren & Willy to the Cc
> >
> > * Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> [240504 18:14]:
> > > On Sat, May 4, 2024 at 8:32 AM Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > > I also did an strace run of both cases. In text-based one the tool did
> > > > > 68 read() syscalls, fetching up to 4KB of data in one go.
> > > >
> > > > Why not fetch more at once?
> > > >
> > >
> > > I didn't expect to be interrogated so much on the performance of the
> > > text parsing front, sorry. :) You can probably tune this, but where is
> > > the reasonable limit? 64KB? 256KB? 1MB? See below for some more
> > > production numbers.
> >
> > The reason the file reads are limited to 4KB is because this file is
> > used for monitoring processes.  We have a significant number of
> > organisations polling this file so frequently that the mmap lock
> > contention becomes an issue. (reading a file is free, right?)  People
> > also tend to try to figure out why a process is slow by reading this
> > file - which amplifies the lock contention.
> >
> > What happens today is that the lock is yielded after 4KB to allow time
> > for mmap writes to happen.  This also means your data may be
> > inconsistent from one 4KB block to the next (the write may be around
> > this boundary).
> >
> > This new interface also takes the lock in do_procmap_query() and does
> > the 4kb blocks as well.  Extending this size means more time spent
> > blocking mmap writes, but a more consistent view of the world (less
> > "tearing" of the addresses).
> 
> Hold on. There is no 4KB in the new ioctl-based API I'm adding. It
> does a single VMA look up (presumably O(logN) operation) using a
> single vma_iter_init(addr) + vma_next() call on vma_iterator.

Sorry, I read this:

+	if (usize > PAGE_SIZE)
+		return -E2BIG;

And thought you were going to return many vmas in that buffer.  I see
now that you are doing one copy at a time.

> 
> As for the mmap_read_lock_killable() (is that what we are talking
> about?), I'm happy to use anything else available, please give me a
> pointer. But I suspect given how fast and small this new API is,
> mmap_read_lock_killable() in it is not comparable to holding it for
> producing /proc/<pid>/maps contents.

Yes, mmap_read_lock_killable() is the mmap lock (formally known as the
mmap sem).

You can see examples of avoiding the mmap lock by use of rcu in
mm/memory.c lock_vma_under_rcu() which is used in the fault path.
userfaultfd has an example as well. But again, remember that not all
archs have this functionality, so you'd need to fall back to full mmap
locking.

Certainly a single lookup and copy will be faster than a 4k buffer
filling copy, but you will be walking the tree O(n) times, where n is
the vma count.  This isn't as efficient as multiple lookups in a row as
we will re-walk from the top of the tree. You will also need to contend
with the fact that the chance of the vmas changing between calls is much
higher here too - if that's an issue. Neither of these issues go away
with use of the rcu locking instead of the mmap lock, but we can be
quite certain that we won't cause locking contention.

Thanks,
Liam