Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Sat, 4 May 2024 15:13:25 -0700

On Sat, May 4, 2024 at 8:32 AM Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > I also did an strace run of both cases. In text-based one the tool did
> > 68 read() syscalls, fetching up to 4KB of data in one go.
>
> Why not fetch more at once?
>

I didn't expect to be interrogated so much on the text-parsing
performance front, sorry. :) You can probably tune this, but where is
the reasonable limit? 64KB? 256KB? 1MB? See below for some more
production numbers.

> And I have a fun 'readfile()' syscall implementation around here that
> needs justification to get merged (I try so every other year or so) that
> can do the open/read/close loop in one call, with the buffer size set by
> userspace if you really are saying this is a "hot path" that needs that
> kind of speedup.  But in the end, io_uring usually is the proper api for
> that instead, why not use that here instead of slow open/read/close if
> you care about speed?
>

I'm not sure what I need to say here. I'm sure it will be useful, but
as I already explained, it's not about whether it's a text file or
not, it's about having to read too much information that's completely
irrelevant. Again, see below for another data point.

> > In comparison,
> > ioctl-based implementation had to do only 6 ioctl() calls to fetch all
> > relevant VMAs.
> >
> > It is projected that savings from processing big production applications
> > would only widen the gap in favor of binary-based querying ioctl API, as
> > bigger applications will tend to have even more non-executable VMA
> > mappings relative to executable ones.
>
> Define "bigger applications" please.  Is this some "large database
> company workload" type of thing, or something else?

I don't have a definition. But I had in mind, as one example, an
ads-serving service we use internally (it's a pretty large application
by pretty much any metric you can come up with). I just randomly
picked one of the production hosts, found one instance of that
service, and looked at its /proc/<pid>/maps file. Hopefully it will
satisfy your need for specifics.

# cat /proc/1126243/maps | wc -c
1570178
# cat /proc/1126243/maps | wc -l
28875
# cat /proc/1126243/maps | grep ' ..x. ' | wc -l
7347

You can see that the maps file itself is about 1.5MB of text (which
means single-shot reading of its entire contents is a bit unrealistic,
though, sure, why not). The process contains 28875 VMAs, out of which
only 7347 are executable.
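
For reference, the grep above keys on the permissions column. Here is a
minimal Python sketch of the same count, assuming the usual
"start-end perms offset dev inode [path]" line layout. count_vmas() and
the sample text are hypothetical, made up for illustration; they are
not part of the patch set, and unlike the grep the sketch looks only at
the perms field rather than matching anywhere in the line.

```python
def count_vmas(maps_text):
    """Return (total, executable) VMA counts for /proc/<pid>/maps text."""
    total = 0
    executable = 0
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        total += 1
        perms = fields[1]  # second column, e.g. "r-xp"
        if len(perms) >= 3 and perms[2] == "x":
            executable += 1
    return total, executable


# Made-up sample, NOT taken from the production host above.
SAMPLE = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/dbus-daemon
00651000-00652000 r--p 00051000 08:02 173521 /usr/bin/dbus-daemon
00652000-00655000 rw-p 00052000 08:02 173521 /usr/bin/dbus-daemon
7ffc1c5f2000-7ffc1c5f4000 r-xp 00000000 00:00 0 [vdso]
"""

if __name__ == "__main__":
    total, executable = count_vmas(SAMPLE)
    print(f"{total} VMAs, {executable} executable")
```

Note that even this trivial count has to walk every line of the file,
executable or not.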

This means if we were to profile this process (and normally we profile
the entire system, so it's almost never a single /proc/<pid>/maps file
that needs to be opened and processed), we'd need *at most* (absolute
worst case!) 7347/28875 = 25.4% of entries. In reality, most code will
be concentrated in a much smaller number of executable VMAs, of course.
But no, I don't have specific numbers at hand, sorry.
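
To make the arithmetic easy to re-check, here is a small Python sketch
using only the wc/grep numbers quoted above. exec_fraction() and
min_reads() are hypothetical helper names. min_reads() is only a lower
bound that assumes every read() fills the buffer completely; the kernel
may well return less per call, so real syscall counts can be higher.

```python
import math

MAPS_BYTES = 1570178  # "wc -c" output above
TOTAL_VMAS = 28875    # "wc -l" output above
EXEC_VMAS = 7347      # executable VMAs per the grep above


def exec_fraction(exec_vmas, total_vmas):
    """Worst-case share of VMAs a profiler could care about."""
    return exec_vmas / total_vmas


def min_reads(total_bytes, buf_size):
    """Lower bound on read() calls, assuming each one fills the buffer."""
    return math.ceil(total_bytes / buf_size)


if __name__ == "__main__":
    print(f"executable share: {exec_fraction(EXEC_VMAS, TOTAL_VMAS):.1%}")
    for size in (4096, 65536, 262144, 1048576):
        print(f"{size:>8} byte buffer: >= {min_reads(MAPS_BYTES, size)} reads")
```

A bigger buffer cuts the syscall count, but every byte of the file
still has to be produced by the kernel and parsed in user space.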

It matters less whether it's text or binary (though binary undoubtedly
will be faster, it's strange to even argue about this), it's the
ability to fetch only relevant VMAs that is the point here.

>
> thanks,
>
> greg k-h