On Fri, Mar 17, 2023 at 2:21 PM Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Mar 17, 2023 at 09:14:03PM +0000, Al Viro wrote:
> > On Fri, Mar 17, 2023 at 09:33:17AM -0700, Andrii Nakryiko wrote:
> > >
> > > > But build IDs are _generally_ available. The only problem (AIUI)
> > > > is when you're trying to examine the contents of one container from
> > > > another container. And to solve that problem, you're imposing a cost
> > > > on everybody else with (so far) pretty vague justifications. I really
> > > > don't like to see you growing struct file for this (nor struct inode,
> > > > nor struct vm_area_struct). It's all quite unsatisfactory and I don't
> > > > have a good suggestion.
> > >
> > > There is a lot of profiling, observability and debugging tooling built
> > > using BPF. And when capturing stack traces from BPF programs, if the
> > > build ID note is not physically present in memory, fetching it from
> > > the BPF program might fail in NMI (and other non-faultable contexts).
> > > This patch set is about making sure we can always fetch the build ID,
> > > even from the most restrictive environments. It's guarded by Kconfig
> > > to avoid adding 8 bytes of overhead to struct file for environments
> > > where this might be unacceptable, giving users and distros a choice.
> >
> > Lovely. As an exercise you might want to collect the stats on the
> > number of struct file instances on the system vs. the number of files
> > that happen to be ELF objects and are currently mmapped anywhere.

That's a good suggestion. I wrote a simple script using the drgn tool ([0]),
which makes it easy to introspect kernel memory state on a running kernel.
The script is at the bottom ([1]) for anyone to sanity check. I didn't try
to figure out which files are mmaped as executable and which aren't, so
let's take the worst case and assume that none of them are executable, and
thus that the 8-byte pointer is wasted for all of them.
On my devserver I got:

task_cnt=15984 uniq_file_cnt=56780

On a randomly chosen production host I got:

task_cnt=6387 uniq_file_cnt=22514

So it seems my devserver is "busier" than the production host. :)

The numbers above suggest that my devserver's kernel has about 57,000
*unique* `struct file *` instances. That's 450KB of overhead, which is not
much by any modern standard. But let's say I'm way off and we have 1 million
struct files: that's 8MB of overhead. I'd argue that 8MB is not a big deal
even on a normal laptop, let alone on production servers. And if a system
really has 1 million active struct file instances, far more memory is
already being spent on application-specific needs.

> > That does depend upon the load, obviously, but it's not hard to collect -
> > you already have more than enough hooks inserted in the relevant places.
> > That might give a better appreciation of the reactions...
>
> One possibility would be a bit stolen from inode flags + hash keyed by
> struct inode address (middle bits make for a decent hash function);
> inode eviction would check that bit and kick the corresponding thing
> from hash if the bit is set.
>
> Associating that thing with inode => hash lookup/insert + set the bit.

This is an interesting idea, but it runs into a few unnecessary problems.
We'd need a global, dynamically sized hash map in the system. If we fix the
number of buckets, we risk either wasting memory on an underutilized system
(if we oversize), or hitting performance problems due to collisions on a
busy system with lots of executables mapped in memory (if we undersize). If
we don't pre-size, then we are talking about reallocations, rehashing, and
doing all of that under a global lock or something like that.
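For illustration, here's a quick Python model of the "middle bits" hashing
idea (purely hypothetical code, not from the kernel; the 64-byte alignment
shift and all names are my assumptions): slab-allocated struct inode
addresses are aligned, so their low bits carry no information, and their
high bits are nearly constant; the middle bits are what spread objects
across buckets.

```python
# Hypothetical sketch of hashing by struct inode address.
INODE_ALIGN_SHIFT = 6  # assume 64-byte alignment, so low 6 bits are always 0

def inode_hash(inode_addr: int, nbuckets: int) -> int:
    """Map an inode address to a bucket using its middle bits."""
    assert nbuckets & (nbuckets - 1) == 0, "nbuckets must be a power of two"
    return (inode_addr >> INODE_ALIGN_SHIFT) & (nbuckets - 1)

# Consecutively allocated (64-byte-apart) addresses land in distinct buckets.
addrs = [0xffff888123456000 + i * 64 for i in range(8)]
print([inode_hash(a, 1024) for a in addrs])  # eight distinct bucket indices
```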
Further, we'd have to take locks on the buckets, which causes problems for
looking up the build ID from this hash map in NMI context for perf events
and BPF programs: locks can't be safely taken under those conditions, so
fetching the build ID would still be unreliable (though less so than it is
today, of course).

All of this is solvable to some degree (though not perfectly, and not with
a simple and elegant approach), but it seems like an unnecessary
overcomplication compared to the amount of memory we hope to save. It still
feels like a Kconfig-guarded 8-byte field per struct file is a reasonable
price for gaining reliable build ID information for profiling/tracing
tools.

[0] https://drgn.readthedocs.io/en/latest/index.html

[1] Script I used:

from drgn.helpers.linux.pid import for_each_task
from drgn.helpers.linux.fs import for_each_file

task_cnt = 0
file_set = set()

for task in for_each_task(prog):
    task_cnt += 1
    try:
        for (fd, file) in for_each_file(task):
            file_set.add(file.value_())
    except Exception:
        pass

uniq_file_cnt = len(file_set)

print(f"task_cnt={task_cnt} uniq_file_cnt={uniq_file_cnt}")
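As a sanity check on the memory-overhead arithmetic earlier in this reply
(plain back-of-the-envelope math, nothing kernel-specific):

```python
PTR_SIZE = 8  # one extra pointer per struct file on a 64-bit kernel

def overhead_bytes(num_files: int) -> int:
    """Total cost of adding an 8-byte field to every struct file."""
    return num_files * PTR_SIZE

print(overhead_bytes(57_000))     # 456000 bytes, i.e. ~450KB
print(overhead_bytes(1_000_000))  # 8000000 bytes, i.e. 8MB
```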