[LSF/MM/BPF TOPIC] perf tools issues with BPF

Namhyung Kim <namhyung@xxxxxxxxxx> · Fri, 23 Feb 2024 12:09:29 -0800

Hello,

I'd like to discuss a few BPF-related issues in perf tools (and the
kernel). The perf tools already make use of BPF programs for various
tracing and filtering work.  While these are all great, there is still
room for improvement.

1. Allowing unprivileged access to BPF for perf events.

The perf_event subsystem allows non-root (!CAP_PERFMON) users to open
events, with restrictions, in order to measure performance counters for
their own processes.  On the other hand, the BPF event filter [1] can
be used to accept or reject samples based on the content of the
sample. It's almost the same as the classic BPF socket filter.  But
without CAP_BPF, normal users cannot use the BPF filter for their perf
events.
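
To make the accept/reject model concrete, here is a rough userspace
sketch in plain C.  It is not the actual perf or BPF code: struct
sample, struct term and filter_sample() are hypothetical names invented
for this illustration.  The point is only that such a filter needs
nothing beyond the fields of the sample it is handed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a perf sample; not a kernel struct. */
struct sample {
	uint64_t ip;
	uint64_t period;
	uint32_t pid;
	uint32_t cpu;
};

enum field { F_IP, F_PERIOD, F_PID, F_CPU };
enum op { OP_EQ, OP_NE, OP_GT, OP_LT };

/* One filter term, e.g. "period > 1000". */
struct term {
	enum field field;
	enum op op;
	uint64_t value;
};

/* Read one field of the sample as a 64-bit value. */
static uint64_t sample_field(const struct sample *s, enum field f)
{
	switch (f) {
	case F_IP:     return s->ip;
	case F_PERIOD: return s->period;
	case F_PID:    return s->pid;
	case F_CPU:    return s->cpu;
	}
	return 0;
}

/* Return true to keep the sample, false to drop it.  All terms must match. */
static bool filter_sample(const struct sample *s,
			  const struct term *terms, size_t n)
{
	assert(s != NULL);

	for (size_t i = 0; i < n; i++) {
		uint64_t v = sample_field(s, terms[i].field);
		bool ok;

		switch (terms[i].op) {
		case OP_EQ: ok = (v == terms[i].value); break;
		case OP_NE: ok = (v != terms[i].value); break;
		case OP_GT: ok = (v > terms[i].value); break;
		case OP_LT: ok = (v < terms[i].value); break;
		default:    ok = false; break;
		}
		if (!ok)
			return false;
	}
	return true;
}
```

With terms for "period > 1000" and "pid == 42", a sample with period
5000 from pid 42 is kept, and any sample failing either term is dropped.
The filter only reads the sample and touches no other kernel state.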

I noticed there's ongoing work on the BPF token for unprivileged use
cases, but it seems to focus on "trusted" container use cases, and
I'm not sure it would fit the perf use case well.  Note that this
case would need to allow arbitrary users and therefore needs
functionality limited to accessing the given sample data only.

2. Enhancing stack trace

Sometimes a BPF program fails to get the build-ID and offset for user
stack traces because of mmap_lock contention.  As BPF programs can run
in atomic context, they cannot wait for the lock to get the build-ID
and offset.  There's also a chance of hitting a page fault on a user
page, which makes the stack trace stop as well.

I wonder if we can improve this situation using the deferred stack
trace proposed for SFrame [2] last year.  IIUC it wasn't designed
with BPF in mind, but I think it can be useful for stack traces with
frame pointers (FP) too.  It would also avoid duplicating the same
user stack if the process runs in kernel context for a while.  The
question is how to defer the stack traces and how to connect them.

Another (minor) issue with stack traces is a missing helper.  IIUC
there are 3 stack trace helpers: bpf_get_stack(), bpf_get_stackid()
and bpf_get_task_stack().  But I think it'd be useful to have one more
helper (bpf_get_task_stackid) that returns a single ID value for a
stack trace of the given task.
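
To illustrate what "a single ID value for a stack trace" buys, here is
a small userspace sketch in plain C.  It is not the kernel's stack map
implementation: struct stack_table, get_stackid(), the FNV-1a hash and
the table sizes are all assumptions made for this example.  Identical
stacks map to the same small ID, so the full list of addresses is
stored only once.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_DEPTH   16
#define TABLE_SIZE  64	/* arbitrary sizes, for illustration only */

struct stack_entry {
	int used;
	uint32_t nr;
	uint64_t ips[MAX_DEPTH];
};

struct stack_table {
	struct stack_entry slots[TABLE_SIZE];
};

/* FNV-1a over the bytes of each address (chosen here for brevity). */
static uint32_t hash_stack(const uint64_t *ips, uint32_t nr)
{
	uint32_t h = 2166136261u;

	for (uint32_t i = 0; i < nr; i++) {
		for (int b = 0; b < 8; b++) {
			h ^= (uint8_t)(ips[i] >> (8 * b));
			h *= 16777619u;
		}
	}
	return h;
}

/*
 * Return a small ID for the stack, reusing the ID of an identical
 * stack seen before.  Returns -1 if the stack is too deep or the
 * table is full.  The table must be zero-initialized by the caller.
 */
static int get_stackid(struct stack_table *t, const uint64_t *ips, uint32_t nr)
{
	assert(t != NULL && ips != NULL);

	if (nr > MAX_DEPTH)
		return -1;

	uint32_t start = hash_stack(ips, nr) % TABLE_SIZE;

	/* Linear probing: stop at the first empty slot or exact match. */
	for (uint32_t i = 0; i < TABLE_SIZE; i++) {
		uint32_t id = (start + i) % TABLE_SIZE;
		struct stack_entry *e = &t->slots[id];

		if (!e->used) {
			e->used = 1;
			e->nr = nr;
			memcpy(e->ips, ips, nr * sizeof(*ips));
			return (int)id;
		}
		if (e->nr == nr && memcmp(e->ips, ips, nr * sizeof(*ips)) == 0)
			return (int)id;
	}
	return -1;
}
```

A consumer can then key its per-stack data by the returned ID instead
of carrying the whole address array around.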

My use case is the perf lock contention tool [3], which would like to
get the stack trace of the owner of contended mutexes.  Currently it
just returns the TID of the owner, but it'd be nice to get the owner's
stack trace directly, from when it went to sleep.

3. Lock symbol improvements

Actually this is not specific to BPF but applies to tracing in
general.  As I said, 'perf lock contention' uses BPF on a couple of
tracepoints to track lock contention in the kernel.  But one of the
problems is that there's no symbol information for the lock.  While
lockdep saves it in the lock data structure, enabling lockdep is not
an option in production.  As the tracepoint has the address of the
lock instance, the tool can check kallsyms for global locks, but
dynamically allocated locks are not handled.
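
As a rough sketch of the kallsyms-style lookup (plain userspace C, not
the perf implementation): struct sym, lookup_lock_symbol() and the
explicit end address are assumptions made for this example.  A lock
address is resolved by binary search over a table sorted by start
address, and an address that falls outside every symbol, as a
dynamically allocated lock would, simply yields no name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical symbol record covering the range [start, end). */
struct sym {
	uint64_t start;
	uint64_t end;
	const char *name;
};

/*
 * Binary search a table sorted by start address with non-overlapping
 * ranges.  Returns the symbol name, or NULL if addr is not covered.
 */
static const char *lookup_lock_symbol(const struct sym *syms, size_t n,
				      uint64_t addr)
{
	size_t lo = 0, hi = n;

	assert(syms != NULL || n == 0);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (addr < syms[mid].start)
			hi = mid;
		else if (addr >= syms[mid].end)
			lo = mid + 1;
		else
			return syms[mid].name;
	}
	return NULL;
}
```

The lookup itself is cheap.  The limitation is coverage, since only
statically defined locks have an entry to find.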

Currently the tool blindly tries to match the address against some
well-known locks (including mmap_lock) from the task struct or global
per-cpu symbols in BPF.  I'm curious if there's a better way to do it.
I was thinking about using BPF iterators to get the addresses of
well-known locks, but that cannot handle all cases and might be racy.

Looking forward to more discussion on the perf and tracing topic.

Thanks,
Namhyung

[1] https://lore.kernel.org/r/20230314234237.3008956-1-namhyung@xxxxxxxxxx/
[2] https://lore.kernel.org/r/d5def69b0c88bcbe2a85d0e1fd6cfca62b472ed4.1699487758.git.jpoimboe@xxxxxxxxxx/
[3] https://lore.kernel.org/r/20230207002403.63590-1-namhyung@xxxxxxxxxx/