Re: [PATCH 2/6] perf record: Enable off-cpu analysis with BPF

Namhyung Kim <namhyung@xxxxxxxxxx> · Thu, 19 May 2022 14:01:25 -0700

Hi Ian,

On Wed, May 18, 2022 at 8:58 PM Ian Rogers <irogers@xxxxxxxxxx> wrote:
>
> On Wed, May 18, 2022 at 3:47 PM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
> >
> > Add --off-cpu option to enable the off-cpu profiling with BPF.  It'd
> > use a bpf_output event and rename it to "offcpu-time".  Samples will
> > be synthesized at the end of the record session using data from a BPF
> > map which contains the aggregated off-cpu time at context switches.
> > So it needs root privilege to get the off-cpu profiling.
> >
> > Each sample will have a separate user stacktrace so it will skip
> > kernel threads.  The sample ip will be set from the stacktrace and
> > other sample data will be updated accordingly.  Currently it only
> > handles some basic sample types.
> >
> > The sample timestamp is set to a dummy value just not to bother with
> > other events during the sorting.  So it has a very big initial value
> > and increase it on processing each samples.
> >
> > Good thing is that it can be used together with regular profiling like
> > cpu cycles.  If you don't want to that, you can use a dummy event to
> > enable off-cpu profiling only.
> >
> > Example output:
> >   $ sudo perf record --off-cpu perf bench sched messaging -l 1000
> >
> >   $ sudo perf report --stdio --call-graph=no
> >   # Total Lost Samples: 0
> >   #
> >   # Samples: 41K of event 'cycles'
> >   # Event count (approx.): 42137343851
> >   ...
> >
> >   # Samples: 1K of event 'offcpu-time'
> >   # Event count (approx.): 587990831640
> >   #
> >   # Children      Self  Command          Shared Object       Symbol
> >   # ........  ........  ...............  ..................  .........................
> >   #
> >       81.66%     0.00%  sched-messaging  libc-2.33.so        [.] __libc_start_main
> >       81.66%     0.00%  sched-messaging  perf                [.] cmd_bench
> >       81.66%     0.00%  sched-messaging  perf                [.] main
> >       81.66%     0.00%  sched-messaging  perf                [.] run_builtin
> >       81.43%     0.00%  sched-messaging  perf                [.] bench_sched_messaging
> >       40.86%    40.86%  sched-messaging  libpthread-2.33.so  [.] __read
> >       37.66%    37.66%  sched-messaging  libpthread-2.33.so  [.] __write
> >        2.91%     2.91%  sched-messaging  libc-2.33.so        [.] __poll
> >   ...
> >
> > As you can see it spent most of off-cpu time in read and write in
> > bench_sched_messaging().  The --call-graph=no was added just to make
> > the output concise here.
> >
> > It uses perf hooks facility to control BPF program during the record
> > session rather than adding new BPF/off-cpu specific calls.
> >
> > Signed-off-by: Namhyung Kim <namhyung@xxxxxxxxxx>
>
> Acked-by: Ian Rogers <irogers@xxxxxxxxxx>
>
> > ---
> >  tools/perf/Documentation/perf-record.txt |  10 ++
> >  tools/perf/Makefile.perf                 |   1 +
> >  tools/perf/builtin-record.c              |  25 +++
> >  tools/perf/util/Build                    |   1 +
> >  tools/perf/util/bpf_off_cpu.c            | 204 +++++++++++++++++++++++
> >  tools/perf/util/bpf_skel/off_cpu.bpf.c   | 139 +++++++++++++++
> >  tools/perf/util/off_cpu.h                |  24 +++
> >  7 files changed, 404 insertions(+)
> >  create mode 100644 tools/perf/util/bpf_off_cpu.c
> >  create mode 100644 tools/perf/util/bpf_skel/off_cpu.bpf.c
> >  create mode 100644 tools/perf/util/off_cpu.h
> >
> > diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> > index 465be4e62a17..b4e9ef7edfef 100644
> > --- a/tools/perf/Documentation/perf-record.txt
> > +++ b/tools/perf/Documentation/perf-record.txt
> > @@ -758,6 +758,16 @@ include::intel-hybrid.txt[]
> >         If the URLs is not specified, the value of DEBUGINFOD_URLS
> >         system environment variable is used.
> >
> > +--off-cpu::
> > +       Enable off-cpu profiling with BPF.  The BPF program will collect
> > +       task scheduling information with (user) stacktrace and save them
> > +       as sample data of a software event named "offcpu-time".  The
> > +       sample period will have the time the task slept in nanoseconds.
> > +
> > +       Note that BPF can collect stack traces using frame pointer ("fp")
> > +       only, as of now.  So the applications built without the frame
> > +       pointer might see bogus addresses.
> > +
> >  SEE ALSO
> >  --------
> >  linkperf:perf-stat[1], linkperf:perf-list[1], linkperf:perf-intel-pt[1]
> > diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
> > index 6e5aded855cc..8f738e11356d 100644
> > --- a/tools/perf/Makefile.perf
> > +++ b/tools/perf/Makefile.perf
> > @@ -1038,6 +1038,7 @@ SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
> >  SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
> >  SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
> >  SKELETONS += $(SKEL_OUT)/bperf_cgroup.skel.h $(SKEL_OUT)/func_latency.skel.h
> > +SKELETONS += $(SKEL_OUT)/off_cpu.skel.h
> >
> >  $(SKEL_TMP_OUT) $(LIBBPF_OUTPUT):
> >         $(Q)$(MKDIR) -p $@
> > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > index a5cf6a99d67f..91f88501412e 100644
> > --- a/tools/perf/builtin-record.c
> > +++ b/tools/perf/builtin-record.c
> > @@ -49,6 +49,7 @@
> >  #include "util/clockid.h"
> >  #include "util/pmu-hybrid.h"
> >  #include "util/evlist-hybrid.h"
> > +#include "util/off_cpu.h"
> >  #include "asm/bug.h"
> >  #include "perf.h"
> >  #include "cputopo.h"
> > @@ -162,6 +163,7 @@ struct record {
> >         bool                    buildid_mmap;
> >         bool                    timestamp_filename;
> >         bool                    timestamp_boundary;
> > +       bool                    off_cpu;
> >         struct switch_output    switch_output;
> >         unsigned long long      samples;
> >         unsigned long           output_max_size;        /* = 0: unlimited */
> > @@ -903,6 +905,11 @@ static int record__config_text_poke(struct evlist *evlist)
> >         return 0;
> >  }
> >
> > +static int record__config_off_cpu(struct record *rec)
> > +{
> > +       return off_cpu_prepare(rec->evlist);
> > +}
> > +
> >  static bool record__kcore_readable(struct machine *machine)
> >  {
> >         char kcore[PATH_MAX];
> > @@ -2600,6 +2607,9 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> >         } else
> >                 status = err;
> >
> > +       if (rec->off_cpu)
> > +               rec->bytes_written += off_cpu_write(rec->session);
> > +
> >         record__synthesize(rec, true);
> >         /* this will be recalculated during process_buildids() */
> >         rec->samples = 0;
> > @@ -3324,6 +3334,7 @@ static struct option __record_options[] = {
> >         OPT_CALLBACK_OPTARG(0, "threads", &record.opts, NULL, "spec",
> >                             "write collected trace data into several data files using parallel threads",
> >                             record__parse_threads),
> > +       OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
> >         OPT_END()
> >  };
> >
> > @@ -3743,6 +3754,12 @@ int cmd_record(int argc, const char **argv)
> >         set_nobuild('\0', "vmlinux", true);
> >  # undef set_nobuild
> >  # undef REASON
> > +#endif
> > +
> > +#ifndef HAVE_BPF_SKEL
> > +# define set_nobuild(s, l, m, c) set_option_nobuild(record_options, s, l, m, c)
> > +       set_nobuild('\0', "off-cpu", "no BUILD_BPF_SKEL=1", true);
> > +# undef set_nobuild
> >  #endif
> >
> >         rec->opts.affinity = PERF_AFFINITY_SYS;
> > @@ -3981,6 +3998,14 @@ int cmd_record(int argc, const char **argv)
> >                 }
> >         }
> >
> > +       if (rec->off_cpu) {
> > +               err = record__config_off_cpu(rec);
> > +               if (err) {
> > +                       pr_err("record__config_off_cpu failed, error %d\n", err);
> > +                       goto out;
> > +               }
> > +       }
> > +
> >         if (record_opts__config(&rec->opts)) {
> >                 err = -EINVAL;
> >                 goto out;
> > diff --git a/tools/perf/util/Build b/tools/perf/util/Build
> > index 9a7209a99e16..a51267d88ca9 100644
> > --- a/tools/perf/util/Build
> > +++ b/tools/perf/util/Build
> > @@ -147,6 +147,7 @@ perf-$(CONFIG_LIBBPF) += bpf_map.o
> >  perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter.o
> >  perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter_cgroup.o
> >  perf-$(CONFIG_PERF_BPF_SKEL) += bpf_ftrace.o
> > +perf-$(CONFIG_PERF_BPF_SKEL) += bpf_off_cpu.o
> >  perf-$(CONFIG_BPF_PROLOGUE) += bpf-prologue.o
> >  perf-$(CONFIG_LIBELF) += symbol-elf.o
> >  perf-$(CONFIG_LIBELF) += probe-file.o
> > diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> > new file mode 100644
> > index 000000000000..9ed7aca3f4ac
> > --- /dev/null
> > +++ b/tools/perf/util/bpf_off_cpu.c
> > @@ -0,0 +1,204 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "util/bpf_counter.h"
> > +#include "util/debug.h"
> > +#include "util/evsel.h"
> > +#include "util/evlist.h"
> > +#include "util/off_cpu.h"
> > +#include "util/perf-hooks.h"
> > +#include "util/session.h"
> > +#include <bpf/bpf.h>
> > +
> > +#include "bpf_skel/off_cpu.skel.h"
> > +
> > +#define MAX_STACKS  32
> > +/* we don't need actual timestamp, just want to put the samples at last */
> > +#define OFF_CPU_TIMESTAMP  (~0ull << 32)
> > +
> > +static struct off_cpu_bpf *skel;
> > +
> > +struct off_cpu_key {
> > +       u32 pid;
> > +       u32 tgid;
> > +       u32 stack_id;
> > +       u32 state;
> > +};
> > +
> > +union off_cpu_data {
> > +       struct perf_event_header hdr;
> > +       u64 array[1024 / sizeof(u64)];
> > +};
> > +
> > +static int off_cpu_config(struct evlist *evlist)
> > +{
> > +       struct evsel *evsel;
> > +       struct perf_event_attr attr = {
> > +               .type   = PERF_TYPE_SOFTWARE,
> > +               .config = PERF_COUNT_SW_BPF_OUTPUT,
> > +               .size   = sizeof(attr), /* to capture ABI version */
> > +       };
> > +       char *evname = strdup(OFFCPU_EVENT);
> > +
> > +       if (evname == NULL)
> > +               return -ENOMEM;
> > +
> > +       evsel = evsel__new(&attr);
> > +       if (!evsel) {
> > +               free(evname);
> > +               return -ENOMEM;
> > +       }
> > +
> > +       evsel->core.attr.freq = 1;
> > +       evsel->core.attr.sample_period = 1;
> > +       /* off-cpu analysis depends on stack trace */
> > +       evsel->core.attr.sample_type = PERF_SAMPLE_CALLCHAIN;
> > +
> > +       evlist__add(evlist, evsel);
> > +
> > +       free(evsel->name);
> > +       evsel->name = evname;
> > +
> > +       return 0;
> > +}
> > +
> > +static void off_cpu_start(void *arg __maybe_unused)
> > +{
> > +       skel->bss->enabled = 1;
> > +}
> > +
> > +static void off_cpu_finish(void *arg __maybe_unused)
> > +{
> > +       skel->bss->enabled = 0;
> > +       off_cpu_bpf__destroy(skel);
> > +}
> > +
> > +int off_cpu_prepare(struct evlist *evlist)
> > +{
> > +       int err;
> > +
> > +       if (off_cpu_config(evlist) < 0) {
> > +               pr_err("Failed to config off-cpu BPF event\n");
> > +               return -1;
> > +       }
> > +
> > +       set_max_rlimit();
> > +
> > +       skel = off_cpu_bpf__open_and_load();
> > +       if (!skel) {
> > +               pr_err("Failed to open off-cpu BPF skeleton\n");
> > +               return -1;
> > +       }
> > +
> > +       err = off_cpu_bpf__attach(skel);
> > +       if (err) {
> > +               pr_err("Failed to attach off-cpu BPF skeleton\n");
> > +               goto out;
> > +       }
> > +
> > +       if (perf_hooks__set_hook("record_start", off_cpu_start, NULL) ||
> > +           perf_hooks__set_hook("record_end", off_cpu_finish, NULL)) {
>
> An off-topic thought here. I was looking at tool events and thinking
> that rather than have global state and the counter reading getting
> those global values, it would be nice if the tool events had private
> state that was created on open, freed on close and then did the
> appropriate thing for enable/read/disable. If we were object oriented
> I was thinking tool events could be a subclass of evsel that would
> override the appropriate functions. Something that strikes me as silly
> is that tool events have a dummy event file descriptor from
> perf_event_open, in part because the evsel code paths are shared.
> Anyway, if we made evsels have something like a virtual method table,
> we could do similar tricks with off-cpu and BPF created events.
> Possibly this could lead to better structured code and more reuse.

Thanks for your review.

Yeah I think it makes sense to change the offcpu-time event as a
tool event and to skip actual open/read/close operations.  It would
be a good follow up work.

Using a dummy event still has a syscall overhead and tool events
don't need the syscalls.  Maybe we can add some logic in the evlist
code to skip the tool events.

Thanks,
Namhyung