On 2024/10/28 14:23, Namhyung Kim wrote:
> On Wed, Oct 23, 2024 at 7:11 PM Tengda Wu <wutengda@xxxxxxxxxxxxxxx> wrote:
>>
>> On 2024/10/24 6:45, Namhyung Kim wrote:
>>> On Tue, Oct 22, 2024 at 05:39:23PM +0800, Tengda Wu wrote:
>>>>
>>>> On 2024/10/22 12:08, Namhyung Kim wrote:
>>>>> Hello,
>>>>>
>>>>> On Mon, Oct 21, 2024 at 11:02:00AM +0000, Tengda Wu wrote:
>>>>>> bperf has a nice ability to share PMUs, but it still does not support
>>>>>> inheriting events across fork(), resulting in some deviations in its
>>>>>> stat results compared with perf.
>>>>>>
>>>>>> perf stat result:
>>>>>>
>>>>>> $ ./perf stat -e cycles,instructions -- ./perf test -w sqrtloop
>>>>>>
>>>>>>  Performance counter stats for './perf test -w sqrtloop':
>>>>>>
>>>>>>      2,316,038,116      cycles
>>>>>>      2,859,350,725      instructions
>>>>>>
>>>>>>        1.009603637 seconds time elapsed
>>>>>>
>>>>>>        1.004196000 seconds user
>>>>>>        0.003950000 seconds sys
>>>>>>
>>>>>> bperf stat result:
>>>>>>
>>>>>> $ ./perf stat --bpf-counters -e cycles,instructions -- \
>>>>>>     ./perf test -w sqrtloop
>>>>>>
>>>>>>  Performance counter stats for './perf test -w sqrtloop':
>>>>>>
>>>>>>         18,762,093      cycles
>>>>>>         23,487,766      instructions
>>>>>>
>>>>>>        1.008913769 seconds time elapsed
>>>>>>
>>>>>>        1.003248000 seconds user
>>>>>>        0.004069000 seconds sys
>>>>>>
>>>>>> In order to support event inheritance, two new bpf programs are added
>>>>>> to monitor the fork and exit of tasks respectively. When a task is
>>>>>> created, add it to the filter map to enable counting, and reuse the
>>>>>> `accum_key` of its parent task so that it is counted together with the
>>>>>> parent. When a task exits, remove it from the filter map to disable
>>>>>> counting.
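>>>>>>
>>>>>> For illustration, the two new progs could look roughly like the sketch
>>>>>> below. This is only a sketch: those hunks are not quoted in this mail,
>>>>>> so the attach points, prog names and bperf_filter_value handling here
>>>>>> are guesses, not the patch itself.
>>>>>>
>>>>>> SEC("tp_btf/task_newtask")
>>>>>> int BPF_PROG(on_newtask, struct task_struct *task, u64 clone_flags)
>>>>>> {
>>>>>> 	struct bperf_filter_value *parent_fval;
>>>>>> 	struct bperf_filter_value child_fval = { 0, 0 };
>>>>>> 	__u32 parent_key, child_key;
>>>>>>
>>>>>> 	/* count the new task together with its parent: reuse the
>>>>>> 	 * parent's accum_key if the parent is in the filter map
>>>>>> 	 */
>>>>>> 	parent_key = bpf_get_current_pid_tgid() & 0xffffffff;
>>>>>> 	parent_fval = bpf_map_lookup_elem(&filter, &parent_key);
>>>>>> 	if (!parent_fval)
>>>>>> 		return 0;
>>>>>>
>>>>>> 	child_key = task->pid;
>>>>>> 	child_fval.accum_key = parent_fval->accum_key;
>>>>>> 	bpf_map_update_elem(&filter, &child_key, &child_fval, BPF_ANY);
>>>>>> 	return 0;
>>>>>> }
>>>>>>
>>>>>> SEC("tp_btf/sched_process_exit")
>>>>>> int BPF_PROG(on_exittask, struct task_struct *task)
>>>>>> {
>>>>>> 	__u32 key = task->pid;
>>>>>>
>>>>>> 	/* the task is gone, stop counting it */
>>>>>> 	bpf_map_delete_elem(&filter, &key);
>>>>>> 	return 0;
>>>>>> }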
>>>>>>
>>>>>> After support:
>>>>>>
>>>>>> $ ./perf stat --bpf-counters -e cycles,instructions -- \
>>>>>>     ./perf test -w sqrtloop
>>>>>>
>>>>>>  Performance counter stats for './perf test -w sqrtloop':
>>>>>>
>>>>>>      2,316,252,189      cycles
>>>>>>      2,859,946,547      instructions
>>>>>>
>>>>>>        1.009422314 seconds time elapsed
>>>>>>
>>>>>>        1.003597000 seconds user
>>>>>>        0.004270000 seconds sys
>>>>>>
>>>>>> Signed-off-by: Tengda Wu <wutengda@xxxxxxxxxxxxxxx>
>>>>>> ---
>>>>>>  tools/perf/builtin-stat.c                     |  1 +
>>>>>>  tools/perf/util/bpf_counter.c                 | 35 +++++--
>>>>>>  tools/perf/util/bpf_skel/bperf_follower.bpf.c | 98 +++++++++++++++++--
>>>>>>  tools/perf/util/bpf_skel/bperf_u.h            |  5 +
>>>>>>  tools/perf/util/target.h                      |  1 +
>>>>>>  5 files changed, 126 insertions(+), 14 deletions(-)
>>>>>>
>>>>>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>>>>>> index 3e6b9f216e80..8bc880479417 100644
>>>>>> --- a/tools/perf/builtin-stat.c
>>>>>> +++ b/tools/perf/builtin-stat.c
>>>>>> @@ -2620,6 +2620,7 @@ int cmd_stat(int argc, const char **argv)
>>>>>>  	} else if (big_num_opt == 0) /* User passed --no-big-num */
>>>>>>  		stat_config.big_num = false;
>>>>>>
>>>>>> +	target.inherit = !stat_config.no_inherit;
>>>>>>  	err = target__validate(&target);
>>>>>>  	if (err) {
>>>>>>  		target__strerror(&target, err, errbuf, BUFSIZ);
>>>>>> diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
>>>>>> index 7a8af60e0f51..73fcafbffc6a 100644
>>>>>> --- a/tools/perf/util/bpf_counter.c
>>>>>> +++ b/tools/perf/util/bpf_counter.c
>>>>>> @@ -394,6 +394,7 @@ static int bperf_check_target(struct evsel *evsel,
>>>>>>  }
>>>>>>
>>>>>>  static struct perf_cpu_map *all_cpu_map;
>>>>>> +static __u32 filter_entry_cnt;
>>>>>>
>>>>>>  static int bperf_reload_leader_program(struct evsel *evsel, int attr_map_fd,
>>>>>>  				       struct perf_event_attr_map_entry *entry)
>>>>>> @@ -444,12 +445,32 @@ static int bperf_reload_leader_program(struct evsel *evsel, int attr_map_fd,
>>>>>>  	return err;
>>>>>>  }
>>>>>>
>>>>>> +static int bperf_attach_follower_program(struct bperf_follower_bpf *skel,
>>>>>> +					 enum bperf_filter_type filter_type,
>>>>>> +					 bool inherit)
>>>>>> +{
>>>>>> +	struct bpf_link *link;
>>>>>> +	int err = 0;
>>>>>> +
>>>>>> +	if ((filter_type == BPERF_FILTER_PID ||
>>>>>> +	     filter_type == BPERF_FILTER_TGID) && inherit)
>>>>>> +		/* attach all follower bpf progs to enable event inheritance */
>>>>>> +		err = bperf_follower_bpf__attach(skel);
>>>>>> +	else {
>>>>>> +		link = bpf_program__attach(skel->progs.fexit_XXX);
>>>>>> +		if (IS_ERR(link))
>>>>>> +			err = PTR_ERR(link);
>>>>>> +	}
>>>>>> +
>>>>>> +	return err;
>>>>>> +}
>>>>>> +
>>>>>>  static int bperf__load(struct evsel *evsel, struct target *target)
>>>>>>  {
>>>>>>  	struct perf_event_attr_map_entry entry = {0xffffffff, 0xffffffff};
>>>>>>  	int attr_map_fd, diff_map_fd = -1, err;
>>>>>>  	enum bperf_filter_type filter_type;
>>>>>> -	__u32 filter_entry_cnt, i;
>>>>>> +	__u32 i;
>>>>>>
>>>>>>  	if (bperf_check_target(evsel, target, &filter_type, &filter_entry_cnt))
>>>>>>  		return -1;
>>>>>> @@ -529,9 +550,6 @@ static int bperf__load(struct evsel *evsel, struct target *target)
>>>>>>  	/* set up reading map */
>>>>>>  	bpf_map__set_max_entries(evsel->follower_skel->maps.accum_readings,
>>>>>>  				 filter_entry_cnt);
>>>>>> -	/* set up follower filter based on target */
>>>>>> -	bpf_map__set_max_entries(evsel->follower_skel->maps.filter,
>>>>>> -				 filter_entry_cnt);
>>>>>>  	err = bperf_follower_bpf__load(evsel->follower_skel);
>>>>>>  	if (err) {
>>>>>>  		pr_err("Failed to load follower skeleton\n");
>>>>>> @@ -543,6 +561,7 @@ static int bperf__load(struct evsel *evsel, struct target *target)
>>>>>>  	for (i = 0; i < filter_entry_cnt; i++) {
>>>>>>  		int filter_map_fd;
>>>>>>  		__u32 key;
>>>>>> +		struct bperf_filter_value fval = { i, 0 };
>>>>>>
>>>>>>  		if (filter_type == BPERF_FILTER_PID ||
>>>>>>  		    filter_type == BPERF_FILTER_TGID)
>>>>>> @@ -553,12 +572,14 @@ static int bperf__load(struct evsel *evsel, struct target *target)
>>>>>>  			break;
>>>>>>
>>>>>>  		filter_map_fd = bpf_map__fd(evsel->follower_skel->maps.filter);
>>>>>> -		bpf_map_update_elem(filter_map_fd, &key, &i, BPF_ANY);
>>>>>> +		bpf_map_update_elem(filter_map_fd, &key, &fval, BPF_ANY);
>>>>>>  	}
>>>>>>
>>>>>>  	evsel->follower_skel->bss->type = filter_type;
>>>>>> +	evsel->follower_skel->bss->inherit = target->inherit;
>>>>>>
>>>>>> -	err = bperf_follower_bpf__attach(evsel->follower_skel);
>>>>>> +	err = bperf_attach_follower_program(evsel->follower_skel, filter_type,
>>>>>> +					    target->inherit);
>>>>>>
>>>>>>  out:
>>>>>>  	if (err && evsel->bperf_leader_link_fd >= 0)
>>>>>> @@ -623,7 +644,7 @@ static int bperf__read(struct evsel *evsel)
>>>>>>  	bperf_sync_counters(evsel);
>>>>>>  	reading_map_fd = bpf_map__fd(skel->maps.accum_readings);
>>>>>>
>>>>>> -	for (i = 0; i < bpf_map__max_entries(skel->maps.accum_readings); i++) {
>>>>>> +	for (i = 0; i < filter_entry_cnt; i++) {
>>>>>>  		struct perf_cpu entry;
>>>>>>  		__u32 cpu;
>>>>>>
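>>>>>>
>>>>>> (The bperf_u.h hunk is not quoted in this mail; judging from the
>>>>>> "fval = { i, 0 }" initializer above, the new type presumably looks
>>>>>> something like the sketch below, with field names guessed:)
>>>>>>
>>>>>> struct bperf_filter_value {
>>>>>> 	__u32 accum_key;	/* index into accum_readings */
>>>>>> 	__u8 exited;		/* whether the task has exited */
>>>>>> };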
>>>>>> diff --git a/tools/perf/util/bpf_skel/bperf_follower.bpf.c b/tools/perf/util/bpf_skel/bperf_follower.bpf.c
>>>>>> index f193998530d4..0595063139a3 100644
>>>>>> --- a/tools/perf/util/bpf_skel/bperf_follower.bpf.c
>>>>>> +++ b/tools/perf/util/bpf_skel/bperf_follower.bpf.c
>>>>>> @@ -5,6 +5,8 @@
>>>>>>  #include <bpf/bpf_tracing.h>
>>>>>>  #include "bperf_u.h"
>>>>>>
>>>>>> +#define MAX_ENTRIES 102400
>>>>>> +
>>>>>>  struct {
>>>>>>  	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
>>>>>>  	__uint(key_size, sizeof(__u32));
>>>>>> @@ -22,25 +24,29 @@ struct {
>>>>>>  struct {
>>>>>>  	__uint(type, BPF_MAP_TYPE_HASH);
>>>>>>  	__uint(key_size, sizeof(__u32));
>>>>>> -	__uint(value_size, sizeof(__u32));
>>>>>> +	__uint(value_size, sizeof(struct bperf_filter_value));
>>>>>> +	__uint(max_entries, MAX_ENTRIES);
>>>>>> +	__uint(map_flags, BPF_F_NO_PREALLOC);
>>>>>>  } filter SEC(".maps");
>>>>>>
>>>>>>  enum bperf_filter_type type = 0;
>>>>>>  int enabled = 0;
>>>>>> +int inherit;
>>>>>>
>>>>>>  SEC("fexit/XXX")
>>>>>>  int BPF_PROG(fexit_XXX)
>>>>>>  {
>>>>>>  	struct bpf_perf_event_value *diff_val, *accum_val;
>>>>>>  	__u32 filter_key, zero = 0;
>>>>>> -	__u32 *accum_key;
>>>>>> +	__u32 accum_key;
>>>>>> +	struct bperf_filter_value *fval;
>>>>>>
>>>>>>  	if (!enabled)
>>>>>>  		return 0;
>>>>>>
>>>>>>  	switch (type) {
>>>>>>  	case BPERF_FILTER_GLOBAL:
>>>>>> -		accum_key = &zero;
>>>>>> +		accum_key = zero;
>>>>>>  		goto do_add;
>>>>>>  	case BPERF_FILTER_CPU:
>>>>>>  		filter_key = bpf_get_smp_processor_id();
>>>>>> @@ -49,22 +55,34 @@ int BPF_PROG(fexit_XXX)
>>>>>>  		filter_key = bpf_get_current_pid_tgid() & 0xffffffff;
>>>>>>  		break;
>>>>>>  	case BPERF_FILTER_TGID:
>>>>>> -		filter_key = bpf_get_current_pid_tgid() >> 32;
>>>>>> +		/* Use pid as the filter_key to exclude new task counts
>>>>>> +		 * when inherit is disabled. Don't worry about the existing
>>>>>> +		 * children in the TGID losing their counts: bpf_counter has
>>>>>> +		 * already added them to the filter map via perf_thread_map
>>>>>> +		 * before this bpf prog runs.
>>>>>> +		 */
>>>>>> +		filter_key = inherit ?
>>>>>> +			     bpf_get_current_pid_tgid() >> 32 :
>>>>>> +			     bpf_get_current_pid_tgid() & 0xffffffff;
>>>>>
>>>>> I'm not sure why this is needed. Isn't the existing code fine?
>>>>
>>>> No, it's not. If we don't modify it here, all child threads will always
>>>> be counted, even when inherit is disabled.
>>>>
>>>> Before explaining this modification, we may need to first clarify what
>>>> is included in the filter map:
>>>>
>>>> 1. The fexit_XXX prog decides whether to count by filter_key on each
>>>>    context switch. If the key is found in the filter map, the event is
>>>>    counted, otherwise not.
>>>> 2. The keys in the filter map are synchronized from the perf_thread_map
>>>>    in bperf__load().
>>>> 3. The threads in perf_thread_map are added through
>>>>    cmd_stat()->evlist__create_maps() before bperf__load().
>>>> 4. evlist__create_maps() fills perf_thread_map by traversing the
>>>>    /proc/%d/task directory, and these pids belong to the same tgid.
>>>>
>>>> Therefore, when the bperf command is issued, the filter map already
>>>> holds all existing threads with the same tgid as the specified process.
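>>>>
>>>> As a side note for readers, the filter_key choice hinges on the layout
>>>> of bpf_get_current_pid_tgid(), which packs the tgid into the upper 32
>>>> bits and the per-thread pid into the lower 32 bits. A tiny illustrative
>>>> snippet (not part of the patch):
>>>>
>>>> 	__u64 pid_tgid = bpf_get_current_pid_tgid();
>>>> 	__u32 tgid = pid_tgid >> 32;        /* process id, shared by all threads */
>>>> 	__u32 pid  = pid_tgid & 0xffffffff; /* thread id, unique per thread */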
>>>>
>>>> Now, let's take a look at the TGID case. We want the behavior to be as
>>>> follows:
>>>>
>>>>   * TGID w/ inherit : specified process + all children of the process
>>>>   * TGID w/o inherit: specified process (all threads in the process) only
>>>>
>>>> Assuming that a new thread is created during bperf stat, the new thread
>>>> should exhibit the following behavior in the fexit_XXX prog:
>>>>
>>>>   * TGID w/ inherit : do_add
>>>>   * TGID w/o inherit: skip and return
>>>>
>>>> Let's now trace the code before and after the modification.
>>>>
>>>> Before modification: (filter_key = tgid)
>>>>
>>>>   * TGID w/ inherit:
>>>>       create  : new thread
>>>>       enter   : fexit_XXX prog
>>>>       assign  : filter_key = new thread's tgid
>>>>       match   : bpf_map_lookup_elem(&filter, &filter_key)
>>>>       do_add
>>>>       (PASS)
>>>>
>>>>   * TGID w/o inherit:
>>>>       [...] /* same as above */
>>>>       do_add
>>>>       (FAILED, expect skip and return)
>>>>
>>>> After modification: (filter_key = inherit ? tgid : pid)
>>>>
>>>>   * TGID w/ inherit:
>>>>       create  : new thread
>>>>       enter   : fexit_XXX prog
>>>>       assign  : filter_key = new thread's tgid
>>>>       match   : bpf_map_lookup_elem(&filter, &filter_key)
>>>>       do_add
>>>>       (PASS)
>>>>
>>>>   * TGID w/o inherit:
>>>>       create  : new thread
>>>>       enter   : fexit_XXX prog
>>>>       assign  : filter_key = new thread's pid
>>>>       mismatch: bpf_map_lookup_elem(&filter, &filter_key)
>>>>       skip and return
>>>>       (PASS)
>>>>
>>>> As we can see, filter_key = tgid counts incorrectly in the TGID w/o
>>>> inherit case, and we need to change it to filter_key = pid to fix that.
>>>
>>> I'm sorry but I don't think I'm following. A new thread in TGID mode
>>> (regardless of inherit) should always be counted, right? Why do you
>>> expect to skip it?
>>
>> This is how perf itself behaves. To confirm it, I wrote a workload that
>> creates one new thread per second and then ran stat on it, as shown
>> below. You can see that in the 'TGID w/o inherit' case, perf does not
>> count the newly created threads.
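>>
>> (A minimal sketch of such a workload, reconstructed here for
>> illustration since the actual source is not included in this mail; each
>> new thread just burns cpu so there is something to count. Build with
>> "cc -pthread".)
>>
>> #include <pthread.h>
>> #include <stdio.h>
>> #include <unistd.h>
>> #include <sys/syscall.h>
>>
>> static void *spin(void *arg)
>> {
>> 	(void)arg;
>> 	printf("thread %ld: start\n", syscall(SYS_gettid));
>> 	for (;;)
>> 		;	/* busy loop until the workload is killed */
>> 	return NULL;
>> }
>>
>> int main(void)
>> {
>> 	pthread_t t;
>>
>> 	printf("thread %ld: start [main]\n", syscall(SYS_gettid));
>> 	for (;;) {
>> 		sleep(1);
>> 		pthread_create(&t, NULL, spin, NULL);
>> 	}
>> 	return 0;
>> }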
>>
>> Perf TGID w/ inherit:
>> ---
>> $ ./perf stat -e cpu-clock --timeout 5000 -- ./new_thread_per_second
>> thread 367444: start [main]
>> thread 367448: start
>> thread 367455: start
>> thread 367462: start
>> thread 367466: start
>> thread 367473: start
>> ./new_thread_per_second: Terminated
>>
>>  Performance counter stats for './new_thread_per_second':
>>
>>          10,017.71 msec cpu-clock
>>
>>        5.005538048 seconds time elapsed
>>
>>       10.018777000 seconds user
>>        0.000000000 seconds sys
>>
>> Perf TGID w/o inherit:
>> ---
>> $ ./perf stat -e cpu-clock --timeout 5000 -i -- ./new_thread_per_second
>> thread 366679: start [main]
>> thread 366686: start
>> thread 366693: start
>> thread 366697: start
>> thread 366704: start
>> thread 366708: start
>> ./new_thread_per_second: Terminated
>>
>>  Performance counter stats for './new_thread_per_second':
>>
>>               4.29 msec cpu-clock
>>
>>        5.005539338 seconds time elapsed
>>
>>       10.019673000 seconds user
>>        0.000000000 seconds sys
>>
>> Therefore, we also need to distinguish these cases in bperf so that its
>> results stay consistent with perf:
>>
>> Bperf TGID w/o inherit: (BEFORE FIX)
>> ---
>> $ ./perf stat --bpf-counters -e cpu-clock --timeout 5000 -i -- ./new_thread_per_second
>> thread 369127: start [main]
>> thread 369134: start
>> thread 369141: start
>> thread 369145: start
>> thread 369152: start
>> thread 369156: start
>> ./new_thread_per_second: Terminated
>>
>>  Performance counter stats for './new_thread_per_second':
>>
>>          10,019.05 msec cpu-clock
>>
>>        5.005567266 seconds time elapsed
>>
>>       10.018528000 seconds user
>>        0.000000000 seconds sys
>>
>> Bperf TGID w/o inherit: (AFTER FIX)
>> ---
>> $ ./perf stat --bpf-counters -e cpu-clock --timeout 5000 -i -- ./new_thread_per_second
>> thread 366616: start [main]
>> thread 366623: start
>> thread 366627: start
>> thread 366634: start
>> thread 366638: start
>> thread 366645: start
>> ./new_thread_per_second: Terminated
>>
>>  Performance counter stats for './new_thread_per_second':
>>
>>               4.95 msec cpu-clock
>>
>>        5.005511173 seconds time elapsed
>>
>>       10.018790000 seconds user
>>        0.000000000 seconds sys
>>
>> Thanks,
>> Tengda
>>
>
> Thanks for the explanation. OK, I think it's a limitation of the current
> perf_event implementation, which works at thread level. Even if we can
> count events at process level with bperf, it might be important to keep
> compatibility with the existing behavior.
>
> Thanks,
> Namhyung

Yeah, thanks for all your reviews!

Best regards,
Tengda