On Mon, Aug 05, 2024 at 04:31:34PM -0300, Arnaldo Carvalho de Melo wrote:
> On Mon, Aug 05, 2024 at 11:56:56AM -0700, Namhyung Kim wrote:
> > On Mon, Aug 05, 2024 at 12:03:14PM -0300, Arnaldo Carvalho de Melo wrote:
> > > On Fri, Aug 02, 2024 at 10:37:52AM -0700, Namhyung Kim wrote:
> > > > So far it used tgid as a key to get the filter expressions in the
> > > > pinned filters map for regular users but it won't work well if the
> > > > user has more than one filter at the same time.  Let's add the event
> > > > id to the key of the filter hash map so that it can identify the
> > > > right filter expression in the BPF program.
> > > >
> > > > As the event can be inherited by child tasks, it should use the
> > > > primary id which belongs to the parent (original) event.  Since evsel
> > > > opens the event for multiple CPUs and tasks, it needs to maintain a
> > > > separate hash map for the event id.
> > >
> > > I'm trying to test it now, it would be nice to have the series of
> > > events needed to test that the feature is working.
> >
> > Sure, I used the following command.
> >
> >   ./perf record -e cycles --filter 'ip < 0xffffffff00000000' -e instructions --filter 'period < 100000' -o- ./perf test -w noploop | ./perf script -i-
>
> Thanks
>
> > > Some comments below.
> > >
> > > > In the user space, it keeps a list for the multiple evsels and
> > > > releases the entries in both hash maps when it closes the event.
> > > >
> > > > Signed-off-by: Namhyung Kim <namhyung@xxxxxxxxxx>
> > > > ---
> > > >  tools/perf/util/bpf-filter.c                 | 288 ++++++++++++++++---
> > > >  tools/perf/util/bpf_skel/sample-filter.h     |  11 +-
> > > >  tools/perf/util/bpf_skel/sample_filter.bpf.c |  42 ++-
> > > >  tools/perf/util/bpf_skel/vmlinux/vmlinux.h   |   5 +
> > > >  4 files changed, 304 insertions(+), 42 deletions(-)
> > > >
> > > > diff --git a/tools/perf/util/bpf-filter.c b/tools/perf/util/bpf-filter.c
> > > > index c5eb0b7eec19..69b147cba969 100644
> > > > --- a/tools/perf/util/bpf-filter.c
> > > > +++ b/tools/perf/util/bpf-filter.c
> > > > @@ -1,4 +1,45 @@
> > > >  /* SPDX-License-Identifier: GPL-2.0 */
> > > > +/**
> > > > + * Generic event filter for sampling events in BPF.
> > > > + *
> > > > + * The BPF program is fixed and just reads filter expressions in the 'filters'
> > > > + * map and compares the sample data in order to reject samples that don't match.
> > > > + * Each filter expression contains a sample flag (term) to compare, an operation
> > > > + * (==, >=, and so on) and a value.
> > > > + *
> > > > + * Note that each entry has an array of filter repxressions and it only succeeds
> > >
> > > expressions
> >
> > Oops, thanks.
> >
> > > > + * when all of the expressions are satisfied.  But it supports the logical OR
> > > > + * using a GROUP operation which is satisfied when any of its member expressions
> > > > + * is evaluated to true.  But it doesn't allow nested GROUP operations for now.
> > > > + *
> > > > + * To support non-root users, the filters map can be loaded and pinned in the BPF
> > > > + * filesystem by root (perf record --setup-filter pin).  Then each user will get
> > > > + * a new entry in the shared filters map to fill the filter expressions.  And the
> > > > + * BPF program will find the filter using (task-id, event-id) as a key.
> > > > + *
> > > > + * The pinned BPF object (shared for regular users) has:
> > > > + *
> > > > + *                  event_hash                    |
> > > > + *                  |        |                    |
> > > > + *   event->id ---> |   id   | ---+    idx_hash   |    filters
> > > > + *                  |        |    |    |      |   |    |       |
> > > > + *                  |  ....  |    +--> |  idx | --+--> | exprs | ---> perf_bpf_filter_entry[]
> > > > + *                                |    |      |   |    |       |    .op
> > > > + *   task id (tgid) --------------+    | .... |   |    |  ...  |    .term (+ part)
> > > > + *                                                |                 .value
> > > > + *                                                |
> > > > + *  ======= (root would skip this part) ========  (compares it in a loop)
> > > > + *
> > > > + * This is used for per-task use cases while system-wide profiling (normally
> > > > + * from root user) uses a separate copy of the program and the maps of its own
> > > > + * so that it can proceed even if a lot of non-root users are using the filters
> > > > + * at the same time.  In this case the filters map has a single entry and there
> > > > + * is no need to use the hash maps to get the index (key) of the filters map
> > > > + * (IOW it's always 0).
> > > > + *
> > > > + * The BPF program returns 1 to accept the sample or 0 to drop it.
> > > > + * The 'dropped' map keeps track of how many samples were dropped by the filter
> > > > + * and that count will be reported as lost samples.
> > >
> > > I think there is value in reporting how many were filtered out, I'm
> > > just unsure about reporting it as "lost" samples, as "lost" has another
> > > semantic associated with it, i.e. the ring buffer was full or the
> > > sample couldn't be processed due to some other resource starvation
> > > issue, no?
> >
> > Then we need a way to save the information.  It could be a new record
> > type (PERF_RECORD_DROPPED_SAMPLES), a new misc flag in the lost samples
>
> I guess "PERF_RECORD_FILTERED_SAMPLES" would be better, more precise,
> wdyt?
>
> > record or a header field.  I prefer the misc flag.
>
> I think we can have both filtered and lost samples, so I would prefer
> the new record type.

I think we can have two LOST_SAMPLES records then - one with the new misc
flag for BPF and the other (without the flag) for the usual lost samples.
This would require minimal changes IMHO.

Thanks,
Namhyung

> > Also there should be a separate PERF_RECORD_LOST record in the middle
> > when there's an actual issue.  Without that we can say it's from the
> > BPF.
>
> Right, disambiguating filtered from lost samples is indeed useful.
>
> - Arnaldo
>
> > Thanks,
> > Namhyung
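
To make the lookup chain in the diagram concrete, here is a minimal
BPF-side sketch of how a program could resolve the filters-map index from
(task-id, event-id).  The map sizes and the idx_hash_key layout are
assumptions made for this illustration, not the actual definitions from
the patch (those live in sample_filter.bpf.c and sample-filter.h of the
series):

    /* Illustrative sketch only -- map names follow the diagram above,
     * but key layouts and sizes are assumptions, not the patch's code. */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    struct idx_hash_key {
            __u64 evt_id;   /* primary (parent) event id */
            __u32 tgid;     /* task id of the filter owner */
            __u32 __pad;
    };

    struct {
            __uint(type, BPF_MAP_TYPE_HASH);
            __uint(max_entries, 1024);
            __type(key, __u64);             /* per-cpu/per-task event id */
            __type(value, __u64);           /* primary event id */
    } event_hash SEC(".maps");

    struct {
            __uint(type, BPF_MAP_TYPE_HASH);
            __uint(max_entries, 1024);
            __type(key, struct idx_hash_key);
            __type(value, int);             /* index into the 'filters' map */
    } idx_hash SEC(".maps");

    /* Resolve the filters-map index for the current sample's event. */
    static int get_filter_idx(__u64 sample_evt_id)
    {
            struct idx_hash_key key = {
                    .tgid = bpf_get_current_pid_tgid() >> 32,
            };
            __u64 *primary_id;
            int *idx;

            /* inherited (child) events map back to the parent's primary id */
            primary_id = bpf_map_lookup_elem(&event_hash, &sample_evt_id);
            if (primary_id == NULL)
                    return -1;

            key.evt_id = *primary_id;
            idx = bpf_map_lookup_elem(&idx_hash, &key);

            return idx ? *idx : -1;
    }

In the root/system-wide case both lookups are skipped and index 0 is used
directly, as the diagram notes.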
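
On the two-LOST_SAMPLES-records idea, a sketch of how a consumer could
tell the two cases apart, assuming a hypothetical misc flag on the record
header (the flag name and bit value below are placeholders; no such flag
existed in the UAPI when this mail was written):

    /* One LOST_SAMPLES record carries a (hypothetical) misc flag for
     * BPF-filtered samples; the other, without the flag, keeps the usual
     * "ring buffer full" meaning. */
    #include <stdio.h>
    #include <linux/perf_event.h>

    #ifndef PERF_RECORD_MISC_LOST_SAMPLES_BPF
    /* placeholder name and bit value, chosen only for this example */
    #define PERF_RECORD_MISC_LOST_SAMPLES_BPF       (1 << 15)
    #endif

    struct lost_samples_event {
            struct perf_event_header header;
            __u64 lost;
            /* followed by sample_id fields when sample_id_all is set */
    };

    static void handle_lost_samples(const struct lost_samples_event *ev)
    {
            if (ev->header.misc & PERF_RECORD_MISC_LOST_SAMPLES_BPF)
                    printf("%llu samples dropped by the BPF filter\n",
                           (unsigned long long)ev->lost);
            else
                    printf("%llu samples lost (resource starvation)\n",
                           (unsigned long long)ev->lost);
    }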