Re: [RFC/PATCHSET 0/7] perf record: Implement BPF sample filter (v1)

Jiri Olsa <olsajiri@xxxxxxxxx> · Tue, 21 Feb 2023 12:54:10 +0100

On Tue, Feb 14, 2023 at 10:01:41AM -0800, Namhyung Kim wrote:
> Hi Ian,
> 
> On Tue, Feb 14, 2023 at 8:58 AM Ian Rogers <irogers@xxxxxxxxxx> wrote:
> >
> > On Mon, Feb 13, 2023 at 9:05 PM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
> > >
> > > Hello,
> > >
> > > There have been requests for more sophisticated perf event sample
> > > filtering based on the sample data.  Recently the kernel added BPF
> > > programs can access perf sample data and this is the userspace part
> > > to enable such a filtering.
> > >
> > > This still has some rough edges and needs more improvements.  But
> > > I'd like to share the current work and get some feedback for the
> > > directions and idea for further improvements.
> > >
> > > The kernel changes are in the tip.git tree (perf/core branch) for now.
> > > perf record has --filter option to set filters on the last specified
> > > event in the command line.  It worked only for tracepoints and Intel
> > > PT events so far.  This patchset extends it to have 'bpf:' prefix in
> > > order to enable the general sample filters using BPF for any events.
> > >
> > > A new filter expression parser was added (using flex/bison) to process
> > > the filter string.  Right now, it only accepts very simple expressions
> > > separated by comma.  I'd like to keep the filter expression as simple
> > > as possible.
> > >
> > > It requires samples satisfy all the filter expressions otherwise it'd
> > > drop the sample.  IOW filter expressions are connected with logical AND
> > > operations implicitly.
> > >
> > > Essentially the BPF filter expression is:
> > >
> > >   "bpf:" <term> <operator> <value> ("," <term> <operator> <value>)*
> > >
> > > The <term> can be one of:
> > >   ip, id, tid, pid, cpu, time, addr, period, txn, weight, phys_addr,
> > >   code_pgsz, data_pgsz, weight1, weight2, weight3, ins_lat, retire_lat,
> > >   p_stage_cyc, mem_op, mem_lvl, mem_snoop, mem_remote, mem_lock,
> > >   mem_dtlb, mem_blk, mem_hops
> > >
> > > The <operator> can be one of:
> > >   ==, !=, >, >=, <, <=, &
> > >
> > > The <value> can be one of:
> > >   <number> (for any term)
> > >   na, load, store, pfetch, exec (for mem_op)
> > >   l1, l2, l3, l4, cxl, io, any_cache, lfb, ram, pmem (for mem_lvl)
> > >   na, none, hit, miss, hitm, fwd, peer (for mem_snoop)
> > >   remote (for mem_remote)
> > >   na, locked (for mem_locked)
> > >   na, l1_hit, l1_miss, l2_hit, l2_miss, any_hit, any_miss, walk, fault (for mem_dtlb)
> > >   na, by_data, by_addr (for mem_blk)
> > >   hops0, hops1, hops2, hops3 (for mem_hops)
> > >
> > > I plan to improve it with range expressions like for ip or addr and it
> > > should support symbols like the existing addr-filters.  Also cgroup
> > > should understand and convert cgroup names to IDs.

this seems similar to what ftrace is doing in filter_match_preds,
I checked the code briefly and I wonder if we shoud be able to write
that function logic in bpf, assuming that the filter is prepared in
user space

it might solve the 'part' data problem in generic way.. but I might be
missing some blocker of course.. just an idea ;-)

could replace the tracepoint filters.. if we actually care

SNIP

> > > Note that the total aggregated stats show 1 LOST_SAMPLES event but
> > > per event stats show 3991 events because it's the actual number of
> > > dropped samples while the aggregated stats has the number of record.
> > > Maybe we need to change the per-event stats to 'LOST_SAMPLES count'
> > > to avoid the confusion.
> > >
> > > The code is available at 'perf/bpf-filter-v1' branch in my tree.
> > >
> > >   git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
> > >
> > > Again, you need tip/perf/core kernel for this to work.
> > > Any feedback is welcome.
> >
> > This is great! I wonder about related clean up:

+1

> >
> >  - can we remove BPF events as this is a better feature?
> >    - I believe BPF events are flaky, seldom used (with the exception
> > of the augmented syscalls for perf trace, which really should move to
> > a BPF skeleton as most people don't know how to use it) and they add a
> > bunch of complexity. A particular complexity I care about is that the
> > path separator forward slash ('/') is also the modifier separator for
> > events.
> 
> Well.. I actually never tried the BPF events myself :)
> I think we can deprecate it and get rid of it once the perf trace
> conversion is done.

+1 ;-) would be awesome

jirka