Adding cc to bpf@xxxxxxxxxxxxxxx mailing list since this is really
bpf related.

On 9/17/19 10:24 PM, jinshan.xiong@xxxxxxxxx wrote:
> From: Jinshan Xiong <jinshan.xiong@xxxxxxxxx>
>
> Invoking bpf program only if kprobe based perf_event has been added into
> the percpu list. This is essential to make event tracing for cgroup to work
> properly.

The issue actually exists for bpf programs with kprobe, uprobe,
tracepoint and trace_syscall_enter/exit. In all these places, the bpf
program is called regardless of whether there are perf events on this
cpu or not. This provides bpf programs more opportunities to see
events. I guess this is by design. Alexei/Daniel, could you clarify?

This, unfortunately, has a consequence for cgroups. Currently,
cgroup-based perf event filtering (PERF_FLAG_PID_CGROUP) won't work
for bpf programs attached to kprobe/uprobe/tracepoint/trace_syscall.
The reason is the same: bpf programs see more events than what perf
has configured. The overflow-interrupt (nmi) based perf_event bpf
programs are not impacted.

Any suggestions on the proper way to move forward?

One way is to start honoring only the events permitted by perf, like
this patch does. But I am not sure whether this contradicts the
original motivation for bpf programs to see all events regardless.

Another way is to do the filtering inside the bpf program. We already
have the bpf_get_cgroup_id() helper. We may need another helper to
check whether current is in (a descendant of) another cgroup, as it is
often the case that one wants to trace all processes under one parent
cgroup which may have many child cgroups. A rough sketch of such
in-program filtering is appended at the end of this mail.

>
> Signed-off-by: Jinshan Xiong <jinshan.xiong@xxxxxxxxx>
> ---
>  kernel/trace/trace_kprobe.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index 9d483ad9bb6c..40ef0f1945f7 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -1171,11 +1171,18 @@ static int
>  kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
>  {
>          struct trace_event_call *call = trace_probe_event_call(&tk->tp);
> +        struct hlist_head *head = this_cpu_ptr(call->perf_events);
>          struct kprobe_trace_entry_head *entry;
> -        struct hlist_head *head;
>          int size, __size, dsize;
>          int rctx;
>
> +        /*
> +         * If head is empty, the process currently running on this cpu is not
> +         * interested in the kprobe perf PMU.
> +         */
> +        if (hlist_empty(head))
> +                return 0;
> +
>          if (bpf_prog_array_valid(call)) {
>                  unsigned long orig_ip = instruction_pointer(regs);
>                  int ret;
> @@ -1193,10 +1200,6 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
>                  return 0;
>          }
>
> -        head = this_cpu_ptr(call->perf_events);
> -        if (hlist_empty(head))
> -                return 0;
> -
>          dsize = __get_data_size(&tk->tp, regs);
>          __size = sizeof(*entry) + tk->tp.size + dsize;
>          size = ALIGN(__size + sizeof(u32), sizeof(u64));
>
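
For reference, here is a minimal sketch of the in-program filtering
idea, using the bpf_get_current_cgroup_id() helper and the legacy
"struct bpf_map_def" map style. The map name, probed function and
loader protocol are all made up for illustration: userspace is assumed
to stat(2) the cgroup v2 directory of interest and store its st_ino
into slot 0 of target_cgid_map before attaching. Note this matches the
exact cgroup only, which is why a descendant-check helper would still
be needed for tracing a whole hierarchy.

#include <linux/bpf.h>
#include <linux/ptrace.h>
#include "bpf_helpers.h"        /* from tools/testing/selftests/bpf */

/* filled by the (hypothetical) userspace loader with the cgroup id */
struct bpf_map_def SEC("maps") target_cgid_map = {
        .type        = BPF_MAP_TYPE_ARRAY,
        .key_size    = sizeof(__u32),
        .value_size  = sizeof(__u64),
        .max_entries = 1,
};

SEC("kprobe/do_sys_open")
int trace_do_sys_open(struct pt_regs *ctx)
{
        __u32 key = 0;
        __u64 *target;

        target = bpf_map_lookup_elem(&target_cgid_map, &key);

        /* bail out unless current runs in the one cgroup of interest */
        if (!target || !*target ||
            bpf_get_current_cgroup_id() != *target)
                return 0;

        /* ... actual event collection would go here ... */
        return 0;
}

char _license[] SEC("license") = "GPL";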