Adding cc to bpf@xxxxxxxxxxxxxxx mailing list since this is really
bpf related.

On 9/17/19 10:24 PM, jinshan.xiong@xxxxxxxxx wrote:
> From: Jinshan Xiong <jinshan.xiong@xxxxxxxxx>
>
> Invoking bpf program only if kprobe based perf_event has been added into
> the percpu list. This is essential to make event tracing for cgroup to work
> properly.

The issue actually exists for bpf programs with kprobe, uprobe,
tracepoint and trace_syscall_enter/exit. In all these places, the bpf
program is called regardless of whether there are perf events on this
cpu or not. This provides bpf programs more opportunities to see
events. I guess this is by design. Alexei/Daniel, could you clarify?

This, unfortunately, has a consequence for cgroups. Currently,
cgroup-based perf event filtering (PERF_FLAG_PID_CGROUP) won't work
for bpf programs attached to kprobe/uprobe/tracepoint/trace_syscall.
The reason is the same: bpf programs see more events than what perf
has configured. The overflow-interrupt (nmi) based perf_event bpf
programs are not impacted.

Any suggestions on the proper way to move forward?

One way is to start honoring only the events permitted by perf, like
this patch does. But I am not sure whether this contradicts the
original motivation for bpf programs to see all events regardless.

Another way is to do the filtering inside the bpf program. We already
have the bpf_get_cgroup_id() helper. We may need another helper to
check whether current is in (a descendant of) another cgroup, as it is
often the case that one wants to trace all processes under one parent
cgroup which may have many child cgroups. A rough sketch of such
in-program filtering is appended at the end of this mail.

>
> Signed-off-by: Jinshan Xiong <jinshan.xiong@xxxxxxxxx>
> ---
>  kernel/trace/trace_kprobe.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index 9d483ad9bb6c..40ef0f1945f7 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -1171,11 +1171,18 @@ static int
>  kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
>  {
>          struct trace_event_call *call = trace_probe_event_call(&tk->tp);
> +        struct hlist_head *head = this_cpu_ptr(call->perf_events);
>          struct kprobe_trace_entry_head *entry;
> -        struct hlist_head *head;
>          int size, __size, dsize;
>          int rctx;
>
> +        /*
> +         * If head is empty, the process currently running on this cpu is not
> +         * interested in the kprobe perf PMU.
> +         */
> +        if (hlist_empty(head))
> +                return 0;
> +
>          if (bpf_prog_array_valid(call)) {
>                  unsigned long orig_ip = instruction_pointer(regs);
>                  int ret;
> @@ -1193,10 +1200,6 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
>                  return 0;
>          }
>
> -        head = this_cpu_ptr(call->perf_events);
> -        if (hlist_empty(head))
> -                return 0;
> -
>          dsize = __get_data_size(&tk->tp, regs);
>          __size = sizeof(*entry) + tk->tp.size + dsize;
>          size = ALIGN(__size + sizeof(u32), sizeof(u64));
>
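
For reference, here is a minimal sketch of the in-program filtering
idea, using the bpf_get_current_cgroup_id() helper and the legacy
"struct bpf_map_def" map style. The map name, probed function and
loader protocol are all made up for illustration: userspace is assumed
to stat(2) the cgroup v2 directory of interest and store its st_ino
into slot 0 of target_cgid_map before attaching. Note this matches the
exact cgroup only, which is why a descendant-check helper would still
be needed for tracing a whole hierarchy.

#include <linux/bpf.h>
#include <linux/ptrace.h>
#include "bpf_helpers.h"        /* from tools/testing/selftests/bpf */

/* filled by the (hypothetical) userspace loader with the cgroup id */
struct bpf_map_def SEC("maps") target_cgid_map = {
        .type        = BPF_MAP_TYPE_ARRAY,
        .key_size    = sizeof(__u32),
        .value_size  = sizeof(__u64),
        .max_entries = 1,
};

SEC("kprobe/do_sys_open")
int trace_do_sys_open(struct pt_regs *ctx)
{
        __u32 key = 0;
        __u64 *target;

        target = bpf_map_lookup_elem(&target_cgid_map, &key);

        /* bail out unless current runs in the one cgroup of interest */
        if (!target || !*target ||
            bpf_get_current_cgroup_id() != *target)
                return 0;

        /* ... actual event collection would go here ... */
        return 0;
}

char _license[] SEC("license") = "GPL";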