Re: [PATCH v3 bpf-next 1/4] tracing/probe: Add PERF_EVENT_IOC_QUERY_PROBE ioctl

Yonghong Song <yhs@xxxxxx> · Wed, 21 Aug 2019 18:43:49 +0000

On 8/21/19 11:31 AM, Peter Zijlstra wrote:
> On Wed, Aug 21, 2019 at 04:54:47PM +0000, Yonghong Song wrote:
>> Currently, in kernel/trace/bpf_trace.c, we have
>>
>> unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
>> {
>>           unsigned int ret;
>>
>>           if (in_nmi()) /* not supported yet */
>>                   return 1;
>>
>>           preempt_disable();
>>
>>           if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
> 
> Yes, I'm aware of that.
> 
>> In the above, events with a bpf program attached will be missed
>> if the context is an NMI, or if recursion happens, whether with
>> the same or a different bpf program.
>> In case of recursion, the events will not be sent to the ring buffer.
> 
> And while that is significantly worse than what ftrace/perf have, it is
> fundamentally the same thing.
> 
> perf allows (and iirc ftrace does too) 4 nested contexts per CPU
> (task, softirq, irq, nmi), but on any recursion within one of those
> contexts we drop stuff.
> 
> The BPF stuff is just more eager to drop things on the floor, but it is
> fundamentally the same.
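
To make sure we are talking about the same thing, below is a rough
userspace model of the two drop policies (not kernel code). The
"struct counters" fields and the *_enter/*_exit/nest_* names are made up
for illustration, and a single "CPU" is assumed; the nhit/nmissed
accounting is hypothetical, since trace_call_bpf() does not count
anything today.

```c
#include <assert.h>

/* Userspace model only; none of these names are kernel APIs. */
enum ctx { CTX_TASK, CTX_SOFTIRQ, CTX_IRQ, CTX_NMI, CTX_MAX };

/* Hypothetical per-event accounting. */
struct counters { unsigned long nhit, nmissed; };

/* BPF-style guard: one "active" counter for the CPU, NMI always dropped. */
static int bpf_active;

static int bpf_enter(enum ctx c, struct counters *st)
{
	if (c == CTX_NMI) {		/* models the in_nmi() early return */
		st->nmissed++;
		return 0;
	}
	if (++bpf_active != 1) {	/* something is already running: drop */
		bpf_active--;
		st->nmissed++;
		return 0;
	}
	st->nhit++;
	return 1;
}

static void bpf_exit(void)
{
	assert(bpf_active > 0);
	bpf_active--;
}

/* perf-style guard: one recursion slot per context level. */
static int perf_recursion[CTX_MAX];

static int perf_enter(enum ctx c, struct counters *st)
{
	if (perf_recursion[c]) {	/* recursion within the same context */
		st->nmissed++;
		return 0;
	}
	perf_recursion[c]++;
	st->nhit++;
	return 1;
}

static void perf_exit(enum ctx c)
{
	assert(perf_recursion[c] > 0);
	perf_recursion[c]--;
}

/* Each event in seq[] arrives while the previous one is still in flight. */
static void nest_bpf(const enum ctx *seq, int n, struct counters *st)
{
	if (n == 0)
		return;
	int entered = bpf_enter(seq[0], st);
	nest_bpf(seq + 1, n - 1, st);
	if (entered)
		bpf_exit();
}

static void nest_perf(const enum ctx *seq, int n, struct counters *st)
{
	if (n == 0)
		return;
	int entered = perf_enter(seq[0], st);
	nest_perf(seq + 1, n - 1, st);
	if (entered)
		perf_exit(seq[0]);
}
```

For a nested sequence task, irq, nmi, nmi, this model gives 1 hit and
3 misses under the bpf-style guard, versus 3 hits and 1 miss under the
per-context guard. Same mechanism, the bpf one just drops more.
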
> 
>> A lot of bpf-based tracing programs use maps to communicate and
>> do not allocate a ring buffer at all.
> 
> So extending PERF_RECORD_LOST doesn't work. But PERF_FORMAT_LOST might
> still work fine; but you get to implement it for all software events.

Could you give more specifics about PERF_FORMAT_LOST? Googling
"PERF_FORMAT_LOST" only yields the two emails in this thread :-(

> 
>> Maybe we can still use an ioctl-based approach, which is lightweight
>> compared to the ring buffer approach? If an fd has bpf attached, nhit/nmisses
>> tell whether or not the kprobe was processed by the bpf program.
> 
> There is nothing kprobe specific here. Kprobes just appear to be the
> only one actually accounting the recursion cases, but everyone has
> them.

Sorry, to be specific: kprobe is just an example. I am actually referring
to any perf event that bpf can attach to, which in theory is any perf
event that can be opened with the perf_event_open() syscall, although
some of them (e.g., software events?) may not have hooks for running bpf
programs yet.

> 
>> Currently, for debugfs, the nhit/nmisses info is exposed at
>> {k|u}probe_profile. Alternatively, we could expose the nhit/nmisses
>> in /proc/self/fdinfo/<fd>. Users can query this interface to
>> get the numbers.
> 
> No, we're not adding stuff to procfs for this.

No problem. Just a suggestion.