Re: [PATCH v3 bpf-next 1/4] tracing/probe: Add PERF_EVENT_IOC_QUERY_PROBE ioctl

Yonghong Song <yhs@xxxxxx> · Wed, 21 Aug 2019 16:54:47 +0000

On 8/21/19 4:08 AM, Peter Zijlstra wrote:
> On Tue, Aug 20, 2019 at 10:58:47AM -0700, Daniel Xu wrote:
>> Hi Peter,
>>
>> On Tue Aug 20, 2019 at 4:45 PM Peter Zijlstra wrote:
>>> On Fri, Aug 16, 2019 at 03:31:46PM -0700, Daniel Xu wrote:
>>>> It's useful to know [uk]probe's nmissed and nhit stats. For example with
>>>> tracing tools, it's important to know when events may have been lost.
>>>> debugfs currently exposes a control file to get this information, but
>>>> it is not compatible with probes registered with the perf API.
>>>
>>> What is this nmissed and nhit stuff?
>>
>> nmissed is the number of times the probe's handler should have been run
>> but didn't. nhit is the number of times the probes handler has run. I've
>> documented this information in the uapi header. If you'd like, I can put
>> it in the commit message too.
> 
> That comment just says: 'number of times this probe was temporarily
> disabled', which says exactly nothing.
> 
> But reading the kprobe code seems to suggest this happens on recursive
> kprobes, which I'm thinking is a dodgy situation in the first place.
> 
> ftrace and perf in general don't keep counts of events lost due to
> recursion, so why should we do this for kprobes? Also, while you write
> to support uprobes, it doesn't actually suffer from this (it cannot,
> uprobes cannot recurse), so supporting it makes no sense.
> 
> And with that, the name QUERY_PROBE also makes no sense, because it is
> not specific to [uk]probes, all software events suffer this.
> 
> And I'm not sure an additional ioctl() is the right way, supposing we
> want to expose this at all. You've mentioned no alternative approached,
> I'm thinking PERF_FORMAT_LOST might be possible, or maybe a
> PERF_RECORD_LOST extention.

Things get more complicated when bpf program is executing to replace
ring buffer output.

Currently, in kernel/trace/bpf_trace.c, we have

unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
{
         unsigned int ret;

         if (in_nmi()) /* not supported yet */
                 return 1;

         preempt_disable();

         if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
                 /*
                  * since some bpf program is already running on this cpu,
                  * don't call into another bpf program (same or different)
                  * and don't send kprobe event into ring-buffer,
                  * so return zero here
                  */
                 ret = 0;
                 goto out;
         }
.....

In the above, the events with bpf program attached will be missed
if the context is nmi interrupt, or if some recursion happens even with 
the same or different bpf programs.
In case of recursion, the events will not be sent to ring buffer.

A lot of bpf-based tracing programs uses maps to communicate and
do not allocate ring buffer at all.

Maybe we can still use ioctl based approach which is light weighted
compared to ring buffer approach? If a fd has bpf attached, nhit/nmisses
means the kprobe is processed by bpf program or not.

Currently, for debugfs, the nhit/nmisses info is exposed at
{k|u}probe_profile. Alternative, we could expose the nhit/nmisses
in /proc/self/fdinfo/<fd>. User can query this interface to
get numbers.

Arnaldo has a question on bcc mailing list about the hit/miss
counting of bpf program missed to process events.

https://lists.iovisor.org/g/iovisor-dev/message/1783

Comments?

> 
> Of course, then you get to implement it for tracepoints and software
> events too.
>