On 8/21/19 11:31 AM, Peter Zijlstra wrote:
> On Wed, Aug 21, 2019 at 04:54:47PM +0000, Yonghong Song wrote:
>> Currently, in kernel/trace/bpf_trace.c, we have
>>
>> unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
>> {
>>         unsigned int ret;
>>
>>         if (in_nmi()) /* not supported yet */
>>                 return 1;
>>
>>         preempt_disable();
>>
>>         if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
>
> Yes, I'm aware of that.
>
>> In the above, the events with bpf program attached will be missed
>> if the context is nmi interrupt, or if some recursion happens even with
>> the same or different bpf programs.
>> In case of recursion, the events will not be sent to ring buffer.
>
> And while that is significantly worse than what ftrace/perf have, it is
> fundamentally the same thing.
>
> perf allows (and iirc ftrace does too) 4 nested context per CPU
> (task,softirq,irq,nmi) but any recursion within those context and we
> drop stuff.
>
> The BPF stuff is just more eager to drop things on the floor, but it is
> fundamentally the same.
>
>> A lot of bpf-based tracing programs uses maps to communicate and
>> do not allocate ring buffer at all.
>
> So extending PERF_RECORD_LOST doesn't work. But PERF_FORMAT_LOST might
> still work fine; but you get to implement it for all software events.

Could you give more specifics about PERF_FORMAT_LOST? Googling
"PERF_FORMAT_LOST" only yields two emails which we are discussing
here :-(

>
>> Maybe we can still use ioctl based approach which is light weighted
>> compared to ring buffer approach? If a fd has bpf attached, nhit/nmisses
>> means the kprobe is processed by bpf program or not.
>
> There is nothing kprobe specific here. Kprobes just appear to be the
> only one actually accounting the recursion cases, but everyone has
> them.

Sorry, to be specific: kprobe is just an example. I am actually
referring to any perf event that bpf can attach to, which in theory is
any perf event that can be opened with the "perf_event_open" syscall,
although some of them (e.g., software events?) may not have bpf running
hooks yet.

>
>> Currently, for debugfs, the nhit/nmisses info is exposed at
>> {k|u}probe_profile. Alternative, we could expose the nhit/nmisses
>> in /proc/self/fdinfo/<fd>. User can query this interface to
>> get numbers.
>
> No, we're not adding stuff to procfs for this.

No problem. Just a suggestion.
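
To make the nhit/nmisses accounting concrete, here is a minimal
userspace sketch of the recursion guard discussed above. It is only a
model, not kernel code: `probe_stats`, `model_trace_call` and
`recursing_prog` are hypothetical names, and a plain global stands in
for the per-CPU `bpf_prog_active` counter. The in_nmi() check and
preemption handling are left out.

```c
/* Hypothetical per-event counters, in the spirit of nhit/nmisses. */
struct probe_stats {
	unsigned long nhit;    /* the attached program actually ran */
	unsigned long nmissed; /* event dropped by the recursion guard */
};

/* Assumption: a single global stands in for per-CPU bpf_prog_active. */
static int prog_active;

typedef void (*prog_fn)(struct probe_stats *stats, int depth);

/*
 * Same shape as the guard in trace_call_bpf(): only the outermost
 * invocation runs the program; any nested invocation is counted as
 * missed and returns 0 without running anything.
 */
static unsigned int model_trace_call(struct probe_stats *stats,
				     prog_fn prog, int depth)
{
	unsigned int ret = 0;

	if (++prog_active != 1) {
		stats->nmissed++;
		goto out;
	}

	stats->nhit++;
	if (prog)
		prog(stats, depth);
	ret = 1;
out:
	prog_active--;
	return ret;
}

/* A program body that re-triggers the same event `depth` times. */
static void recursing_prog(struct probe_stats *stats, int depth)
{
	for (int i = 0; i < depth; i++)
		model_trace_call(stats, recursing_prog, depth);
}
```

Calling model_trace_call(&stats, recursing_prog, 3) on zeroed counters
records one hit for the outer call and one miss per nested attempt,
which is the kind of number a user would want to read back for an fd
with bpf attached.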