On Thu, Jun 13, 2019 at 5:52 PM Matt Mullins <mmullins@xxxxxx> wrote:
>
> On Fri, 2019-06-14 at 00:47 +0200, Daniel Borkmann wrote:
> > On 06/12/2019 07:00 AM, Andrii Nakryiko wrote:
> > > On Tue, Jun 11, 2019 at 8:48 PM Matt Mullins <mmullins@xxxxxx> wrote:
> > > >
> > > > BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
> > > > they do not increment bpf_prog_active while executing.
> > > >
> > > > This enables three levels of nesting, to support
> > > >   - a kprobe or raw tp or perf event,
> > > >   - another one of the above that irq context happens to call, and
> > > >   - another one in nmi context
> > > > (at most one of which may be a kprobe or perf event).
> > > >
> > > > Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data")
> > Generally, looks good to me. Two things below:
> >
> > Nit, for stable, shouldn't fixes tag be c4f6699dfcb8 ("bpf: introduce BPF_RAW_TRACEPOINT")
> > instead of the one you currently have?
>
> Ah, yeah, that's probably more reasonable; I haven't managed to come up
> with a scenario where one could hit this without raw tracepoints. I'll
> fix up the nits that've accumulated since v2.
>
> > One more question / clarification: we have __bpf_trace_run() vs trace_call_bpf().
> >
> > Only raw tracepoints can be nested since the rest has the bpf_prog_active per-CPU
> > counter via trace_call_bpf() and would bail out otherwise, iiuc. And raw ones use
> > the __bpf_trace_run() added in c4f6699dfcb8 ("bpf: introduce BPF_RAW_TRACEPOINT").
> >
> > 1) I tried to recall and find a rationale for mentioned trace_call_bpf() split in
> > the c4f6699dfcb8 log, but couldn't find any. Is the raison d'être purely because of
> > performance overhead (and desire to not miss events as a result of nesting)? (This
> > also means we're not protected by bpf_prog_active in all the map ops, of course.)
> > 2) Wouldn't this also mean that we only need to fix the raw tp programs via
> > get_bpf_raw_tp_regs() / put_bpf_raw_tp_regs() and won't need this duplication for
> > the rest which relies upon trace_call_bpf()? I'm probably missing something, but
> > given they have separate pt_regs there, how could they be affected then?
>
> For the pt_regs, you're correct: I only used get/put_raw_tp_regs for
> the _raw_tp variants. However, consider the following nesting:
>
>                                           trace_nest_level  raw_tp_nest_level
>   (kprobe) bpf_perf_event_output                 1                  0
>   (raw_tp) bpf_perf_event_output_raw_tp          2                  1
>   (raw_tp) bpf_get_stackid_raw_tp                2                  2
>
> I need to increment a nest level (and ideally increment it only once)
> between the kprobe and the first raw_tp, because they would otherwise
> share the struct perf_sample_data. But I also need to increment a nest
> level between the two raw_tps, since they share the pt_regs -- I can't
> use trace_nest_level for everything because it's not used by
> get_stackid, and I can't use raw_tp_nest_level for everything because
> it's not incremented by kprobes.
>
> If raw tracepoints were to bump bpf_prog_active, then I could get away
> with just using that count in these callsites -- I'm reluctant to do
> that, though, since it would prevent kprobes from ever running inside a
> raw_tp. I'd like to retain the ability to (e.g.)
>     trace.py -K htab_map_update_elem
> and get some stack traces from at least within raw tracepoints.
>
> That said, as I wrote up this example, bpf_trace_nest_level seems to be
> wildly misnamed; I should name those after the structure they're
> protecting...

I still don't get what's wrong with the previous approach.
Didn't I manage to convince both of you that perf_sample_data
inside bpf_perf_event_array doesn't have any issue that Daniel brought up?
I think this refcnting approach is inferior.
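
For reference, the nest-level scheme discussed in the quoted text could be sketched roughly as below. This is a userspace illustration only, not the code from the patch: struct sample_buf, struct cpu_state, get_buf() and put_buf() are hypothetical names, a plain struct stands in for per-CPU data, and the three slots simply mirror the "three levels of nesting" from the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NEST 3	/* task context, irq context, nmi context */

/* Hypothetical stand-in for a scratch buffer such as struct perf_sample_data. */
struct sample_buf {
	int in_use;
};

/* Models the state one CPU would own; a real kernel version would be per-CPU. */
struct cpu_state {
	struct sample_buf bufs[MAX_NEST];
	int nest_level;
};

/* Claim the buffer for the current nesting depth, or NULL if nested too deep. */
static struct sample_buf *get_buf(struct cpu_state *cpu)
{
	int level = ++cpu->nest_level;

	if (level > MAX_NEST) {
		/* Undo the increment so a later put_buf() stays balanced. */
		cpu->nest_level--;
		return NULL;
	}
	cpu->bufs[level - 1].in_use = 1;
	return &cpu->bufs[level - 1];
}

/* Release the buffer claimed by the innermost successful get_buf(). */
static void put_buf(struct cpu_state *cpu)
{
	assert(cpu->nest_level > 0);
	cpu->bufs[cpu->nest_level - 1].in_use = 0;
	cpu->nest_level--;
}
```

Each nested caller gets its own slot, so an inner program cannot clobber the buffer an outer one is still filling; a fourth level is refused rather than sharing a slot. The sketch relies on strictly nested get/put pairs on one CPU, which is why a simple counter is enough.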