On Mon, Jul 15, 2024 at 8:04 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Mon, Jul 15, 2024 at 07:33:57AM -0700, Kyle Huey wrote:
> > On Mon, Jul 15, 2024 at 4:12 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> > > Urgh, so wth does event_is_tracing do with event->prog? And can't we
> > > clean this up?
> >
> > Tracing events keep track of the bpf program in event->prog solely for
> > cleanup. The bpf programs are stored in and invoked from
> > event->tp_event->prog_array, but when the event is destroyed it needs
> > to know which bpf program to remove from that array.
>
> Yeah, figured it out eventually.. Does look like it needs event->prog
> and we can't easily remedy this dual use :/
> > > That whole perf_event_is_tracing() is a pretty gross function.
>
> > > Also, I think the default return value of bpf_overflow_handler() is
> > > wrong -- note how if !event->prog we won't call bpf_overflow_handler(),
> > > but if we do call it, but then have !event->prog on the re-read, we
> > > still return 0.
> >
> > The synchronization model here isn't quite clear to me but I don't
> > think this matters in practice. Once event->prog is set the only
> > allowed change is for it to be cleared when the perf event is freed.
> > Anything else is refused by perf_event_set_bpf_handler() with EEXIST.
> > Can that free race with an overflow handler? I'm not sure, but even if
> > it can, dropping an overflow for an event that's being freed seems
> > fine to me. If it can't race then we could remove the condition on the
> > re-read entirely.
>
> Right, also rcu_read_lock() is cheap enough to unconditionally do I'm
> thinking.
>
> So since we have two distinct users of event->prog, I figured we could
> distinguish them from one of the LSB in the pointer value, which then
> got me the below.
>
> But now that I see the end result I'm not at all sure this is sane.
>
> But I figure it ought to work...

I think this would probably work, but stealing the bit seems far more
complicated than just gating on perf_event_is_tracing(). Would it
assuage your concerns at all if I made event->prog a simple union
between say handler_prog and sample_prog (still discriminated by
perf_event_is_tracing() where necessary), with appropriate comments,
and changed the two code paths accordingly?
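
Roughly what I'm picturing, purely as an untested sketch (names as above,
nothing like this exists in the tree today; the comments are just my reading
of the two uses):

        /* in struct perf_event, replacing the bare 'struct bpf_prog *prog' */
        union {
                /* overflow handler, installed via perf_event_set_bpf_handler() */
                struct bpf_prog         *handler_prog;
                /* tracing program; it also lives in tp_event->prog_array and is
                 * kept here only so teardown knows which program to remove
                 * from that array */
                struct bpf_prog         *sample_prog;
        };      /* which member is live is decided by perf_event_is_tracing() */
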
- Kyle

> ---
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index ab6c4c942f79..5ec78346c2a1 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -9594,6 +9594,13 @@ static inline bool sample_is_allowed(struct perf_event *event, struct pt_regs *r
>  }
>
>  #ifdef CONFIG_BPF_SYSCALL
> +
> +static inline struct bpf_prog *event_prog(struct perf_event *event)
> +{
> +        unsigned long _prog = (unsigned long)READ_ONCE(event->prog);
> +        return (void *)(_prog & ~1);
> +}
> +
>  static int bpf_overflow_handler(struct perf_event *event,
>                                  struct perf_sample_data *data,
>                                  struct pt_regs *regs)
> @@ -9603,19 +9610,21 @@ static int bpf_overflow_handler(struct perf_event *event,
>                  .event = event,
>          };
>          struct bpf_prog *prog;
> -        int ret = 0;
> +        int ret = 1;
> +
> +        guard(rcu)();
>
> -        ctx.regs = perf_arch_bpf_user_pt_regs(regs);
> -        if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
> -                goto out;
> -        rcu_read_lock();
>          prog = READ_ONCE(event->prog);
> -        if (prog) {
> +        if (!((unsigned long)prog & 1))
> +                return ret;
> +
> +        prog = (void *)((unsigned long)prog & ~1);
> +
> +        if (unlikely(__this_cpu_inc_return(bpf_prog_active) == 1)) {
>                  perf_prepare_sample(data, event, regs);
> +                ctx.regs = perf_arch_bpf_user_pt_regs(regs);
>                  ret = bpf_prog_run(prog, &ctx);
>          }
> -        rcu_read_unlock();
> -out:
>          __this_cpu_dec(bpf_prog_active);
>
>          return ret;
> @@ -9652,14 +9661,14 @@ static inline int perf_event_set_bpf_handler(struct perf_event *event,
>                  return -EPROTO;
>          }
>
> -        event->prog = prog;
> +        event->prog = (void *)((unsigned long)prog | 1);
>          event->bpf_cookie = bpf_cookie;
>          return 0;
>  }
>
>  static inline void perf_event_free_bpf_handler(struct perf_event *event)
>  {
> -        struct bpf_prog *prog = event->prog;
> +        struct bpf_prog *prog = event_prog(event);
>
>          if (!prog)
>                  return;
> @@ -9707,7 +9716,7 @@ static int __perf_event_overflow(struct perf_event *event,
>
>          ret = __perf_event_account_interrupt(event, throttle);
>
> -        if (event->prog && !bpf_overflow_handler(event, data, regs))
> +        if (!bpf_overflow_handler(event, data, regs))
>                  return ret;
>
>          /*
> @@ -12026,10 +12035,10 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>                  context = parent_event->overflow_handler_context;
>  #if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_EVENT_TRACING)
>                  if (parent_event->prog) {
> -                        struct bpf_prog *prog = parent_event->prog;
> +                        struct bpf_prog *prog = event_prog(parent_event);
>
>                          bpf_prog_inc(prog);
> -                        event->prog = prog;
> +                        event->prog = parent_event->prog;
>                  }
>  #endif
>          }