On Tue, Jul 18, 2023 at 05:59:53PM -0700, Alexei Starovoitov wrote:
> On Mon, Jul 17, 2023 at 4:17 AM Jiri Olsa <jolsa@xxxxxxxxxx> wrote:
> >
> > The nesting protection in bpf_perf_event_output relies on disabled
> > preemption, which is guaranteed for kprobes and tracepoints.
>
> I don't understand why you came up with such a conclusion.
> bpf_perf_event_output needs migration disabled and doesn't mind
> being preempted.
> That's what the nested counter is for.
>
> Stack trace also doesn't look like it's related to that.
> More like stack corruption in perf_output_sample.

hum, I think that with preemption enabled the following scenario is
possible, where at the end 2 tasks on the same cpu can end up sharing
the same pointer to struct perf_sample_data:

  task-1
  --------------------------------------------------------
  uprobe hit
  uprobe_dispatcher {
    __uprobe_perf_func
      bpf_prog_run_array_sleepable {
        might_fault
        rcu_read_lock_trace
        migrate_disable
        rcu_read_lock
        bpf_prog
          ...
          bpf_perf_event_output {
            nest_level = bpf_trace_nest_level = 1
            sd = &sds->sds[0];

  -> preempted by task-2

  task-2
  --------------------------------------------------------
  uprobe hit
  uprobe_dispatcher
    __uprobe_perf_func
      bpf_prog_run_array_sleepable
        might_fault
        rcu_read_lock_trace
        migrate_disable
        rcu_read_lock
        bpf_prog
          ...
          bpf_perf_event_output {
            nest_level = bpf_trace_nest_level = 2
            sd = &sds->sds[1];

  -> preempted by task-1

            __bpf_perf_event_output(regs, map, flags, sd);
              perf_output_sample(data)
            bpf_trace_nest_level = 1
          } /* bpf_perf_event_output */
        rcu_read_unlock
        migrate_enable
        rcu_read_unlock_trace
      } /* bpf_prog_run_array_sleepable */
  } /* uprobe_dispatcher */

  uprobe hit
  uprobe_dispatcher {
    __uprobe_perf_func
      bpf_prog_run_array_sleepable {
        might_fault
        rcu_read_lock_trace
        migrate_disable
        rcu_read_lock
        bpf_prog
          ...
          bpf_perf_event_output {
            nest_level = bpf_trace_nest_level = 2
            sd = &sds->sds[1];

now task-1 and task-2 share the same bpf_trace_nest_level value and the
same 'struct perf_sample_data' buffer on top of &sds->sds[1]

I have not yet figured out the exact scenario/cause of the crash, I
suspect one of the tasks copies data over some boundary, but none of
the ideas I had so far matched the instructions from the crash

anyway I thought that having 2 tasks sharing the same perf_sample_data
is bad enough to send the patch

>
> Do you have
> commit eb81a2ed4f52 ("perf/core: Fix perf_output_begin parameter is
> incorrectly invoked in perf_event_bpf_output")
> in your kernel?

yes, I just retested and I see this on latest bpf-next/master

jirka

>
> > However bpf_perf_event_output can be also called from uprobes context
> > through bpf_prog_run_array_sleepable function which disables migration,
> > but keeps preemption enabled.
> >
> > This can cause task to be preempted by another one inside the nesting
> > protection and lead eventually to two tasks using same perf_sample_data
> > buffer and cause crashes like:
> >
> >   kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
> >   BUG: unable to handle page fault for address: ffffffff82be3eea
> >   ...
> >   Call Trace:
> >    ? __die+0x1f/0x70
> >    ? page_fault_oops+0x176/0x4d0
> >    ? exc_page_fault+0x132/0x230
> >    ? asm_exc_page_fault+0x22/0x30
> >    ? perf_output_sample+0x12b/0x910
> >    ? perf_event_output+0xd0/0x1d0
> >    ? bpf_perf_event_output+0x162/0x1d0
> >    ? bpf_prog_c6271286d9a4c938_krava1+0x76/0x87
> >    ? __uprobe_perf_func+0x12b/0x540
> >    ? uprobe_dispatcher+0x2c4/0x430
> >    ? uprobe_notify_resume+0x2da/0xce0
> >    ? atomic_notifier_call_chain+0x7b/0x110
> >    ? exit_to_user_mode_prepare+0x13e/0x290
> >    ? irqentry_exit_to_user_mode+0x5/0x30
> >    ? asm_exc_int3+0x35/0x40
> >
> > Fixing this by disabling preemption in bpf_perf_event_output.
> >
> > Fixes: 9594dc3c7e71 ("bpf: fix nested bpf tracepoints with per-cpu data")
> > Signed-off-by: Jiri Olsa <jolsa@xxxxxxxxxx>
> > ---
> >  kernel/trace/bpf_trace.c | 11 ++++++++---
> >  1 file changed, 8 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index c92eb8c6ff08..2a6ba05d8aee 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -661,8 +661,7 @@ static DEFINE_PER_CPU(int, bpf_trace_nest_level);
> >  BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, struct bpf_map *, map,
> >             u64, flags, void *, data, u64, size)
> >  {
> > -        struct bpf_trace_sample_data *sds = this_cpu_ptr(&bpf_trace_sds);
> > -        int nest_level = this_cpu_inc_return(bpf_trace_nest_level);
> > +        struct bpf_trace_sample_data *sds;
> >          struct perf_raw_record raw = {
> >                  .frag = {
> >                          .size = size,
> > @@ -670,7 +669,12 @@ BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, struct bpf_map *, map,
> >                  },
> >          };
> >          struct perf_sample_data *sd;
> > -        int err;
> > +        int nest_level, err;
> > +
> > +        preempt_disable();
> > +
> > +        sds = this_cpu_ptr(&bpf_trace_sds);
> > +        nest_level = this_cpu_inc_return(bpf_trace_nest_level);
> >
> >          if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(sds->sds))) {
> >                  err = -EBUSY;
> > @@ -691,6 +695,7 @@ BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, struct bpf_map *, map,
> >
> >  out:
> >          this_cpu_dec(bpf_trace_nest_level);
> > +        preempt_enable();
> >          return err;
> >  }
> >
> > --
> > 2.41.0
> >
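
PS: to spell out in code the window the scenario above describes, here is a
minimal sketch of the pre-patch slot claiming (simplified and paraphrased from
kernel/trace/bpf_trace.c, not the verbatim code; emit_sample() is just a
stand-in for the perf_event_output() call):

  /* per-cpu sample buffers and nesting counter, as in bpf_trace.c */
  struct bpf_trace_sample_data {
          struct perf_sample_data sds[3];
  };
  static DEFINE_PER_CPU(struct bpf_trace_sample_data, bpf_trace_sds);
  static DEFINE_PER_CPU(int, bpf_trace_nest_level);

  ...
          struct bpf_trace_sample_data *sds = this_cpu_ptr(&bpf_trace_sds);
          struct perf_sample_data *sd;
          int nest_level, err;

          /* sleepable (uprobe) path before the patch: only
           * migrate_disable() is in effect, preemption stays enabled */
          nest_level = this_cpu_inc_return(bpf_trace_nest_level);
          if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(sds->sds))) {
                  err = -EBUSY;
                  goto out;
          }
          sd = &sds->sds[nest_level - 1];

          /*
           * the task can be preempted here while it still owns sd; the
           * counter is decremented by whichever task finishes first, so a
           * later call on this cpu can be handed the same slot again --
           * in the scenario above both tasks end up writing through
           * &sds->sds[1]
           */
          err = emit_sample(sd);          /* stand-in for perf_event_output() */

  out:
          this_cpu_dec(bpf_trace_nest_level);
          return err;

with preempt_disable()/preempt_enable() around the whole sequence (as in the
patch above) the inc/claim/use/dec steps can no longer interleave between
tasks on the same cpu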