On Mon, Jul 17, 2023 at 4:18 AM Jiri Olsa <jolsa@xxxxxxxxxx> wrote:
>
> We received report [1] of kernel crash, which is caused by
> using nesting protection without disabled preemption.

Same question.
I don't see why it would be preemption related.

> The bpf_event_output can be called by programs executed by
> bpf_prog_run_array_cg function that disabled migration but
> keeps preemption enabled.
>
> This can cause task to be preempted by another one inside the
> nesting protection and lead eventually to two tasks using same
> perf_sample_data buffer and cause crashes like:
>
>   BUG: kernel NULL pointer dereference, address: 0000000000000001
>   #PF: supervisor instruction fetch in kernel mode
>   #PF: error_code(0x0010) - not-present page
>   ...
>   ? perf_output_sample+0x12a/0x9a0
>   ? finish_task_switch.isra.0+0x81/0x280
>   ? perf_event_output+0x66/0xa0
>   ? bpf_event_output+0x13a/0x190
>   ? bpf_event_output_data+0x22/0x40
>   ? bpf_prog_dfc84bbde731b257_cil_sock4_connect+0x40a/0xacb
>   ? xa_load+0x87/0xe0
>   ? __cgroup_bpf_run_filter_sock_addr+0xc1/0x1a0
>   ? release_sock+0x3e/0x90
>   ? sk_setsockopt+0x1a1/0x12f0
>   ? udp_pre_connect+0x36/0x50
>   ? inet_dgram_connect+0x93/0xa0
>   ? __sys_connect+0xb4/0xe0
>   ? udp_setsockopt+0x27/0x40
>   ? __pfx_udp_push_pending_frames+0x10/0x10
>   ? __sys_setsockopt+0xdf/0x1a0
>   ? __x64_sys_connect+0xf/0x20
>   ? do_syscall_64+0x3a/0x90
>   ? entry_SYSCALL_64_after_hwframe+0x72/0xdc
>
> Fixing this by disabling preemption in bpf_event_output.
>
> [1] https://github.com/cilium/cilium/issues/26756
>
> Reported-by: Oleg "livelace" Popov <o.popov@xxxxxxxxxxx>
> Fixes: 768fb61fcc13 ("bpf: Fix bpf_event_output re-entry issue")
> Signed-off-by: Jiri Olsa <jolsa@xxxxxxxxxx>
> ---
>  kernel/trace/bpf_trace.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 2a6ba05d8aee..36fb6e483952 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -720,7 +720,6 @@ static DEFINE_PER_CPU(struct bpf_trace_sample_data, bpf_misc_sds);
>  u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
>  		     void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy)
>  {
> -	int nest_level = this_cpu_inc_return(bpf_event_output_nest_level);
>  	struct perf_raw_frag frag = {
>  		.copy	= ctx_copy,
>  		.size	= ctx_size,
> @@ -737,8 +736,13 @@ u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
>  	};
>  	struct perf_sample_data *sd;
>  	struct pt_regs *regs;
> +	int nest_level;
>  	u64 ret;
>
> +	preempt_disable();
> +
> +	nest_level = this_cpu_inc_return(bpf_event_output_nest_level);
> +
>  	if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(bpf_misc_sds.sds))) {
>  		ret = -EBUSY;
>  		goto out;
> @@ -753,6 +757,7 @@ u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
>  	ret = __bpf_perf_event_output(regs, map, flags, sd);
>  out:
>  	this_cpu_dec(bpf_event_output_nest_level);
> +	preempt_enable();
>  	return ret;
>  }
>
> --
> 2.41.0
>