Re: [PATCH 1/2] bpf: Disable preemption in bpf_perf_event_output

On Tue, Jul 18, 2023 at 05:59:53PM -0700, Alexei Starovoitov wrote:
> On Mon, Jul 17, 2023 at 4:17 AM Jiri Olsa <jolsa@xxxxxxxxxx> wrote:
> >
> > The nesting protection in bpf_perf_event_output relies on disabled
> > preemption, which is guaranteed for kprobes and tracepoints.
> 
> I don't understand why you came up with such a conclusion.
> bpf_perf_event_output needs migration disabled and doesn't mind
> being preempted.
> That's what the nested counter is for.
> 
> Stack trace also doesn't look like it's related to that.
> More like stack corruption in perf_output_sample.

hum, I think that with preemption enabled the following scenario is
possible, where in the end 2 tasks on the same cpu can end up sharing
the same pointer to struct perf_sample_data:


        task-1
        --------------------------------------------------------
        uprobe hit

        uprobe_dispatcher
        {
          __uprobe_perf_func
            bpf_prog_run_array_sleepable
            {
              might_fault
              rcu_read_lock_trace
              migrate_disable
              rcu_read_lock

              bpf_prog
                ...
                bpf_perf_event_output
                {
                  nest_level = bpf_trace_nest_level = 1
                  sd = &sds->sds[0];

        -> preempted by task-2


                                                                        task-2
                                                                        --------------------------------------------------------
                                                                        uprobe hit

                                                                        uprobe_dispatcher
                                                                          __uprobe_perf_func
                                                                            bpf_prog_run_array_sleepable

                                                                              might_fault
                                                                              rcu_read_lock_trace
                                                                              migrate_disable
                                                                              rcu_read_lock

                                                                              bpf_prog
                                                                                ...
                                                                                bpf_perf_event_output
                                                                                {
                                                                                  nest_level = bpf_trace_nest_level = 2
                                                                                  sd = &sds->sds[1];

                                                                        -> preempted by task-1



                  __bpf_perf_event_output(regs, map, flags, sd);
                    perf_output_sample(data)

                  bpf_trace_nest_level = 1

                } /* bpf_perf_event_output */

              rcu_read_unlock
              migrate_enable
              rcu_read_unlock_trace

            } /* bpf_prog_run_array_sleepable */
        } /* uprobe_dispatcher */

        uprobe hit

        uprobe_dispatcher
        {
          __uprobe_perf_func
            bpf_prog_run_array_sleepable
            {
              might_fault
              rcu_read_lock_trace
              migrate_disable
              rcu_read_lock

              bpf_prog
                ...
                bpf_perf_event_output
                {
                  nest_level = bpf_trace_nest_level = 2
                  sd = &sds->sds[1];


now task-1 and task-2 share the same bpf_trace_nest_level value and the
same 'struct perf_sample_data' buffer on top of &sds->sds[1]
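
to make the pattern explicit, here's the relevant part of
bpf_perf_event_output reduced to a sketch (simplified, not the exact
kernel code), with the race window marked:

    /* per-cpu storage: one perf_sample_data slot per nesting level */
    struct bpf_trace_sample_data {
            struct perf_sample_data sds[3];
    };
    static DEFINE_PER_CPU(struct bpf_trace_sample_data, bpf_trace_sds);
    static DEFINE_PER_CPU(int, bpf_trace_nest_level);

            /* task-1: counter goes 0 -> 1, so it takes slot sds[0] */
            nest_level = this_cpu_inc_return(bpf_trace_nest_level);
            sd = &sds->sds[nest_level - 1];

            /*
             * preemption window: task-2 runs on the same cpu, bumps the
             * counter 1 -> 2 and takes slot sds[1]; while task-2 is
             * still inside the helper, task-1 resumes, finishes and
             * drops the counter back to 1 below, so task-1's next
             * uprobe also sees 1 -> 2 and picks sds[1] -- the slot
             * task-2 is still using
             */
            err = __bpf_perf_event_output(regs, map, flags, sd);
    out:
            this_cpu_dec(bpf_trace_nest_level);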

I have not figured out the exact scenario/cause of the crash yet;
I suspect one of the tasks copies data over some boundary, but none
of the ideas I had so far matched the instructions from the crash

anyway, I thought that having 2 tasks share the same perf_sample_data
is bad enough to warrant the patch
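
fwiw, the distinction that matters here, as I understand it:

            migrate_disable();      /* task stays on this cpu, but other
                                     * tasks can still preempt it and
                                     * run the same helper on the same
                                     * per-cpu data */

            preempt_disable();      /* nothing else runs on this cpu, so
                                     * the inc/dec pair on the nest
                                     * level counter and the slot it
                                     * picks stay consistent */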

> 
> Do you have
> commit eb81a2ed4f52 ("perf/core: Fix perf_output_begin parameter is
> incorrectly invoked in perf_event_bpf_output")
> in your kernel?

yes, I just retested on latest bpf-next/master (which has that commit)
and still see the crash

jirka

> 
> > However, bpf_perf_event_output can also be called from uprobe context
> > through the bpf_prog_run_array_sleepable function, which disables
> > migration but keeps preemption enabled.
> >
> > This allows a task to be preempted by another one inside the nesting
> > protection, which eventually leads to two tasks using the same
> > perf_sample_data buffer and causes crashes like:
> >
> >   kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
> >   BUG: unable to handle page fault for address: ffffffff82be3eea
> >   ...
> >   Call Trace:
> >    ? __die+0x1f/0x70
> >    ? page_fault_oops+0x176/0x4d0
> >    ? exc_page_fault+0x132/0x230
> >    ? asm_exc_page_fault+0x22/0x30
> >    ? perf_output_sample+0x12b/0x910
> >    ? perf_event_output+0xd0/0x1d0
> >    ? bpf_perf_event_output+0x162/0x1d0
> >    ? bpf_prog_c6271286d9a4c938_krava1+0x76/0x87
> >    ? __uprobe_perf_func+0x12b/0x540
> >    ? uprobe_dispatcher+0x2c4/0x430
> >    ? uprobe_notify_resume+0x2da/0xce0
> >    ? atomic_notifier_call_chain+0x7b/0x110
> >    ? exit_to_user_mode_prepare+0x13e/0x290
> >    ? irqentry_exit_to_user_mode+0x5/0x30
> >    ? asm_exc_int3+0x35/0x40
> >
> > Fix this by disabling preemption in bpf_perf_event_output.
> >
> > Fixes: 9594dc3c7e71 ("bpf: fix nested bpf tracepoints with per-cpu data")
> > Signed-off-by: Jiri Olsa <jolsa@xxxxxxxxxx>
> > ---
> >  kernel/trace/bpf_trace.c | 11 ++++++++---
> >  1 file changed, 8 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index c92eb8c6ff08..2a6ba05d8aee 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -661,8 +661,7 @@ static DEFINE_PER_CPU(int, bpf_trace_nest_level);
> >  BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, struct bpf_map *, map,
> >            u64, flags, void *, data, u64, size)
> >  {
> > -       struct bpf_trace_sample_data *sds = this_cpu_ptr(&bpf_trace_sds);
> > -       int nest_level = this_cpu_inc_return(bpf_trace_nest_level);
> > +       struct bpf_trace_sample_data *sds;
> >         struct perf_raw_record raw = {
> >                 .frag = {
> >                         .size = size,
> > @@ -670,7 +669,12 @@ BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, struct bpf_map *, map,
> >                 },
> >         };
> >         struct perf_sample_data *sd;
> > -       int err;
> > +       int nest_level, err;
> > +
> > +       preempt_disable();
> > +
> > +       sds = this_cpu_ptr(&bpf_trace_sds);
> > +       nest_level = this_cpu_inc_return(bpf_trace_nest_level);
> >
> >         if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(sds->sds))) {
> >                 err = -EBUSY;
> > @@ -691,6 +695,7 @@ BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, struct bpf_map *, map,
> >
> >  out:
> >         this_cpu_dec(bpf_trace_nest_level);
> > +       preempt_enable();
> >         return err;
> >  }
> >
> > --
> > 2.41.0
> >
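
ps: with the patch applied the helper boils down to this skeleton
(error paths trimmed, just to show the pairing):

            preempt_disable();
            sds = this_cpu_ptr(&bpf_trace_sds);
            nest_level = this_cpu_inc_return(bpf_trace_nest_level);
            ...
            sd = &sds->sds[nest_level - 1];
            err = __bpf_perf_event_output(regs, map, flags, sd);
    out:
            this_cpu_dec(bpf_trace_nest_level);
            preempt_enable();       /* slot index was stable in between */
            return err;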



