On Sun, Nov 6, 2022 at 5:48 PM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
>
> On Mon, Nov 07, 2022 at 06:01:44AM IST, Alexei Starovoitov wrote:
> > On Sun, Nov 6, 2022 at 1:44 PM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
> > >
> > > On Mon, Nov 07, 2022 at 02:50:08AM IST, Alexei Starovoitov wrote:
> > > > On Sat, Nov 5, 2022 at 6:52 PM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
> > > > >
> > > > > Currently, unlike other tracing program types, BPF_PROG_TYPE_TRACING is
> > > > > excluded from the is_tracing_prog_type checks. This means that such
> > > > > programs can use maps containing bpf_spin_lock, bpf_timer, etc. without
> > > > > verification failure.
> > > > >
> > > > > However, allowing fentry/fexit programs to use maps that have such a
> > > > > bpf_timer in the map value can lead to deadlock.
> > > > >
> > > > > Suppose that an fentry program is attached to bpf_prog_put, and a TC
> > > > > program executes and does bpf_map_update_elem on an array map that both
> > > > > progs share. If the fentry program calls bpf_map_update_elem for the
> > > > > same key, it will try to acquire the same lock from within the
> > > > > critical section protecting the timer.
> > > > >
> > > > > The call chain is:
> > > > >
> > > > > bpf_prog_test_run_opts() // TC
> > > > >   bpf_prog_TC
> > > > >     bpf_map_update_elem(array_map, key=0)
> > > > >       bpf_obj_free_fields
> > > > >         bpf_timer_cancel_and_free
> > > > >           __bpf_spin_lock_irqsave
> > > > >           drop_prog_refcnt
> > > > >             bpf_prog_put
> > > > >               bpf_prog_FENTRY // FENTRY
> > > > >                 bpf_map_update_elem(array_map, key=0)
> > > > >                   bpf_obj_free_fields
> > > > >                     bpf_timer_cancel_and_free
> > > > >                       __bpf_spin_lock_irqsave // DEADLOCK
> > > > >
> > > > > The BPF_TRACE_ITER attach type can be excluded because it always
> > > > > executes in process context.
> > > > >
> > > > > Update selftests using bpf_timer in fentry to TC, as they will be
> > > > > broken by this change.
> > > > which is an obvious red flag and the reason why we cannot do
> > > > this change.
> > > > This specific issue could be addressed with the addition of
> > > > notrace in drop_prog_refcnt, bpf_prog_put, __bpf_prog_put.
> > > > Other calls from __bpf_prog_put can stay as-is,
> > > > since they won't be called in this convoluted case.
> > > > I frankly don't get why you're spending time digging up such
> > > > odd corner cases that no one can hit in real use.
> > >
> > > I was trying to figure out whether bpf_list_head_free would be safe to
> > > call all the time in map updates from bpf_obj_free_fields, since it
> > > takes the very same spin lock that a BPF program can also take to
> > > update the list.
> > >
> > > Map update ops are not allowed in the critical section, so this
> > > particular kind of recursive map update call should not be possible.
> > > perf event is already prevented using is_tracing_prog_type, so an NMI
> > > prog cannot interrupt and update the same map.
> > >
> > > But then I went looking at whether it was a problem elsewhere...
> > >
> > > FWIW, I have updated my patch to do:
> > >
> > > 	if (btf_record_has_field(map->record, BPF_LIST_HEAD)) {
> > > 		if (is_tracing_prog_type(prog_type) ||
> > > 		    (prog_type == BPF_PROG_TYPE_TRACING &&
> > > 		     env->prog->expected_attach_type != BPF_TRACE_ITER)) {
> > > 			verbose(env, "tracing progs cannot use bpf_list_head yet\n");
> > > 			return -EINVAL;
> > > 		}
> > > 	}
> >
> > That is a severe limitation.
> > Why can't you use the notrace approach?
>
> Yes, notrace is indeed an option, but the problem is that everything within
> that critical section needs to be notrace. bpf_list_head_free also ends up
> calling bpf_obj_free_fields again (the level of recursion however won't
> exceed 3, since we clamp list_head -> list_head nesting at 2 levels).
> So the notrace needs to be applied to everything within it, which is not a
> problem now. It can be done.

Let's do it then.

> BPF_TIMER and BPF_KPTR_REF (the indirect call to the dtor) won't be
> triggered, so it probably just needs to be bpf_list_head_free and
> bpf_obj_free_fields.
>
> But it can break silently in the future, if e.g. kptr is allowed. Same for
> drop_prog_refcnt if something changes. Every change to anything they call
> (and everything called by the functions they call) needs to keep the
> restriction in mind.

Only the funcs that can realistically be called. Not every function that is
in there.

> I was wondering if, in both cases of bpf_timer and bpf_list_head, we can
> simply swap out the object locally while holding the lock, and then do
> everything outside the spin lock.

Maybe, but I wouldn't bother optimizing for convoluted cases.
Use notrace when you can.
Everything that bpf_mem_alloc is doing is notrace for the same reason.

> For bpf_timer, it would mean moving drop_prog_refcnt outside the spin lock
> critical section. hrtimer_cancel is already done after the unlock. For
> bpf_list_head, it would mean swapping out the list_head and then draining
> it outside the lock.

That also works.
drop_prog_refcnt() can be moved after the unlock. Don't see any race.

> Then we hopefully don't need to use notrace, and it wouldn't be possible
> for any tracing prog to execute while we hold the bpf_spin_lock (unless I
> missed something).

Yep. spin_lock, linked list, obj_new won't be allowed in
is_tracing_prog_type().
Only talking about prog_type_tracing, since those are a lot more than tracing.
Everything new falls into this category.
We might even create an alias like prog_type_generic to highlight that.