On Mon, Apr 22, 2024 at 4:20 PM Siddharth Chintamaneni
<sidchintamaneni@xxxxxxxxx> wrote:
>
> On Mon, 22 Apr 2024 at 13:13, Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Mon, Apr 22, 2024 at 12:47 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> > >
> > > On Sun, Apr 21, 2024 at 07:43:36PM -0400, Siddharth Chintamaneni wrote:
> > > > This patch is to prevent deadlocks when multiple bpf
> > > > programs are attached to queued_spin_locks functions. This issue is similar
> > > > to what is already discussed[1] before with the spin_lock helpers.
> > > >
> > > > The addition of notrace macro to the queued_spin_locks
> > > > has been discussed[2] when bpf_spin_locks are introduced.
> > > >
> > > > [1] https://lore.kernel.org/bpf/CAE5sdEigPnoGrzN8WU7Tx-h-iFuMZgW06qp0KHWtpvoXxf1OAQ@xxxxxxxxxxxxxx/#r
> > > > [2] https://lore.kernel.org/all/20190117011629.efxp7abj4bpf5yco@ast-mbp/t/#maf05c4d71f935f3123013b7ed410e4f50e9da82c
> > > >
> > > > Fixes: d83525ca62cf ("bpf: introduce bpf_spin_lock")
> > > > Signed-off-by: Siddharth Chintamaneni <sidchintamaneni@xxxxxx>
> > > > ---
> > > >  kernel/locking/qspinlock.c                    |  2 +-
> > > >  .../bpf/prog_tests/tracing_failure.c          | 24 +++++++++++++++++++
> > > >  .../selftests/bpf/progs/tracing_failure.c     |  6 +++++
> > > >  3 files changed, 31 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> > > > index ebe6b8ec7cb3..4d46538d8399 100644
> > > > --- a/kernel/locking/qspinlock.c
> > > > +++ b/kernel/locking/qspinlock.c
> > > > @@ -313,7 +313,7 @@ static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock,
> > > >  * contended : (*,x,y) +--> (*,0,0) ---> (*,0,1) -'  :
> > > >  * queue     :         ^--'                          :
> > > >  */
> > > > -void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> > > > +notrace void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> > >
> > > we did the same for bpf spin lock helpers, which is fine, but I wonder
> > > removing queued_spin_lock_slowpath from traceable functions could break
> > > some scripts (even though many probably use contention tracepoints..)
> > >
> > > maybe we could have a list of helpers/kfuncs that could call spin lock
> > > and deny bpf program to load/attach to queued_spin_lock_slowpath
> > > if it calls anything from that list
> >
> > We can filter out many such functions, but the possibility of deadlock
> > will still exist.
> > Adding notrace here won't help much,
> > since there are tracepoints in there: trace_contention_begin/end
> > which are quite useful and we should still allow bpf to use them.
> > I think the only bullet proof way is to detect deadlocks at runtime.
> > I'm working on such "try hard to spin_lock but abort if it deadlocks."
>
> I agree with the point that notracing all the functions will not
> resolve the issue. I could also find a scenario where BPF programs
> will end up in a deadlock easily by using bpf_map_pop_elem and
> bpf_map_push_elem helper functions called from two different BPF
> programs accessing the same map. Here are some issues raised by syzbot
> [2, 3].

ringbuf and stackqueue maps should probably be fixed now
similar to hashmap's __this_cpu_inc_return(*(htab->map_locked...) approach.
Both ringbuf and queue_stack can handle failure to lock.
That will address the issue spotted by these 2 syzbot reports.
Could you work on such patches?
The full run-time solution will take time to land
and may be too big to be backportable.
I'll certainly cc you on the patches when I send them.

> I also believe that a BPF program can end up in a deadlock scenario
> without any assistance from the second BPF program, like described
> above. The runtime solution sounds like a better fit to address this
> problem, unless there is a BPF program that should definitely run for
> the performance or security of the system (like an LSM hook or a
> nested scheduling type program as mentioned here [1]).

Right.
Certain bpf progs like tcp-bpf don't even have a recursion run-time counter,
because they have to nest and it's safe to nest.

> In those cases, the user assumes that these BPF programs will always
> trigger. So, to address these types of issues, we are currently
> working on a helper's function callgraph based approach so that the
> verifier gets the ability to make a decision during load time on
> whether to load it or not, ensuring that if a BPF program is attached,
> it will be triggered.

callgraph approach? Could you share more details?

>
> [1] https://lore.kernel.org/all/a15f6a20-c902-4057-a1a9-8259a225bb8b@xxxxxxxxx/
> [2] https://lore.kernel.org/lkml/0000000000004aa700061379547e@xxxxxxxxxx/
> [3] https://lore.kernel.org/lkml/0000000000004c3fc90615f37756@xxxxxxxxxx/
>
> PS. We are following a similar approach to solve the stackoverflow
> problem with nesting.

Not following.
The stack overflow issue is being fixed by not using the kernel stack.
So each bpf prog will consume a tiny bit of stack to save
frame pointer, return address, and callee regs.