On Mon, Jun 13, 2022 at 10:47 AM Satya Durga Srinivasu Prabhala
<quic_satyap@xxxxxxxxxxx> wrote:
>
> On 6/13/22 9:35 AM, Yonghong Song wrote:
> >
> > On 6/13/22 2:22 AM, Toke Høiland-Jørgensen wrote:
> >>
> >> Satya Durga Srinivasu Prabhala <quic_satyap@xxxxxxxxxxx> writes:
> >>
> >>> The recursion below is observed in a rare scenario where __schedule()
> >>> takes the rq lock while, at around the same time, the task's affinity
> >>> is being changed. The BPF program tracing sched_switch calls
> >>> migrate_enable(), which detects the affinity change (cpus_ptr !=
> >>> cpus_mask) and lands in __set_cpus_allowed_ptr(), which tries to
> >>> acquire the rq lock, causing the recursion bug.
> >>
> >> So this only affects tracing programs that attach to tasks that can have
> >> their affinity changed? Or do we need to review migrate_enable() vs
> >> preempt_enable() for networking hooks as well?
> >
> > I think that changing from migrate_disable() to preempt_disable()
> > won't work for RT kernels. In fact, the original switch from
> > preempt_disable() to migrate_disable() was triggered by the RT kernel
> > discussion.
> >
> > As you mentioned, this is a very special case. I am not sure whether we
> > have a good way to fix it. Is it possible to check whether the rq lock
> > is held under the condition cpus_ptr != cpus_mask? If it is,
> > migrate_disable() (or a variant of it) should return an error code
> > to indicate it won't work.
>
> That essentially becomes using preempt_enable/disable().
> If we need migrate_enable/disable() for RT kernels, we can
> add a specific check for RT kernels like the one below, which should
> fix the issue for non-RT kernels?
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 2b914a56a2c5..ec1a287dbf5e 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1414,7 +1414,11 @@ bpf_prog_run_array(const struct bpf_prog_array *array,
>          if (unlikely(!array))
>                  return ret;
>
> +#ifdef CONFIG_PREEMPT_RT
>          migrate_disable();
> +#else
> +        preempt_disable();
> +#endif
>          old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
>          item = &array->items[0];
>          while ((prog = READ_ONCE(item->prog))) {
> @@ -1423,7 +1427,11 @@ bpf_prog_run_array(const struct bpf_prog_array *array,
>                  item++;
>          }
>          bpf_reset_run_ctx(old_run_ctx);
> +#ifdef CONFIG_PREEMPT_RT
>          migrate_enable();
> +#else
> +        preempt_enable();
> +#endif

This doesn't solve anything.
Please provide a reproducer.
IIRC the task's affinity change can race even with preemption disabled
on this CPU.
Why would s/migrate/preemption/ address the deadlock?

Once there is a reproducer, we need to figure out a different way
of addressing it. Maybe we will special-case trace_sched_switch.
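
For reference, the recursion described at the top of the thread corresponds
to a call chain roughly like the sketch below. It is reconstructed from the
bug description, not a verbatim stack trace, and the two helpers underneath
are a hypothetical refactoring of the #ifdef logic in the proposed patch
(the names bpf_run_array_protect/unprotect are invented here for clarity);
they only restate the patch's intent of keeping migrate_disable() on
PREEMPT_RT and falling back to preempt_disable() elsewhere, which Alexei
questions above.

/*
 * Sketch of the reported recursion (reconstructed from the description,
 * not a verbatim trace):
 *
 *   __schedule()
 *     rq lock acquired
 *     trace_sched_switch()
 *       BPF tracing program -> bpf_prog_run_array() (the function the
 *                              proposed patch modifies)
 *         migrate_enable()            after the programs have run
 *           cpus_ptr != cpus_mask     affinity changed in the meantime
 *           __set_cpus_allowed_ptr()
 *             tries to acquire the rq lock already held by __schedule()
 *               -> recursion
 */

/* Hypothetical helpers restating the proposed patch's #ifdef logic. */
static inline void bpf_run_array_protect(void)
{
#ifdef CONFIG_PREEMPT_RT
        migrate_disable();      /* RT: preemption must stay enabled */
#else
        preempt_disable();      /* non-RT: avoid migrate_enable()'s affinity path */
#endif
}

static inline void bpf_run_array_unprotect(void)
{
#ifdef CONFIG_PREEMPT_RT
        migrate_enable();
#else
        preempt_enable();
#endif
}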