Re: [RFC PATCH] tracing: Fix syscall tracepoint use-after-free

On Wed, Oct 23, 2024 at 7:05 PM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>
> On Wed, 23 Oct 2024 11:19:40 -0400
> Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> >
> > > Looks like Mathieu patch broke bpf program contract somewhere.
> >
> > My patch series introduced this in the probe:
> >
> > #define __BPF_DECLARE_TRACE_SYSCALL(call, proto, args)                  \
> > static notrace void                                                     \
> > __bpf_trace_##call(void *__data, proto)                                 \
> > {                                                                       \
> >          might_fault();                                                  \
> >          preempt_disable_notrace();                                      \
>
> Is the problem that we can call this function *after* the prog has been
> freed? That is, the preempt_disable_notrace() here is meaningless.
>

Yes, I think so.

> Is there a way to add something here to make sure the program is still
> valid? Like set a flag in the link structure?

So I think a big assumption right now is that waiting for an RCU grace
period is enough to guarantee that the BPF link (and thus its underlying
BPF program) is no longer referenced from that tracepoint, and so
we can proceed with freeing the memory.

Now that some tracepoints are sleepable and are RCU Tasks Trace
protected, this assumption is wrong.

One solution might be to teach the BPF raw tracepoint link to recognize
sleepable tracepoints, and then go through a call_rcu_tasks_trace ->
call_rcu chain instead of a plain call_rcu. Similarly, for such cases
we'd need to do the same chain for the underlying BPF program, even if
the BPF program itself is not sleepable.
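The chained deferral could be sketched roughly like below. Everything
here is a hypothetical userspace simulation of the pattern (the struct
names, the gp_stages counter, and the stubs that invoke callbacks
immediately in place of real grace-period expiry are all made up); the
real change would use the kernel's call_rcu_tasks_trace()/call_rcu():

```c
#include <assert.h>
#include <stddef.h>

struct rcu_head {
        void (*func)(struct rcu_head *head);
};

/* Stub: the real call_rcu() defers until a normal RCU grace period. */
static void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *))
{
        func(head);
}

/* Stub: the real call_rcu_tasks_trace() waits for an RCU Tasks Trace GP. */
static void call_rcu_tasks_trace(struct rcu_head *head,
                                 void (*func)(struct rcu_head *))
{
        func(head);
}

/* Hypothetical link object; 'sleepable' marks a faultable tracepoint. */
struct raw_tp_link_sim {
        struct rcu_head rcu;    /* must stay first for the casts below */
        int sleepable;
        int gp_stages;          /* grace periods traversed before free */
};

static void link_free_final(struct rcu_head *head)
{
        struct raw_tp_link_sim *link = (struct raw_tp_link_sim *)head;

        link->gp_stages++;      /* normal RCU GP done; kfree() would go here */
}

static void link_free_after_tasks_trace(struct rcu_head *head)
{
        struct raw_tp_link_sim *link = (struct raw_tp_link_sim *)head;

        link->gp_stages++;                      /* tasks-trace GP done... */
        call_rcu(&link->rcu, link_free_final);  /* ...now chain normal RCU */
}

/* Release path: only sleepable links pay for the two-stage chain. */
static int release_link(struct raw_tp_link_sim *link)
{
        if (link->sleepable)
                call_rcu_tasks_trace(&link->rcu, link_free_after_tasks_trace);
        else
                call_rcu(&link->rcu, link_free_final);
        return link->gp_stages;
}
```

With the immediate-invocation stubs, a sleepable link traverses two
deferral stages before its memory would be freed, a non-sleepable link
only one, which is the whole point: non-sleepable tracepoints keep the
cheap single call_rcu() path.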

Alternatively, we'd need to add synchronize_rcu() +
synchronize_rcu_tasks_trace() somewhere inside
tracepoint_probe_unregister() or bpf_probe_unregister() (which is just
a thin wrapper around the former). That would make detaching multiple
tracepoints extremely slow (just like the problem we had with kprobe
detachment before multi-kprobes were added).
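In pseudocode, that synchronous variant amounts to the following
(hypothetical placement inside the detach path; not an actual patch):

```
/* detach path for a link attached to a sleepable (faultable) tracepoint */
bpf_probe_unregister(btp, link);   /* thin wrapper, as today */
synchronize_rcu_tasks_trace();     /* wait out sleepable probe callers */
synchronize_rcu();                 /* then a normal RCU grace period */
/* only now is it safe to free link/prog */
```

The two synchronize calls are what makes bulk detachment slow: they are
paid serially, once per tracepoint being detached.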

The fundamental reason for this is how we do lifetime tracking between
tracepoint object and bpf_link/bpf_program. tracepoint doesn't hold a
refcount on bpf_link. It's rather that when bpf_link's last refcount
drops to zero, we go and do this:

static void bpf_raw_tp_link_release(struct bpf_link *link)
{
        struct bpf_raw_tp_link *raw_tp =
                container_of(link, struct bpf_raw_tp_link, link);

        bpf_probe_unregister(raw_tp->btp, raw_tp);
        bpf_put_raw_tracepoint(raw_tp->btp);
}

And the assumption is that after bpf_probe_unregister() it should be
fine to free the BPF link and program after call_rcu(). So we either make
bpf_probe_unregister() synchronously wait in
synchronize_rcu[_tasks_trace](), or we make sure not to free the link/prog
until a call_rcu_tasks_trace -> call_rcu chain has completed.

I'd prefer the latter (add a call_rcu_tasks_trace stage for sleepable
BPF raw tracepoint links).

You guys said you have a reproducer, right? Can you please share the
details? (I know they're somewhere in another thread, but let's collect
everything in this one.) And, as Alexei asked, where have the patches
adding faultable tracepoints been applied?

>
> (I don't know how BPF works well enough to know what is involved here,
> so excuse me if this is totally off)

it's not off, but I don't think we want the tracepoint to hold a refcount
on the bpf_link (and thus the bpf_program), because that would invalidate
how we do detachment.

>
> -- Steve
>
>
> >          CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args));        \
> >          preempt_enable_notrace();                                       \
> > }
> >
