Re: [RFC PATCH] tracing: Fix syscall tracepoint use-after-free

On 2024-10-23 10:56, Jordan Rife wrote:
> Mathieu's patch alone does not seem to be enough to prevent the
> use-after-free issue reported by syzbot.
>
> Link: https://lore.kernel.org/bpf/67121037.050a0220.10f4f4.000f.GAE@xxxxxxxxxx/T/#u
>
> I reran the repro script with his patch applied to my tree and was
> still able to get the same KASAN crash to occur.
>
> In this case, when bpf_link_free is invoked, it kicks off three
> instances of call_rcu*.
>
> bpf_link_free()
>    ops->release()
>       bpf_raw_tp_link_release()
>         bpf_probe_unregister()
>           tracepoint_probe_unregister()
>             tracepoint_remove_func()
>               release_probes()
>                 call_rcu()               [1]
>    bpf_prog_put()
>      __bpf_prog_put()
>        bpf_prog_put_deferred()
>          __bpf_prog_put_noref()
>             call_rcu()                   [2]
>    call_rcu()                            [3]
>
> With Mathieu's patch, [1] is chained with call_rcu_tasks_trace(),
> making the grace period sufficiently long to safely free the probe
> itself. The callbacks for [2] and [3] may be invoked before the
> call_rcu_tasks_trace() grace period has elapsed, leading to the link
> or program itself being freed while still in use. I was able to
> prevent any crashes with the patch below, which also chains
> call_rcu_tasks_trace() and call_rcu() at [2] and [3].

Right, so removal of the tracepoint probe is done by
tracepoint_probe_unregister() effectively removing the
probe function from the array. The read-side counterpart
of that is in __DO_TRACE(), where the RCU dereference is
now protected by rcu_read_lock_trace() for syscall
tracepoints.
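
To make that explicit, the read side looks roughly like this (a
simplified sketch of the tracepoint probe iteration, not the verbatim
kernel code): the dereference of the probe array and the probe calls
all sit within the rcu_read_lock_trace() critical section, which is
what lets the unregister side rely on a tasks-trace grace period
before freeing the old array.

/*
 * Simplified sketch (not verbatim __DO_TRACE()): the probe array
 * dereference and the probe invocations are covered by the
 * tasks-trace RCU read-side critical section.
 */
rcu_read_lock_trace();
it_func_ptr = rcu_dereference_raw(tp->funcs);
if (it_func_ptr) {
	do {
		it_func = it_func_ptr->func;
		__data = it_func_ptr->data;
		((void (*)(void *))it_func)(__data);
	} while ((++it_func_ptr)->func);
}
rcu_read_unlock_trace();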

We cannot expect that surrounding the eBPF probe execution
with preempt disable, like so:

#define __BPF_DECLARE_TRACE_SYSCALL(call, proto, args)                  \
static notrace void                                                     \
__bpf_trace_##call(void *__data, proto)                                 \
{                                                                       \
        might_fault();                                                  \
        preempt_disable_notrace();                                      \
        CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args));        \
        preempt_enable_notrace();                                       \
}

is sufficient to delay reclaim with call_rcu() after a tracepoint
unregister, because the preempt-disabled region does not cover the
RCU dereference done by the tracepoint itself.

So relying on a call_rcu() to delay reclaim of the BPF objects
after unregistering their associated tracepoint is indeed not
enough. Chaining call_rcu() with call_rcu_tasks_trace() works,
though.
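
For reference, the chaining idiom looks like this (a minimal sketch;
"struct obj" and the callback names are illustrative only, the
concrete callbacks are in the patch below):

static void obj_free_rcu(struct rcu_head *rcu)
{
	kfree(container_of(rcu, struct obj, rcu));
}

static void obj_free_tasks_trace_rcu(struct rcu_head *rcu)
{
	/*
	 * When a tasks-trace RCU grace period also implies a regular
	 * RCU grace period, free immediately; otherwise chain a
	 * regular call_rcu() so both grace periods elapse before the
	 * object is reclaimed.
	 */
	if (rcu_trace_implies_rcu_gp())
		obj_free_rcu(rcu);
	else
		call_rcu(rcu, obj_free_rcu);
}

/* Reclaim path: tasks-trace RCU grace period first, then regular RCU. */
call_rcu_tasks_trace(&obj->rcu, obj_free_tasks_trace_rcu);

rcu_trace_implies_rcu_gp() only short-circuits the second grace
period when the tasks-trace flavor is known to imply it, so the
common case does not pay for two back-to-back grace periods.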

The same question is relevant for ftrace and perf too: are there data
structures that are reclaimed with call_rcu() after being unregistered
from syscall tracepoints?

Thanks Jordan for your thorough analysis,

Mathieu


> ---
>  kernel/bpf/syscall.c | 24 ++++++++++--------------
>  1 file changed, 10 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 59de664e580d..5290eccb465e 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2200,6 +2200,14 @@ static void __bpf_prog_put_rcu(struct rcu_head *rcu)
>  	bpf_prog_free(aux->prog);
>  }
>
> +static void __bpf_prog_put_tasks_trace_rcu(struct rcu_head *rcu)
> +{
> +	if (rcu_trace_implies_rcu_gp())
> +		__bpf_prog_put_rcu(rcu);
> +	else
> +		call_rcu(rcu, __bpf_prog_put_rcu);
> +}
> +
>  static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred)
>  {
>  	bpf_prog_kallsyms_del_all(prog);
> @@ -2212,10 +2220,7 @@ static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred)
>  		btf_put(prog->aux->attach_btf);
>
>  	if (deferred) {
> -		if (prog->sleepable)
> -			call_rcu_tasks_trace(&prog->aux->rcu, __bpf_prog_put_rcu);
> -		else
> -			call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
> +		call_rcu_tasks_trace(&prog->aux->rcu, __bpf_prog_put_tasks_trace_rcu);
>  	} else {
>  		__bpf_prog_put_rcu(&prog->aux->rcu);
>  	}
> @@ -2996,24 +3001,15 @@ static void bpf_link_defer_dealloc_mult_rcu_gp(struct rcu_head *rcu)
>  static void bpf_link_free(struct bpf_link *link)
>  {
>  	const struct bpf_link_ops *ops = link->ops;
> -	bool sleepable = false;
>
>  	bpf_link_free_id(link->id);
>  	if (link->prog) {
> -		sleepable = link->prog->sleepable;
>  		/* detach BPF program, clean up used resources */
>  		ops->release(link);
>  		bpf_prog_put(link->prog);
>  	}
>  	if (ops->dealloc_deferred) {
> -		/* schedule BPF link deallocation; if underlying BPF program
> -		 * is sleepable, we need to first wait for RCU tasks trace
> -		 * sync, then go through "classic" RCU grace period
> -		 */
> -		if (sleepable)
> -			call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
> -		else
> -			call_rcu(&link->rcu, bpf_link_defer_dealloc_rcu_gp);
> +		call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
>  	} else if (ops->dealloc)
>  		ops->dealloc(link);
>  }

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com