Re: [RFC PATCH] tracing: Fix syscall tracepoint use-after-free

On 2024-10-23 11:14, Alexei Starovoitov wrote:
On Wed, Oct 23, 2024 at 7:56 AM Jordan Rife <jrife@xxxxxxxxxx> wrote:

Mathieu's patch alone does not seem to be enough to prevent the
use-after-free issue reported by syzbot.

Link: https://lore.kernel.org/bpf/67121037.050a0220.10f4f4.000f.GAE@xxxxxxxxxx/T/#u

I reran the repro script with his patch applied to my tree and still hit
the same KASAN crash.

In this case, when bpf_link_free() is invoked it kicks off three
instances of call_rcu*:

bpf_link_free()
   ops->release()
      bpf_raw_tp_link_release()
        bpf_probe_unregister()
          tracepoint_probe_unregister()
            tracepoint_remove_func()
              release_probes()
                call_rcu()               [1]
   bpf_prog_put()
     __bpf_prog_put()
       bpf_prog_put_deferred()
         __bpf_prog_put_noref()
            call_rcu()                   [2]
   call_rcu()                            [3]
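
For reference, the "chaining" described below is the usual kernel pattern
for stacking a Tasks Trace RCU grace period in front of a normal RCU grace
period: the call_rcu_tasks_trace() callback re-queues the final callback
with call_rcu(), so the object is freed only after both grace periods have
elapsed. A minimal sketch of the pattern (hypothetical struct foo, not the
exact kernel code):

struct foo {
        struct rcu_head rcu;
        /* ... payload ... */
};

static void foo_free_rcu(struct rcu_head *rcu)
{
        kfree(container_of(rcu, struct foo, rcu));
}

static void foo_free_tasks_trace_rcu(struct rcu_head *rcu)
{
        /*
         * If the Tasks Trace RCU grace period also implies a normal
         * RCU grace period, free immediately; otherwise chain a
         * normal call_rcu() behind it.
         */
        if (rcu_trace_implies_rcu_gp())
                foo_free_rcu(rcu);
        else
                call_rcu(rcu, foo_free_rcu);
}

/* Reclaim foo only after a tasks-trace GP followed by a normal GP: */
call_rcu_tasks_trace(&foo->rcu, foo_free_tasks_trace_rcu);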

With Mathieu's patch, [1] is chained with call_rcu_tasks_trace(),
making the grace period sufficiently long to safely free the probe itself.
However, the callbacks for [2] and [3] may be invoked before the
call_rcu_tasks_trace() grace period has elapsed, leading to the link or
the program itself being freed while still in use. I was able to prevent
any crashes with the patch below, which also chains
call_rcu_tasks_trace() and call_rcu() at [2] and [3].

---
  kernel/bpf/syscall.c | 24 ++++++++++--------------
  1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 59de664e580d..5290eccb465e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2200,6 +2200,14 @@ static void __bpf_prog_put_rcu(struct rcu_head *rcu)
         bpf_prog_free(aux->prog);
  }

+static void __bpf_prog_put_tasks_trace_rcu(struct rcu_head *rcu)
+{
+       if (rcu_trace_implies_rcu_gp())
+               __bpf_prog_put_rcu(rcu);
+       else
+               call_rcu(rcu, __bpf_prog_put_rcu);
+}
+
  static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred)
  {
         bpf_prog_kallsyms_del_all(prog);
@@ -2212,10 +2220,7 @@ static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred)
                 btf_put(prog->aux->attach_btf);

         if (deferred) {
-               if (prog->sleepable)
-                       call_rcu_tasks_trace(&prog->aux->rcu, __bpf_prog_put_rcu);
-               else
-                       call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
+               call_rcu_tasks_trace(&prog->aux->rcu, __bpf_prog_put_tasks_trace_rcu);
         } else {
                 __bpf_prog_put_rcu(&prog->aux->rcu);
         }
@@ -2996,24 +3001,15 @@ static void bpf_link_defer_dealloc_mult_rcu_gp(struct rcu_head *rcu)
  static void bpf_link_free(struct bpf_link *link)
  {
         const struct bpf_link_ops *ops = link->ops;
-       bool sleepable = false;

         bpf_link_free_id(link->id);
         if (link->prog) {
-               sleepable = link->prog->sleepable;
                 /* detach BPF program, clean up used resources */
                 ops->release(link);
                 bpf_prog_put(link->prog);
         }
         if (ops->dealloc_deferred) {
-               /* schedule BPF link deallocation; if underlying BPF program
-                * is sleepable, we need to first wait for RCU tasks trace
-                * sync, then go through "classic" RCU grace period
-                */
-               if (sleepable)
-                       call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
-               else
-                       call_rcu(&link->rcu, bpf_link_defer_dealloc_rcu_gp);
+               call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);

This patch is completely wrong.

Actually I suspect Jordan's patch works.

Looks like Mathieu's patch broke the bpf program contract somewhere.

My patch series introduced this in the probe:

#define __BPF_DECLARE_TRACE_SYSCALL(call, proto, args)                  \
static notrace void                                                     \
__bpf_trace_##call(void *__data, proto)                                 \
{                                                                       \
        might_fault();                                                  \
        preempt_disable_notrace();                                      \
        CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args));        \
        preempt_enable_notrace();                                       \
}

To ensure we'd call the bpf program from preempt-off context.

Tracepoint-type bpf programs must be called with rcu_read_lock() held.

Calling the bpf program with preempt off is equivalent. __DO_TRACE() calls
the probes with preempt_disable_notrace() in the normal case.

Looks like it's not happening anymore.

The issue here is not about the context in which the bpf program runs;
that is still preempt-off. The problem is the expectation that a call_rcu()
grace period is enough to delay reclaim after unregistration of the
tracepoint. Now that __DO_TRACE() uses rcu_read_lock_trace() to protect
the RCU dereference, a call_rcu() grace period is no longer sufficient,
at least for syscall tracepoints.
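
For reference, after the faultable syscall tracepoint changes, __DO_TRACE()
in include/linux/tracepoint.h looks roughly like this (a simplified sketch
of the macro, not the verbatim kernel source):

#define __DO_TRACE(name, args, cond, syscall)                           \
do {                                                                    \
        if (!(cond))                                                    \
                return;                                                 \
                                                                        \
        /* Syscall probes may fault: protect with Tasks Trace RCU. */  \
        if (syscall)                                                    \
                rcu_read_lock_trace();                                  \
        else                                                            \
                preempt_disable_notrace();                              \
                                                                        \
        __DO_TRACE_CALL(name, TP_ARGS(args));                           \
                                                                        \
        if (syscall)                                                    \
                rcu_read_unlock_trace();                                \
        else                                                            \
                preempt_enable_notrace();                               \
} while (0)

So for syscall tracepoints the probe is protected by Tasks Trace RCU, and
unregistration must wait for a Tasks Trace RCU grace period (not just a
normal one) before reclaiming anything the probe can still reach.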

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




