Re: [RFC PATCH v3 3/3] tracing: Fix syscall tracepoint use-after-free

Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> · Thu, 31 Oct 2024 11:43:07 -0400

On 2024-10-28 15:19, Mathieu Desnoyers wrote:
On 2024-10-27 21:22, Andrii Nakryiko wrote:
On Sat, Oct 26, 2024 at 8:48 AM Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:

The grace period used internally within tracepoint.c:release_probes()
uses call_rcu() to batch waiting for quiescence of old probe arrays,
rather than using the tracepoint_synchronize_unregister() which blocks
while waiting for quiescence.

With the introduction of faultable syscall tracepoints, this causes
use-after-free issues reproduced with syzkaller.

Fix this by using the appropriate call_rcu() or call_rcu_tasks_trace()
before invoking the rcu_free_old_probes callback. This can be chosen
using the tracepoint_is_syscall() API.

A similar issue exists in bpf use of call_rcu(). Fixing this is left to
a separate change.

Reported-by: syzbot+b390c8062d8387b6272a@xxxxxxxxxxxxxxxxxxxxxxxxx
Fixes: a363d27cdbc2 ("tracing: Allow system call tracepoints to 
handle page faults")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Cc: Michael Jeanson <mjeanson@xxxxxxxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
Cc: Masami Hiramatsu <mhiramat@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Alexei Starovoitov <ast@xxxxxxxxxx>
Cc: Yonghong Song <yhs@xxxxxx>
Cc: Paul E. McKenney <paulmck@xxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Arnaldo Carvalho de Melo <acme@xxxxxxxxxx>
Cc: Mark Rutland <mark.rutland@xxxxxxx>
Cc: Alexander Shishkin <alexander.shishkin@xxxxxxxxxxxxxxx>
Cc: Namhyung Kim <namhyung@xxxxxxxxxx>
Cc: Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx>
Cc: bpf@xxxxxxxxxxxxxxx
Cc: Joel Fernandes <joel@xxxxxxxxxxxxxxxxx>
Cc: Jordan Rife <jrife@xxxxxxxxxx>
---
Changes since v0:
- Introduce tracepoint_call_rcu(),
- Fix bpf_link_free() use of call_rcu as well.

Changes since v1:
- Use tracepoint_call_rcu() for bpf_prog_put as well.

Changes since v2:
- Do not cover bpf changes in the same commit, let bpf developers
   implement it.
---
  kernel/tracepoint.c | 11 +++++++----
  1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 5658dc92f5b5..47569fb06596 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -106,13 +106,16 @@ static void rcu_free_old_probes(struct rcu_head 
*head)
         kfree(container_of(head, struct tp_probes, rcu));
  }

-static inline void release_probes(struct tracepoint_func *old)
+static inline void release_probes(struct tracepoint *tp, struct 
tracepoint_func *old)
  {
         if (old) {
                 struct tp_probes *tp_probes = container_of(old,
                         struct tp_probes, probes[0]);

-               call_rcu(&tp_probes->rcu, rcu_free_old_probes);
+               if (tracepoint_is_syscall(tp))
+                       call_rcu_tasks_trace(&tp_probes->rcu, 
rcu_free_old_probes);

should this be call_rcu_tasks_trace() -> call_rcu() chain instead of
just call_rcu_tasks_trace()? While currently call_rcu_tasks_trace()
implies RCU GP (as evidenced by rcu_trace_implies_rcu_gp() being
hardcoded right now to returning true), this might not always be the
case in the future, so it's best to have a guarantee that regardless
of sleepable or not, we'll always have have RCU GP, and for sleepable
tracepoint *also* RCU Tasks Trace GP.

Given that faultable tracepoints only use RCU tasks trace for the
read-side and do not rely on preempt disable, I don't see why we would
need to chain both grace periods there ?

Hi Andrii,

AFAIU, your question above is rooted in the way bpf does its sleepable
program grace periods (chaining RCU tasks trace + RCU GP), e.g.:

bpf_map_free_mult_rcu_gp
bpf_link_defer_dealloc_mult_rcu_gp

and

bpf_link_free:
                /* schedule BPF link deallocation; if underlying BPF program
                 * is sleepable, we need to first wait for RCU tasks trace
                 * sync, then go through "classic" RCU grace period
                 */

This is introduced in commit 1a80dbcb2db ("bpf: support deferring bpf_link dealloc to after RCU grace period")
which has a bit more information in the commit message, but what I'm not seeing
is an explanation of *why* chaining RCU tasks trace and RCU grace periods is
needed for sleepable bpf programs. What am I missing ?

As far as tracepoint.c release_probes() is concerned, just waiting for
RCU tasks trace before freeing memory of faultable tracepoints is
sufficient.

Thanks,

Mathieu


Thanks,

Mathieu


+               else
+                       call_rcu(&tp_probes->rcu, rcu_free_old_probes);
         }
  }

@@ -334,7 +337,7 @@ static int tracepoint_add_func(struct tracepoint 
*tp,
                 break;
         }

-       release_probes(old);
+       release_probes(tp, old);
         return 0;
  }

@@ -405,7 +408,7 @@ static int tracepoint_remove_func(struct 
tracepoint *tp,
                 WARN_ON_ONCE(1);
                 break;
         }
-       release_probes(old);
+       release_probes(tp, old);
         return 0;
  }

--
2.39.5



--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com