Hi Paul, Looks like sync rcu_tasks_trace got slower or we simply didn't notice it earlier. In selftests/bpf try: time ./test_progs -t trampoline_count #101 trampoline_count:OK Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED real 1m17.082s user 0m0.145s sys 0m1.369s But with the following hack: diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index 7dd523a7e32d..c417b817ec5d 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -217,7 +217,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr) * programs finish executing. * Wait for these two grace periods together. */ - synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace); +// synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace); I see: time ./test_progs -t trampoline_count #101 trampoline_count:OK Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED real 0m1.588s user 0m0.131s sys 0m1.342s It takes an extra minute to do 40 sync rcu_tasks_trace calls. It means that every sync takes more than a second. That feels excessive. Doing: - synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace); + synchronize_rcu(); is also fast: time ./test_progs -t trampoline_count #101 trampoline_count:OK Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED real 0m2.089s user 0m0.139s sys 0m1.282s sync rcu_tasks() is fast too: - synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace); + synchronize_rcu_tasks(); time ./test_progs -t trampoline_count #101 trampoline_count:OK Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED real 0m2.209s user 0m0.117s sys 0m1.344s so it's really something going on with sync rcu_tasks_trace. Could you please take a look?