On Fri, Nov 8, 2024 at 12:15 AM Xu Kuohai <xukuohai@xxxxxxxxxxxxxxx> wrote: > > From: Xu Kuohai <xukuohai@xxxxxxxxxx> > > The test in the follow-up patch triggers the following kernel panic: > > Oops: int3: 0000 [#1] PREEMPT SMP PTI > CPU: 0 UID: 0 PID: 465 Comm: test_progs Tainted: G OE 6.12.0-rc4-gd1d187515 > Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-pr4 > RIP: 0010:0xffffffffc0015041 > Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc ccc > RSP: 0018:ffffc9000187fd20 EFLAGS: 00000246 > RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000 > RDX: 0000000000000000 RSI: ffffffff82c54639 RDI: 0000000000000000 > RBP: ffffc9000187fd48 R08: 0000000000000001 R09: 0000000000000000 > R10: 0000000000000001 R11: 000000004cba6383 R12: 0000000000000000 > R13: 0000000000000002 R14: ffff88810438b7a0 R15: ffffffff8369d7a0 > FS: 00007f05178006c0(0000) GS:ffff888236e00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f0508c94000 CR3: 0000000100d16003 CR4: 0000000000170ef0 > Call Trace: > <TASK> > ? show_regs+0x68/0x80 > ? die+0x3b/0x90 > ? exc_int3+0xca/0xe0 > ? asm_exc_int3+0x3e/0x50 > run_struct_ops+0x58/0xb0 [bpf_testmod] > param_attr_store+0xa2/0x100 > module_attr_store+0x25/0x40 > sysfs_kf_write+0x50/0x70 > kernfs_fop_write_iter+0x146/0x1f0 > vfs_write+0x27e/0x530 > ksys_write+0x75/0x100 > __x64_sys_write+0x1d/0x30 > x64_sys_call+0x1d30/0x1f30 > do_syscall_64+0x68/0x140 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > RIP: 0033:0x7f051831727f > Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 39 d5 f8 ff 48 8b 54 24 18 48 8b 74 24 108 > RSP: 002b:00007f05177ffdf0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 > RAX: ffffffffffffffda RBX: 00007f05178006c0 RCX: 00007f051831727f > RDX: 0000000000000002 RSI: 00007f05177ffe30 RDI: 0000000000000004 > RBP: 00007f05177ffe90 R08: 0000000000000000 R09: 000000000000000b > R10: 0000000000000000 R11: 0000000000000293 R12: ffffffffffffff30 > R13: 0000000000000002 R14: 00007ffd926bfd50 R15: 00007f0517000000 > </TASK> > > It's because the sleepable prog is still running when the struct_ops > map is released. > > To fix it, follow the approach used in bpf_tramp_image_put. First, > before release struct_ops map, wait a rcu_tasks_trace gp for sleepable > progs to finish. Then, wait a rcu_tasks gp for normal progs and the > rest trampoline insns to finish. > > Additionally, switch to call_rcu to remove the synchronous waiting, > as suggested by Alexei and Martin. > > Fixes: b671c2067a04 ("bpf: Retire the struct_ops map kvalue->refcnt.") > Suggested-by: Alexei Starovoitov <ast@xxxxxxxxxx> > Suggested-by: Martin KaFai Lau <martin.lau@xxxxxxxxx> > Signed-off-by: Xu Kuohai <xukuohai@xxxxxxxxxx> > --- > kernel/bpf/bpf_struct_ops.c | 37 +++++++++++++++++++------------------ > kernel/bpf/syscall.c | 7 ++++++- > 2 files changed, 25 insertions(+), 19 deletions(-) > > diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c > index fda3dd2ee984..dd5f9bf12791 100644 > --- a/kernel/bpf/bpf_struct_ops.c > +++ b/kernel/bpf/bpf_struct_ops.c > @@ -865,24 +865,6 @@ static void bpf_struct_ops_map_free(struct bpf_map *map) > */ > if (btf_is_module(st_map->btf)) > module_put(st_map->st_ops_desc->st_ops->owner); > - > - /* The struct_ops's function may switch to another struct_ops. > - * > - * For example, bpf_tcp_cc_x->init() may switch to > - * another tcp_cc_y by calling > - * setsockopt(TCP_CONGESTION, "tcp_cc_y"). > - * During the switch, bpf_struct_ops_put(tcp_cc_x) is called > - * and its refcount may reach 0 which then free its > - * trampoline image while tcp_cc_x is still running. > - * > - * A vanilla rcu gp is to wait for all bpf-tcp-cc prog > - * to finish. bpf-tcp-cc prog is non sleepable. > - * A rcu_tasks gp is to wait for the last few insn > - * in the tramopline image to finish before releasing > - * the trampoline image. > - */ > - synchronize_rcu_mult(call_rcu, call_rcu_tasks); > - > __bpf_struct_ops_map_free(map); > } > > @@ -974,6 +956,25 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) > mutex_init(&st_map->lock); > bpf_map_init_from_attr(map, attr); > > + /* The struct_ops's function may switch to another struct_ops. > + * > + * For example, bpf_tcp_cc_x->init() may switch to > + * another tcp_cc_y by calling > + * setsockopt(TCP_CONGESTION, "tcp_cc_y"). > + * During the switch, bpf_struct_ops_put(tcp_cc_x) is called > + * and its refcount may reach 0 which then free its > + * trampoline image while tcp_cc_x is still running. > + * > + * Since struct_ops prog can be sleepable, a rcu_tasks_trace gp > + * is to wait for sleepable progs in the map to finish. Then a > + * rcu_tasks gp is to wait for the normal progs and the last few > + * insns in the tramopline image to finish before releasing the > + * trampoline image. > + * > + * Also see the comment in function bpf_tramp_image_put. > + */ > + WRITE_ONCE(map->free_after_mult_rcu_gp, true); > + > return map; > > errout_free: > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index 8254b2973157..ae927f087f04 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -886,7 +886,12 @@ static void bpf_map_free_rcu_gp(struct rcu_head *rcu) > > static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu) > { > - if (rcu_trace_implies_rcu_gp()) > + struct bpf_map *map = container_of(rcu, struct bpf_map, rcu); > + > + if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) > + /* See comment in the end of bpf_struct_ops_map_alloc */ The fix makes sense, but pls copy paste a sentence here instead of the above comment. Like: " rcu_tasks gp is necessary to wait for struct_ops bpf trampoline to finish. Unlike all other bpf trampolines struct_ops trampoline is not protected by percpu_ref. " > + call_rcu_tasks(rcu, bpf_map_free_rcu_gp); > + else if (rcu_trace_implies_rcu_gp()) pw-bot: cr