On Fri, Nov 1, 2024 at 4:08 AM Xu Kuohai <xukuohai@xxxxxxxxxxxxxxx> wrote:
>
> From: Xu Kuohai <xukuohai@xxxxxxxxxx>
>
> Without kernel symbols for struct_ops trampoline, the unwinder may
> produce unexpected stacktraces.
>
> For example, the x86 ORC and FP unwinders check if an IP is in kernel
> text by verifying the presence of the IP's kernel symbol. When a
> struct_ops trampoline address is encountered, the unwinder stops due
> to the absence of symbol, resulting in an incomplete stacktrace that
> consists only of direct and indirect child functions called from the
> trampoline.
>
> The arm64 unwinder is another example. While the arm64 unwinder can
> proceed across a struct_ops trampoline address, the corresponding
> symbol name is displayed as "unknown", which is confusing.
>
> Thus, add kernel symbol for struct_ops trampoline. The name is
> bpf_trampoline_<PROG_NAME>, where PROG_NAME is the name of the
> struct_ops prog linked to the trampoline.
>
> Below is a comparison of stacktraces captured on x86 by perf record,
> before and after this patch.
>
> Before:
>
> ... FP chain: nr:4
> ..... 0: ffffffffffffff80 # PERF_CONTEXT_KERNEL mark
> ..... 1: ffffffff8116545d
> ..... 2: ffffffff81167fcc
> ..... 3: ffffffff813088f4
> ... thread: iperf:595
> ...... dso: /proc/kcore
> iperf 595 [002] 9015.616291: 824245 cycles:
> ffffffff8116545d __lock_acquire+0xad ([kernel.kallsyms])
> ffffffff81167fcc lock_acquire+0xcc ([kernel.kallsyms])
> ffffffff813088f4 __bpf_prog_enter+0x34 ([kernel.kallsyms])
>
> After:
>
> ... FP chain: nr:44
> ..... 0: ffffffffffffff80 # PERF_CONTEXT_KERNEL mark
> ..... 1: ffffffff81165930
> ..... 2: ffffffff81167fcc
> ..... 3: ffffffff813088f4
> ..... 4: ffffffffc000da5e
> ..... 5: ffffffff81f243df
> ..... 6: ffffffff81f27326
> ..... 7: ffffffff81f3a3c3
> ..... 8: ffffffff81f3c99b
> ..... 9: ffffffff81ef9870
> ..... 10: ffffffff81ef9b13
> ..... 11: ffffffff81ef9c69
> ..... 12: ffffffff81ef9f47
> ..... 13: ffffffff81efa15d
> ..... 14: ffffffff81efa9c0
> ..... 15: ffffffff81d979eb
> ..... 16: ffffffff81d987e8
> ..... 17: ffffffff81ddce16
> ..... 18: ffffffff81bc7b90
> ..... 19: ffffffff81bcf677
> ..... 20: ffffffff81bd1b4f
> ..... 21: ffffffff81d99693
> ..... 22: ffffffff81d99a52
> ..... 23: ffffffff810c9eb2
> ..... 24: ffffffff810ca631
> ..... 25: ffffffff822367db
> ..... 26: ffffffff824015ef
> ..... 27: ffffffff811678e6
> ..... 28: ffffffff814f7d85
> ..... 29: ffffffff814f8119
> ..... 30: ffffffff81492fb9
> ..... 31: ffffffff81355c53
> ..... 32: ffffffff813d79d7
> ..... 33: ffffffff813d88fc
> ..... 34: ffffffff8139a52e
> ..... 35: ffffffff8139a661
> ..... 36: ffffffff8152c193
> ..... 37: ffffffff8152cbc5
> ..... 38: ffffffff814a5908
> ..... 39: ffffffff814a72d3
> ..... 40: ffffffff814a758b
> ..... 41: ffffffff81008869
> ..... 42: ffffffff822323e8
> ..... 43: ffffffff8240012f

The above is visual noise.
Pls remove such addr dumps from the commit log.
The below part is enough.

> ... thread: sleep:493
> ...... dso: /proc/kcore
> sleep 493 [000] 55.483168: 410428 cycles:
> ffffffff81165930 __lock_acquire+0x580 ([kernel.kallsyms])
> ffffffff81167fcc lock_acquire+0xcc ([kernel.kallsyms])
> ffffffff813088f4 __bpf_prog_enter+0x34 ([kernel.kallsyms])
> ffffffffc000da5e bpf_trampoline_bpf_prog_075f577900bac1d2_bpf_cubic_acked+0x3a ([kernel.kallsyms])
> ffffffff81f243df tcp_ack+0xd4f ([kernel.kallsyms])
> ffffffff81f27326 tcp_rcv_established+0x3b6 ([kernel.kallsyms])
> ffffffff81f3a3c3 tcp_v4_do_rcv+0x193 ([kernel.kallsyms])
> ffffffff81f3c99b tcp_v4_rcv+0x11fb ([kernel.kallsyms])
> ffffffff81ef9870 ip_protocol_deliver_rcu+0x50 ([kernel.kallsyms])
> ffffffff81ef9b13 ip_local_deliver_finish+0xb3 ([kernel.kallsyms])
> ffffffff81ef9c69 ip_local_deliver+0x79 ([kernel.kallsyms])
> ffffffff81ef9f47 ip_sublist_rcv_finish+0xb7 ([kernel.kallsyms])
> ffffffff81efa15d ip_sublist_rcv+0x18d ([kernel.kallsyms])
> ffffffff81efa9c0 ip_list_rcv+0x110 ([kernel.kallsyms])
> ffffffff81d979eb __netif_receive_skb_list_core+0x21b ([kernel.kallsyms])
> ffffffff81d987e8 netif_receive_skb_list_internal+0x208 ([kernel.kallsyms])
> ffffffff81ddce16 napi_gro_receive+0xf6 ([kernel.kallsyms])
> ffffffff81bc7b90 virtnet_receive_done+0x340 ([kernel.kallsyms])
> ffffffff81bcf677 receive_buf+0xd7 ([kernel.kallsyms])
> ffffffff81bd1b4f virtnet_poll+0xcbf ([kernel.kallsyms])
> ffffffff81d99693 __napi_poll.constprop.0+0x33 ([kernel.kallsyms])
> ffffffff81d99a52 net_rx_action+0x1c2 ([kernel.kallsyms])
> ffffffff810c9eb2 handle_softirqs+0xe2 ([kernel.kallsyms])
> ffffffff810ca631 irq_exit_rcu+0x91 ([kernel.kallsyms])
> ffffffff822367db sysvec_apic_timer_interrupt+0x9b ([kernel.kallsyms])
> ffffffff824015ef asm_sysvec_apic_timer_interrupt+0x1f ([kernel.kallsyms])
> ffffffff811678e6 lock_release+0x186 ([kernel.kallsyms])
> ffffffff814f7d85 prepend_path+0x395 ([kernel.kallsyms])
> ffffffff814f8119 d_path+0x159 ([kernel.kallsyms])
> ffffffff81492fb9 file_path+0x19 ([kernel.kallsyms])
> ffffffff81355c53 perf_event_mmap+0x1e3 ([kernel.kallsyms])
> ffffffff813d79d7 mmap_region+0x2e7 ([kernel.kallsyms])
> ffffffff813d88fc do_mmap+0x4ec ([kernel.kallsyms])
> ffffffff8139a52e vm_mmap_pgoff+0xde ([kernel.kallsyms])
> ffffffff8139a661 vm_mmap+0x31 ([kernel.kallsyms])
> ffffffff8152c193 elf_load+0xa3 ([kernel.kallsyms])
> ffffffff8152cbc5 load_elf_binary+0x655 ([kernel.kallsyms])
> ffffffff814a5908 bprm_execve+0x2a8 ([kernel.kallsyms])
> ffffffff814a72d3 do_execveat_common.isra.0+0x193 ([kernel.kallsyms])
> ffffffff814a758b __x64_sys_execve+0x3b ([kernel.kallsyms])
> ffffffff81008869 x64_sys_call+0x1399 ([kernel.kallsyms])
> ffffffff822323e8 do_syscall_64+0x68 ([kernel.kallsyms])
> ffffffff8240012f entry_SYSCALL_64_after_hwframe+0x76 ([kernel.kallsyms])
>
> Fixes: 85d33df357b6 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS")
> Signed-off-by: Xu Kuohai <xukuohai@xxxxxxxxxx>
> Acked-by: Yonghong Song <yonghong.song@xxxxxxxxx>
> ---
> v2:
> Refine the commit message for clarity and fix a test bot warning
>
> v1:
> https://lore.kernel.org/bpf/20241030111533.907289-1-xukuohai@xxxxxxxxxxxxxxx/
> ---
>  include/linux/bpf.h         |  3 +-
>  kernel/bpf/bpf_struct_ops.c | 67 +++++++++++++++++++++++++++++++++++++
>  kernel/bpf/dispatcher.c     |  3 +-
>  kernel/bpf/trampoline.c     |  9 +++--
>  4 files changed, 78 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index c3ba4d475174..46f8d6c1a55c 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1402,7 +1402,8 @@ int arch_prepare_bpf_dispatcher(void *image, void *buf, s64 *funcs, int num_func
>  void bpf_dispatcher_change_prog(struct bpf_dispatcher *d, struct bpf_prog *from,
>                                  struct bpf_prog *to);
>  /* Called only from JIT-enabled code, so there's no need for stubs. */
> -void bpf_image_ksym_add(void *data, unsigned int size, struct bpf_ksym *ksym);
> +void bpf_image_ksym_init(void *data, unsigned int size, struct bpf_ksym *ksym);
> +void bpf_image_ksym_add(struct bpf_ksym *ksym);
>  void bpf_image_ksym_del(struct bpf_ksym *ksym);
>  void bpf_ksym_add(struct bpf_ksym *ksym);
>  void bpf_ksym_del(struct bpf_ksym *ksym);
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index fda3dd2ee984..172a081ed1c3 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -38,6 +38,9 @@ struct bpf_struct_ops_map {
>           * that stores the func args before calling the bpf_prog.
>           */
>          void *image_pages[MAX_TRAMP_IMAGE_PAGES];
> +        u32 ksyms_cnt;
> +        /* ksyms for bpf trampolines */
> +        struct bpf_ksym *ksyms;
>          /* The owner moduler's btf. */
>          struct btf *btf;
>          /* uvalue->data stores the kernel struct
> @@ -586,6 +589,35 @@ int bpf_struct_ops_prepare_trampoline(struct bpf_tramp_links *tlinks,
>          return 0;
>  }
>
> +static void bpf_struct_ops_ksym_init(struct bpf_prog *prog, void *image,
> +                                     unsigned int size, struct bpf_ksym *ksym)
> +{
> +        int n;
> +
> +        n = strscpy(ksym->name, "bpf_trampoline_", KSYM_NAME_LEN);
> +        strncat(ksym->name + n, prog->aux->ksym.name, KSYM_NAME_LEN - 1 - n);
> +        INIT_LIST_HEAD_RCU(&ksym->lnode);
> +        bpf_image_ksym_init(image, size, ksym);
> +}
> +
> +static void bpf_struct_ops_map_ksyms_add(struct bpf_struct_ops_map *st_map)
> +{
> +        struct bpf_ksym *ksym = st_map->ksyms;
> +        struct bpf_ksym *end = ksym + st_map->ksyms_cnt;
> +
> +        while (ksym != end && ksym->start)
> +                bpf_image_ksym_add(ksym++);
> +}
> +
> +static void bpf_struct_ops_map_ksyms_del(struct bpf_struct_ops_map *st_map)
> +{
> +        struct bpf_ksym *ksym = st_map->ksyms;
> +        struct bpf_ksym *end = ksym + st_map->ksyms_cnt;
> +
> +        while (ksym != end && ksym->start)
> +                bpf_image_ksym_del(ksym++);
> +}
> +
>  static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
>                                             void *value, u64 flags)
>  {
> @@ -601,6 +633,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
>          int prog_fd, err;
>          u32 i, trampoline_start, image_off = 0;
>          void *cur_image = NULL, *image = NULL;
> +        struct bpf_ksym *ksym;
>
>          if (flags)
>                  return -EINVAL;
> @@ -640,6 +673,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
>          kdata = &kvalue->data;
>
>          module_type = btf_type_by_id(btf_vmlinux, st_ops_ids[IDX_MODULE_ID]);
> +        ksym = st_map->ksyms;
>          for_each_member(i, t, member) {
>                  const struct btf_type *mtype, *ptype;
>                  struct bpf_prog *prog;
> @@ -735,6 +769,11 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
>
>                  /* put prog_id to udata */
>                  *(unsigned long *)(udata + moff) = prog->aux->id;
> +
> +                /* init ksym for this trampoline */
> +                bpf_struct_ops_ksym_init(prog, image + trampoline_start,
> +                                         image_off - trampoline_start,
> +                                         ksym++);

Thanks for the patch.

I think it's overkill to add a ksym for each callsite within a single
trampoline.
1. The prog name will be next in the stack. No need to duplicate it.
2. ksym-ing callsites this way is quite unusual.
3. Consider an irq on other insns within a trampoline. The unwinder
   won't find anything in such a case.

So I suggest to add only one ksym that covers the whole trampoline.
The name could be "bpf_trampoline_<struct_ops_name>", where the struct_ops
name probably comes from st_ops_desc->type.

>          }
>
>          if (st_ops->validate) {
> @@ -790,6 +829,8 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
>  unlock:
>          kfree(tlinks);
>          mutex_unlock(&st_map->lock);
> +        if (!err)
> +                bpf_struct_ops_map_ksyms_add(st_map);
>          return err;
>  }
>
> @@ -883,6 +924,10 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
>           */
>          synchronize_rcu_mult(call_rcu, call_rcu_tasks);
>
> +        /* no trampoline in the map is running anymore, delete symbols */
> +        bpf_struct_ops_map_ksyms_del(st_map);
> +        synchronize_rcu();
> +

This is substantial overhead, and why is it needed? synchronize_rcu_mult() is
right above.

pw-bot: cr
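
For illustration only, a minimal sketch of the single-ksym direction suggested
above: register one symbol that spans the whole trampoline image and name it
after the struct_ops type rather than each callsite. The helper name and the
st_ops_desc->st_ops->name lookup are assumptions made for this sketch, not code
from the posted patch or from this review:

/*
 * Sketch only: one ksym covering an entire trampoline image, named after
 * the struct_ops type (e.g. "bpf_trampoline_tcp_congestion_ops").
 * The helper name and st_ops_desc->st_ops->name are illustrative assumptions.
 */
static void bpf_struct_ops_map_add_image_ksym(struct bpf_struct_ops_map *st_map,
                                              void *image, unsigned int size,
                                              struct bpf_ksym *ksym)
{
        snprintf(ksym->name, KSYM_NAME_LEN, "bpf_trampoline_%s",
                 st_map->st_ops_desc->st_ops->name);
        INIT_LIST_HEAD_RCU(&ksym->lnode);
        /* one symbol for the whole image, not one per trampoline callsite */
        bpf_image_ksym_init(image, size, ksym);
        bpf_image_ksym_add(ksym);
}

With a single symbol per image, an unwinder can resolve any IP inside the
trampoline page, including insns between callsites, while the individual prog
names still show up as the next frames in the stack.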