On Sun, Jun 09, 2024 at 01:21:46AM +0800, Howard Chu wrote:
> This is a bug found when implementing pretty-printing for the
> landlock_add_rule system call; I decided to send this patch separately
> because this is a serious bug that should be fixed fast.
>
> I wrote a test program that does the landlock_add_rule syscall in a
> loop, yet perf trace -e landlock_add_rule freezes, giving no output.
>
> This bug was introduced by a false understanding of the variable "key"
> below:
> ```
> for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) {
>         struct syscall *sc = trace__syscall_info(trace, NULL, key);
>         ...
> }
> ```
> The code above seems right at first, but when looking at
> syscalltbl.c, I found these lines:
>
> ```
> for (i = 0; i <= syscalltbl_native_max_id; ++i)
>         if (syscalltbl_native[i])
>                 ++nr_entries;
>
> entries = tbl->syscalls.entries = malloc(sizeof(struct syscall) * nr_entries);
> ...
>
> for (i = 0, j = 0; i <= syscalltbl_native_max_id; ++i) {
>         if (syscalltbl_native[i]) {
>                 entries[j].name = syscalltbl_native[i];
>                 entries[j].id = i;
>                 ++j;
>         }
> }
> ```
>
> meaning the key is merely an index used to traverse the syscall table,
> not the actual syscall id of that particular syscall.
>
> So if one uses key to do trace__syscall_info(trace, NULL, key), because
> key only goes up to trace->sctbl->syscalls.nr_entries (on my x86_64
> machine this number is 373), it ends up neglecting all the remaining
> syscalls, in my case everything after `rseq`, because the traversal
> stops at 373 and `rseq` is the last syscall whose id is lower than 373,
> in tools/perf/arch/x86/include/generated/asm/syscalls_64.c:
> ```
> ...
>         [334] = "rseq",
>         [424] = "pidfd_send_signal",
> ...
> ```
>
> The reason why the key is scrambled but perf trace still works well is
> that key is used in trace__syscall_info(trace, NULL, key) to do
> trace->syscalls.table[id], which makes sure that the struct syscall
> returned actually has an id with the same value as key, making the
> later bpf_prog matching all correct.

Right, trace->syscalls.table holds info read from tracefs, while
trace->sctbl holds info created from the per-arch syscall tables,
usually from:

⬢[acme@toolbox perf-tools-next]$ find arch -name "*.tbl"
arch/alpha/kernel/syscalls/syscall.tbl
arch/arm/tools/syscall.tbl
arch/m68k/kernel/syscalls/syscall.tbl
arch/microblaze/kernel/syscalls/syscall.tbl
arch/mips/kernel/syscalls/syscall_n32.tbl
arch/mips/kernel/syscalls/syscall_n64.tbl
arch/mips/kernel/syscalls/syscall_o32.tbl
arch/parisc/kernel/syscalls/syscall.tbl
arch/powerpc/kernel/syscalls/syscall.tbl
arch/s390/kernel/syscalls/syscall.tbl
arch/sh/kernel/syscalls/syscall.tbl
arch/sparc/kernel/syscalls/syscall.tbl
arch/x86/entry/syscalls/syscall_32.tbl
arch/x86/entry/syscalls/syscall_64.tbl
arch/xtensa/kernel/syscalls/syscall.tbl
⬢[acme@toolbox perf-tools-next]$

trace->sctbl->syscalls.entries is sorted by name and has
{ syscall_id, name } as its contents.
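To restate the mismatch as a tiny, self-contained toy (not perf code;
the syscall ids shown are the real x86_64 ones, everything else is made
up for illustration): because that array is compacted and sorted by
name, its index says nothing about the syscall id, so using the index
where an id is expected can never reach anything numbered at or above
nr_entries:

```
#include <stdio.h>

/* Toy table in the same shape as tbl->syscalls.entries: compacted and
 * sorted by name, so an array index says nothing about the syscall id. */
struct toy_entry { int id; const char *name; };

static const struct toy_entry entries[] = {
        { 439, "faccessat2" },
        { 257, "openat" },
        { 334, "rseq" },
};
static const int nr_entries = sizeof(entries) / sizeof(entries[0]);

int main(void)
{
        /* Buggy pattern: the loop variable is handed to code that expects a
         * syscall id, so any syscall whose id is >= nr_entries (373 on the
         * x86_64 machine above) can never be looked up. */
        for (int key = 0; key < nr_entries; ++key)
                printf("buggy: looks up id %d, but entry %d really has id %d (%s)\n",
                       key, key, entries[key].id, entries[key].name);
        return 0;
}
```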
But it is loaded only at the time the reuse of BPF programs is
attempted, because initially we were using libaudit to map the id that
we get from the tracepoint payload:

root@number:~# cat /sys/kernel/tracing/events/raw_syscalls/sys_enter/format
name: sys_enter
ID: 361
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:long id;  offset:8;       size:8; signed:1;
              ^^ ^^     ^^      ^^
        field:unsigned long args[6];    offset:16;      size:48;        signed:0;

print fmt: "NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)", REC->id, REC->args[0], REC->args[1], REC->args[2], REC->args[3], REC->args[4], REC->args[5]
root@number:~#

to the syscall name.

Your analysis is perfect, great! Please take a look at the attached
patch, which I'm testing now; it is a simplification of your patch that
avoids exposing that __syscall struct, reducing the patch size by
introducing this instead:

+++ b/tools/perf/util/syscalltbl.c
@@ -123,6 +123,13 @@ int syscalltbl__id(struct syscalltbl *tbl, const char *name)
        return sc ? sc->id : -1;
 }
 
+int syscalltbl__id_at_idx(struct syscalltbl *tbl, int idx)
+{
+       struct syscall *syscalls = tbl->syscalls.entries;
+
+       return idx < tbl->syscalls.nr_entries ? syscalls[idx].id : -1;
+}

and then using it to go from the index to the syscall id in the loops
traversing the sorted trace->sctbl->syscalls.entries.

> After fixing this bug, I can do perf trace on 38 more syscalls, and
> because more syscalls are visible, we get 8 more syscalls that can be
> augmented.
>
> before:
>
> perf $ perf trace -vv --max-events=1 |& grep Reusing
> Reusing "open" BPF sys_enter augmenter for "stat"
> Reusing "open" BPF sys_enter augmenter for "lstat"
> Reusing "fremovexattr" BPF sys_enter augmenter for "execveat"
> Reusing "fremovexattr" BPF sys_enter augmenter for "statx"

TL;DR: I did the above and got:

root@number:~# wc -l before after
  63 before
  71 after
 134 total
root@number:~#

> These are the new syscalls that can be augmented

Which matches the 8 more reused BPF programs.

> Reusing "openat" BPF sys_enter augmenter for "open_tree"
> Reusing "openat" BPF sys_enter augmenter for "openat2"
> Reusing "openat" BPF sys_enter augmenter for "mount_setattr"
> Reusing "openat" BPF sys_enter augmenter for "move_mount"
> Reusing "open" BPF sys_enter augmenter for "fsopen"
> Reusing "openat" BPF sys_enter augmenter for "fspick"
> Reusing "openat" BPF sys_enter augmenter for "faccessat2"
> Reusing "openat" BPF sys_enter augmenter for "fchmodat2"

But interestingly I get this:

root@number:~# grep -E 'for "(open_tree|openat2|mount_setattr|move_mount|fsopen|fspick|faccessat2|fchmodat2)' after
Reusing "faccessat" BPF sys_enter augmenter for "faccessat2"
Reusing "faccessat" BPF sys_enter augmenter for "fchmodat2"
Reusing "access" BPF sys_enter augmenter for "fsopen"
Reusing "faccessat" BPF sys_enter augmenter for "fspick"
Reusing "faccessat" BPF sys_enter augmenter for "mount_setattr"
Reusing "faccessat" BPF sys_enter augmenter for "move_mount"
Reusing "faccessat" BPF sys_enter augmenter for "open_tree"
Reusing "faccessat" BPF sys_enter augmenter for "openat2"
root@number:~#

Which matches my expectations, as the array we're traversing,
trace->sctbl->syscalls.entries, is sorted by name.
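Just to make the intended use concrete, here is a standalone sketch of
the same index -> id translation (toy types and data, not the real perf
structures; the actual change to builtin-trace.c is in the diff at the
end of this message):

```
#include <stdio.h>

struct toy_syscall { int id; const char *name; };
struct toy_tbl { struct toy_syscall *entries; int nr_entries; };

/* Toy counterpart of the new syscalltbl__id_at_idx(): translate a
 * position in the compacted, name-sorted array into the real syscall
 * id, or -1 when the index is out of range. */
static int toy_id_at_idx(struct toy_tbl *tbl, int idx)
{
        return idx < tbl->nr_entries ? tbl->entries[idx].id : -1;
}

int main(void)
{
        struct toy_syscall entries[] = {
                { 439, "faccessat2" },
                { 257, "openat" },
                { 334, "rseq" },
        };
        struct toy_tbl tbl = { entries, 3 };

        /* Corrected loop shape: iterate over indexes, but pass the
         * translated id to whatever expects a syscall id. */
        for (int i = 0; i < tbl.nr_entries; ++i)
                printf("idx %d -> id %d (%s)\n",
                       i, toy_id_at_idx(&tbl, i), entries[i].name);
        return 0;
}
```

Out-of-range indexes map to -1, mirroring what syscalltbl__id() already
returns for unknown names.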
> as for the perf trace output:
>
> before
>
> perf $ perf trace -e faccessat2 --max-events=1
> [no output]
>
> after
>
> perf $ ./perf trace -e faccessat2 --max-events=1
>      0.000 ( 0.037 ms): waybar/958 faccessat2(dfd: 40, filename: "uevent") = 0
>
> P.S. The reason why this bug was not found in the past five years is
> probably that it only happens to the newer syscalls whose ids are
> greater, for instance faccessat2, with id 439, which not a lot of
> people care about when using perf trace.

That, and the fact that the BPF code was hidden behind having to use -e;
that changed relatively recently, when we switched to using BPF skels
for augmenting syscalls in 'perf trace':

⬢[acme@toolbox perf-tools-next]$ git log --oneline tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
a9f4c6c999008c92 perf trace: Collect sys_nanosleep first argument
29d16de26df17e94 perf augmented_raw_syscalls.bpf: Move 'struct timespec64' to vmlinux.h
5069211e2f0b47e7 perf trace: Use the right bpf_probe_read(_str) variant for reading user data
33b725ce7b988756 perf trace: Avoid compile error wrt redefining bool
7d9642311b6d9d31 perf bpf augmented_raw_syscalls: Add an assert to make sure sizeof(augmented_arg->value) is a power of two.
262b54b6c9396823 perf bpf augmented_raw_syscalls: Add an assert to make sure sizeof(saddr) is a power of two.
1836480429d173c0 perf bpf_skel augmented_raw_syscalls: Cap the socklen parameter using &= sizeof(saddr)
cd2cece61ac5f900 perf trace: Tidy comments related to BPF + syscall augmentation
5e6da6be3082f77b perf trace: Migrate BPF augmentation to use a skeleton
⬢[acme@toolbox perf-tools-next]$

⬢[acme@toolbox perf-tools-next]$ git show --oneline --pretty=reference 5e6da6be3082f77b | head -1
5e6da6be3082f77b (perf trace: Migrate BPF augmentation to use a skeleton, 2023-08-10)
⬢[acme@toolbox perf-tools-next]$

I.e. from August 2023. One also had to ask for BUILD_BPF_SKEL=1, which
is now the default if everything it needs is available on the system.

I'll test it together with your btf_enum work.
- Arnaldo

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index c42bc608954ee08d..c4fa8191253d4880 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -3354,8 +3354,6 @@ static int trace__bpf_prog_sys_exit_fd(struct trace *trace, int id)
 static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace, struct syscall *sc)
 {
        struct tep_format_field *field, *candidate_field;
-       int id;
-
        /*
         * We're only interested in syscalls that have a pointer:
         */
@@ -3367,7 +3365,8 @@ static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace
                return NULL;
 
 try_to_find_pair:
-       for (id = 0; id < trace->sctbl->syscalls.nr_entries; ++id) {
+       for (int i = 0; i < trace->sctbl->syscalls.nr_entries; ++i) {
+               int id = syscalltbl__id_at_idx(trace->sctbl, i);
                struct syscall *pair = trace__syscall_info(trace, NULL, id);
                struct bpf_program *pair_prog;
                bool is_candidate = false;
@@ -3456,10 +3455,10 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace)
 {
        int map_enter_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_enter);
        int map_exit_fd  = bpf_map__fd(trace->skel->maps.syscalls_sys_exit);
-       int err = 0, key;
+       int err = 0;
 
-       for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) {
-               int prog_fd;
+       for (int i = 0; i < trace->sctbl->syscalls.nr_entries; ++i) {
+               int prog_fd, key = syscalltbl__id_at_idx(trace->sctbl, i);
 
                if (!trace__syscall_enabled(trace, key))
                        continue;
@@ -3505,7 +3504,8 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace)
         * first and second arg (this one on the raw_syscalls:sys_exit prog
         * array tail call, then that one will be used.
         */
-       for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) {
+       for (int i = 0; i < trace->sctbl->syscalls.nr_entries; ++i) {
+               int key = syscalltbl__id_at_idx(trace->sctbl, i);
                struct syscall *sc = trace__syscall_info(trace, NULL, key);
                struct bpf_program *pair_prog;
                int prog_fd;
diff --git a/tools/perf/util/syscalltbl.c b/tools/perf/util/syscalltbl.c
index 63be7b58761d2f33..0dd26b991b3fb513 100644
--- a/tools/perf/util/syscalltbl.c
+++ b/tools/perf/util/syscalltbl.c
@@ -123,6 +123,13 @@ int syscalltbl__id(struct syscalltbl *tbl, const char *name)
        return sc ? sc->id : -1;
 }
 
+int syscalltbl__id_at_idx(struct syscalltbl *tbl, int idx)
+{
+       struct syscall *syscalls = tbl->syscalls.entries;
+
+       return idx < tbl->syscalls.nr_entries ? syscalls[idx].id : -1;
+}
+
 int syscalltbl__strglobmatch_next(struct syscalltbl *tbl, const char *syscall_glob, int *idx)
 {
        int i;
diff --git a/tools/perf/util/syscalltbl.h b/tools/perf/util/syscalltbl.h
index a41d2ca9e4aebad5..2b53b7ed25a6affe 100644
--- a/tools/perf/util/syscalltbl.h
+++ b/tools/perf/util/syscalltbl.h
@@ -16,6 +16,7 @@ void syscalltbl__delete(struct syscalltbl *tbl);
 
 const char *syscalltbl__name(const struct syscalltbl *tbl, int id);
 int syscalltbl__id(struct syscalltbl *tbl, const char *name);
+int syscalltbl__id_at_idx(struct syscalltbl *tbl, int idx);
 
 int syscalltbl__strglobmatch_first(struct syscalltbl *tbl, const char *syscall_glob, int *idx);
 int syscalltbl__strglobmatch_next(struct syscalltbl *tbl, const char *syscall_glob, int *idx);