On Tue, 2024-08-20 at 17:38 -0700, Tony Ambardar wrote:

[...]

> I used the command line:
> ./test_progs -d get_stack_raw_tp,stacktrace_build_id,verifier_iterating_callbacks,tailcalls
>
> which includes the current DENYLIST.s390x as well as 'tailcalls', which
> is also excluded by the kernel-patches/bpf s390x CI. I note the CI
> excludes several more tests that seem to work. Any idea why that is?
>
> For reference, the issue with 'tailcalls/tailcall_hierarchy_count' is an
> RCU stall and kernel hang:
>
> root@(none):/usr/libexec/kselftests-bpf# ./test_progs -v --debug -n 332/19
> bpf_testmod.ko is already unloaded.
> Loading bpf_testmod.ko...
> Successfully loaded bpf_testmod.ko.
> test_tailcall_hierarchy_count:PASS:load obj 0 nsec
> test_tailcall_hierarchy_count:PASS:find entry prog 0 nsec
> test_tailcall_hierarchy_count:PASS:prog_fd 0 nsec
> test_tailcall_hierarchy_count:PASS:find jmp_table 0 nsec
> test_tailcall_hierarchy_count:PASS:map_fd 0 nsec
> test_tailcall_hierarchy_count:PASS:update jmp_table 0 nsec
> test_tailcall_hierarchy_count:PASS:find data_map 0 nsec
> test_tailcall_hierarchy_count:PASS:open fentry_obj file 0 nsec
> test_tailcall_hierarchy_count:PASS:find fentry prog 0 nsec
> test_tailcall_hierarchy_count:PASS:set_attach_target subprog_tail 0 nsec
> test_tailcall_hierarchy_count:PASS:load fentry_obj 0 nsec
> test_tailcall_hierarchy_count:PASS:attach_trace 0 nsec
> rcu: INFO: rcu_sched self-detected stall on CPU
> rcu: 0-....: (1 GPs behind) idle=4eb4/1/0x4000000000000000 softirq=527/528 fqs=1050
> rcu: (t=2100 jiffies g=-379 q=20 ncpus=2)
> CPU: 0 UID: 0 PID: 84 Comm: test_progs Tainted: G O 6.10.0-12706-g853081e84612-dirty #111
> Tainted: [O]=OOT_MODULE
> Hardware name: QEMU 8561 QEMU (KVM/Linux)
> Krnl PSW : 0704f00180000000 000003ffe00f8fca (lock_release+0xf2/0x190)
>            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3
> Krnl GPRS: 00000000b298dd12 0000000000000000 000002f23fd767c8 000003ffe1848800
>            0000000000000001 0000037fe034edbc 0000037fe034fd74 0000000000000001
>            0700037fe034edc8 000003ffe0249e48 000003ffe1848800 000003ffe19ba7c8
>            000003ff9f7a7f90 0000037fe034ef00 000003ffe00f8f96 0000037fe034ed78
> Krnl Code: 000003ffe00f8fbe: a7820300      tmhh   %r8,768
>            000003ffe00f8fc2: a7840004      brc    8,000003ffe00f8fca
>           #000003ffe00f8fc6: ad03f0a0      stosm  160(%r15),3
>           >000003ffe00f8fca: eb8ff0a80004  lmg    %r8,%r15,168(%r15)
>            000003ffe00f8fd0: 07fe          bcr    15,%r14
>            000003ffe00f8fd2: c0e500011057  brasl  %r14,000003ffe011b080
>            000003ffe00f8fd8: ec26ffa6007e  cij    %r2,0,6,000003ffe00f8f24
>            000003ffe00f8fde: c01000b78b96  larl   %r1,000003ffe17ea70a
> Call Trace:
>  [<000003ffe00f8fca>] lock_release+0xf2/0x190
> ([<000003ffe00f8f96>] lock_release+0xbe/0x190)
>  [<000003ffe0249ea4>] __bpf_prog_exit_recur+0x5c/0x68
>  [<000003ff6001e0b0>] bpf_trampoline_73014444060+0xb0/0xd2
>  [<000003ff60024d14>] bpf_prog_eb7edc599e93dcc8_entry+0x5c/0xc8
>  [<000003ff60024d14>] bpf_prog_eb7edc599e93dcc8_entry+0x5c/0xc8
>  [<000003ff60024d14>] bpf_prog_eb7edc599e93dcc8_entry+0x5c/0xc8
>  [<000003ff60024d2a>] bpf_prog_eb7edc599e93dcc8_entry+0x72/0xc8
>  [<000003ff60024d2a>] bpf_prog_eb7edc599e93dcc8_entry+0x72/0xc8
>  [<000003ff60024d14>] bpf_prog_eb7edc599e93dcc8_entry+0x5c/0xc8
>  [<000003ff60024d14>] bpf_prog_eb7edc599e93dcc8_entry+0x5c/0xc8
>  [<000003ff60024d14>] bpf_prog_eb7edc599e93dcc8_entry+0x5c/0xc8
>  [<000003ffe084ecee>] bpf_test_run+0x216/0x3a8
>  [<000003ffe084f9cc>] bpf_prog_test_run_skb+0x21c/0x630
>  [<000003ffe0202ad2>] __sys_bpf+0x7ea/0xbb0
>  [<000003ffe0203114>] __s390x_sys_bpf+0x44/0

Thanks for the detailed analysis! I will need to port

commit 116e04ba1459fc08f80cf27b8c9f9f188be0fcb2
Author: Leon Hwang <hffilwlqm@xxxxxxxxx>
Date:   Sun Jul 14 20:39:00 2024 +0800

    bpf, x64: Fix tailcall hierarchy

to s390x to fix this.
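As background for the tailcall_hierarchy_count stall, here is a simplified user-space C model (not the x64 or s390x JIT code) of why the tail call counter has to be shared across subprogram frames rather than copied into each one. It assumes a program shape in which entry() calls a subprogram twice and that subprogram tail-calls back into entry(), which is roughly what the call trace suggests (entry re-entered from two call sites). All function names are hypothetical, and `limit` stands in for the kernel's MAX_TAIL_CALL_CNT.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model, NOT the BPF JIT code. "limit" plays the role of
 * MAX_TAIL_CALL_CNT. entry() calls subprog() twice; subprog() tail-calls
 * entry() while the counter is below the limit. A real tail call replaces
 * the subprog frame, but for counting entry() runs a plain call is
 * equivalent.
 */
static uint64_t entry_runs;

/* Copied counter: each subprog frame gets its own copy. */
static void entry_by_value(int tcc, int limit);

static void subprog_by_value(int tcc, int limit)
{
	if (tcc >= limit)
		return;                        /* tail call refused */
	entry_by_value(tcc + 1, limit);        /* increment lost on return */
}

static void entry_by_value(int tcc, int limit)
{
	entry_runs++;
	subprog_by_value(tcc, limit);
	subprog_by_value(tcc, limit);
}

/* Shared counter: all frames update one counter through a pointer. */
static void entry_by_pointer(int *tcc, int limit);

static void subprog_by_pointer(int *tcc, int limit)
{
	if (*tcc >= limit)
		return;
	(*tcc)++;
	entry_by_pointer(tcc, limit);
}

static void entry_by_pointer(int *tcc, int limit)
{
	entry_runs++;
	subprog_by_pointer(tcc, limit);
	subprog_by_pointer(tcc, limit);
}

/* Number of entry() runs with a copied counter: 2^(limit+1) - 1. */
uint64_t count_by_value(int limit)
{
	assert(limit >= 0 && limit <= 24);     /* keep the demo cheap */
	entry_runs = 0;
	entry_by_value(0, limit);
	return entry_runs;
}

/* Number of entry() runs with a shared counter: limit + 1. */
uint64_t count_by_pointer(int limit)
{
	int tcc = 0;

	assert(limit >= 0);
	entry_runs = 0;
	entry_by_pointer(&tcc, limit);
	return entry_runs;
}
```

Under these assumptions, count_by_value(10) returns 2047 (2^11 - 1) while count_by_pointer(10) returns 11. With MAX_TAIL_CALL_CNT = 33, the copied counter would let entry() run 2^34 - 1 (roughly 1.7 * 10^10) times, which fits a stall rather than a clean failure. Propagating a pointer to one shared counter (as far as I understand, the approach of the x64 fix) bounds the total to limit + 1 runs in this model.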
> Another curiosity is with 'uprobe_multi_test/attach_uprobe_fails',
> which usually succeeds but generates an inode warning in
> kernel/events/uprobes.c: (with cross-compiled and native test_progs)
>
> #416 uprobe_autoattach:OK
> ref_ctr_offset mismatch. inode: 0x73c7 offset: 0x3c9b78 ref_ctr_offset(old): 0x464d7be ref_ctr_offset(new): 0x464d7bc
> #417/1 uprobe_multi_test/skel_api:OK
> #417/2 uprobe_multi_test/attach_api_pattern:OK
> #417/3 uprobe_multi_test/attach_api_syms:OK
> #417/4 uprobe_multi_test/link_api:OK
> #417/5 uprobe_multi_test/bench_uprobe:OK
> #417/6 uprobe_multi_test/bench_usdt:OK
> #417/7 uprobe_multi_test/attach_api_fails:OK
> #417/8 uprobe_multi_test/attach_uprobe_fails:OK
> #417/9 uprobe_multi_test/consumers:OK
> #417 uprobe_multi_test:OK
>
> but occasionally I see this kernel fault:
>
> #416 uprobe_autoattach:OK
> User process fault: interruption code 0001 ilc:1 in test_progs[3c9ba2,2aa3b580000+cc5000]
> CPU: 0 UID: 0 PID: 165 Comm: new_name Tainted: G OE 6.10.0-12707-g8189b8007d01 #114
> Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
> Hardware name: QEMU 8561 QEMU (KVM/Linux)
> User PSW : 0705000180000000 000002aa3b949ba2
>            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:0 PM:0 RI:0 EA:3
> User GPRS: cccccccccccccccd 0000000000000000 000003ffbe080000 0000000000000000
>            000003ffbeb74828 0000000000000006 0000000000000000 000002aa3c245928
>            000003ffbeb2cbc0 000003ffbeb2d020 0000000000000003 000003ffdb379f20
>            000003ffbeb2cf98 0000000000000000 000002aa3b94a400 000003ffdb379f20
> User Code:>000002aa3b949ba2: 0000          illegal
>            000002aa3b949ba4: 0700          bcr    0,%r0
>            000002aa3b949ba6: b3cd00b0      lgdr   %r11,%f0
>            000002aa3b949baa: 07fe          bcr    15,%r14
>            000002aa3b949bac: 0707          bcr    0,%r7
>            000002aa3b949bae: 0707          bcr    0,%r7
>            000002aa3b949bb0: ebbff0580024  stmg   %r11,%r15,88(%r15)
>            000002aa3b949bb6: e3f0ff48ff71  lay    %r15,-184(%r15)
> Last Breaking-Event-Address:
>  [<000002aa3b94a3fa>] test_progs[3ca3fa,2aa3b580000+cc5000]
>
> Have you seen this fault before? Is the inode warning expected by the
> test?

Yes, the warning is caused by:

        /* attach fail due to wrong ref_ctr_offs on one of the uprobes */
        attach_uprobe_fail_refctr(skel);

The fault is a user fault, not a kernel fault. I could not reproduce it
on a real s390x machine. This may be an emulation problem, since
apparently the kernel does not recognize that "0000 illegal" is an
uprobe. Quite some time ago I fixed a similar issue in this area;
perhaps this is a new flavour of it. I will investigate.

[...]

Best regards,
Ilya