On Mon, Mar 18, 2024 at 1:38 PM Ilya Leoshkevich <iii@xxxxxxxxxxxxx> wrote: > > On Fri, Mar 15, 2024 at 09:59:52AM -0700, Kui-Feng Lee wrote: > > Hi Josh, Sumath, and Heiko, > > > > Recently, all patches submitted to bpf have been failing the selftest > > "dummy_st_ops/dummy_init_ptr_arg" on the s390x platform. Upon > > investigation, it was determined that a specific patch, 778666df60f0 > > ("s390: compile relocatable kernel without -fPIE"), is causing the problem. > > It is the first bad after bisecting. Please take a look. > > Thank you! > > > > You can reproduce it by running the command > > > > ./test_progs -a dummy_st_ops/dummy_init_ptr_arg > > > > in the tools/testing/selftests/bpf directory in the Linux source tree. > > This command causes a kernel crash with the following messages. It > > appears to be related to attaching a trace program at the fentry of a > > struct_ops operator. The same operator works fine without the trace > > program. For more details about the test case, please check > > dummy_st_ops.c in tools/testing/selftests/bpf/prog_tests/. > > > > illegal operation: 0001 ilc:1 [#1] SMP > > Modules linked in: > > CPU: 0 PID: 82 Comm: test_progs Tainted: G OE > > 6.8.0-rc3-g778666df60f0 #34 > > Hardware name: QEMU 3906 QEMU (KVM/Linux) > > Krnl PSW : 0704d00180000000 0000000000000002 (0x2) > > R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3 > > Krnl GPRS: 000003ff800a4956 0000000000000000 0000038000a73d20 > > 0000000082a08b58 > > 0000000000000000 000000007083ede2 0000000000000000 > > 0000000000000001 > > 00000000009d580e 000000008286bc00 0000037f000000d8 > > 000003ff800b2000 > > 0000000081a43c00 00000000011bfe20 000003ff800b208e > > 0000038000a73c68 > > Krnl Code:#0000000000000000: 0000 illegal > > >0000000000000002: 0000 illegal > > 0000000000000004: 0000 illegal > > 0000000000000006: 0000 illegal > > 0000000000000008: 0000 illegal > > 000000000000000a: 0000 illegal > > 000000000000000c: 0000 illegal > > 000000000000000e: 0000 illegal > > Call Trace: > > [<0000000000000002>] 0x2 > > [<00000000009d5cde>] bpf_struct_ops_test_run+0x156/0x250 > > [<000000000033145a>] __sys_bpf+0xa1a/0xd00 > > [<00000000003319dc>] __s390x_sys_bpf+0x44/0x50 > > [<0000000000c4382c>] __do_syscall+0x244/0x300 > > [<0000000000c59a40>] system_call+0x70/0x98 > > INFO: lockdep is turned off. > > Last Breaking-Event-Address: > > [<000003ff800a49ac>] bpf_prog_83ebf4e90b071f2b_test_1+0x5c/0x70 > > Kernel panic - not syncing: Fatal exception: panic_on_oops > > > > > > [1] > > https://github.com/kernel-patches/bpf/actions/runs/8297891630/job/22710267571 > > Hi, > > thanks for the report, I'm currently looking into it. > > The problem is that the following: > > static void bpf_jit_plt(void *plt, void *ret, void *target) > { > memcpy(plt, bpf_plt, BPF_PLT_SIZE); > *(void **)((char *)plt + (bpf_plt_ret - bpf_plt)) = ret; > *(void **)((char *)plt + (bpf_plt_target - bpf_plt)) = target ?: ret; > } > > is (mis)compiled to the following: > > 0x000000000017a26e <+238>: stg %r4,232(%r3,%r15) > 0x000000000017a274 <+244>: ltgr %r11,%r11 > 0x000000000017a278 <+248>: locgrne %r4,%r11 > 0x000000000017a27c <+252>: stg %r4,232(%r1,%r15) > 0x000000000017a282 <+258>: la %r3,232(%r1,%r15) > 0x000000000017a286 <+262>: lghi %r4,8 > 0x000000000017a28a <+266>: mvc 232(16,%r15),0(%r2) > 0x000000000017a290 <+272>: mvc 248(16,%r15),16(%r2) > > Note that memcpy() and assignments are reordered, and this is causing > the crash. The C code violates the strict aliasing rules, but this > should be irrelevant for the kernel. > > The problem does not occur with the latest GCC. Bisect says that the > following commit fixed it: > > commit 2e96b5f14e4025691b57d2301d71aa6092ed44bc > Author: Aldy Hernandez <aldyh@xxxxxxxxxx> > Date: Tue Jun 15 12:32:51 2021 +0200 > > Backwards jump threader rewrite with ranger. > > but I think that's accidental. The reordering is done by the sched1 > pass. I plan to find out why it thinks it's okay to reorder these > memory accesses. Wow. What a discovery. Does this bug affect other architectures ? Probably yes, since it's sched1 pass ? Should we apply some workaround for now? Also it's probably not strict aliasing related: extern const char bpf_plt[]; extern const char bpf_plt_ret[]; extern const char bpf_plt_target[]; and bpf_jit_plt() casts the pointer to 'char *'. The 'char *' was always special from GCC strict aliasing point of view. iirc gcc used to treat such pointers as pointing anywhere.