Hi Kazu,

On 2022/12/5 9:05, HAGIO KAZUHITO(萩尾 一仁) wrote:
> On 2022/12/02 17:31, dinghui wrote:
>> On 2022/12/2 15:44, HAGIO KAZUHITO(萩尾 一仁) wrote:
>>> On 2022/12/01 16:01, Ding Hui wrote:
>>>> We met "bt" cmd on KASAN kernel vmcore display truncated backtraces
>>>> like this:
>>>>
>>>> crash> bt
>>>> PID: 4131 TASK: ffff8001521df000 CPU: 3 COMMAND: "bash"
>>>>  #0 [ffff2000224b0cb0] machine_kexec_prepare at ffff2000200bff4c
>>>>
>>>> After digging the root cause, it turns out that arm64_in_kdump_text()
>>>> found wrong bt->bptr at "machine_kexec" branch.
>>>>
>>>> Disassemble machine_kexec() of KASAN vmlinux (gcc 7.3.0):
>>>>
>>>> crash> dis -x machine_kexec
>>>> 0xffff2000200bff50 <machine_kexec>:      stp  x29, x30, [sp,#-208]!
>>>> 0xffff2000200bff54 <machine_kexec+0x4>:  mov  x29, sp
>>>> 0xffff2000200bff58 <machine_kexec+0x8>:  stp  x19, x20, [sp,#16]
>>>> 0xffff2000200bff5c <machine_kexec+0xc>:  str  x24, [sp,#56]
>>>> 0xffff2000200bff60 <machine_kexec+0x10>: str  x26, [sp,#72]
>>>> 0xffff2000200bff64 <machine_kexec+0x14>: mov  x2, #0x8ab3
>>>> 0xffff2000200bff68 <machine_kexec+0x18>: add  x1, x29, #0x70
>>>> 0xffff2000200bff6c <machine_kexec+0x1c>: lsr  x1, x1, #3
>>>> 0xffff2000200bff70 <machine_kexec+0x20>: movk x2, #0x41b5, lsl #16
>>>> 0xffff2000200bff74 <machine_kexec+0x24>: mov  x19, #0x200000000000
>>>> 0xffff2000200bff78 <machine_kexec+0x28>: adrp x3, 0xffff2000224b0000
>>>> 0xffff2000200bff7c <machine_kexec+0x2c>: movk x19, #0xdfff, lsl #48
>>>> 0xffff2000200bff80 <machine_kexec+0x30>: add  x3, x3, #0xcb0
>>>> 0xffff2000200bff84 <machine_kexec+0x34>: add  x4, x1, x19
>>>> 0xffff2000200bff88 <machine_kexec+0x38>: stp  x2, x3, [x29,#112]
>>>> 0xffff2000200bff8c <machine_kexec+0x3c>: adrp x2, 0xffff2000200bf000 <swsusp_arch_resume+0x1e8>
>>>> 0xffff2000200bff90 <machine_kexec+0x40>: add  x2, x2, #0xf50
>>>> 0xffff2000200bff94 <machine_kexec+0x44>: str  x2, [x29,#128]
>>>> 0xffff2000200bff98 <machine_kexec+0x48>: mov  w2, #0xf1f1f1f1
>>>> 0xffff2000200bff9c <machine_kexec+0x4c>: str  w2, [x1,x19]
>>>> 0xffff2000200bffa0 <machine_kexec+0x50>: mov  w2, #0xf200
>>>> 0xffff2000200bffa4 <machine_kexec+0x54>: mov  w1, #0xf3f3f3f3
>>>> 0xffff2000200bffa8 <machine_kexec+0x58>: movk w2, #0xf2f2, lsl #16
>>>> 0xffff2000200bffac <machine_kexec+0x5c>: stp  w2, w1, [x4,#4]
>>>>
>>>> We notice that:
>>>> 1. machine_kexec() start address is 0xffff2000200bff50
>>>> 2. the instruction at machine_kexec+0x44 stores the same value
>>>>    0xffff2000200bff50 (comes from 0xffff2000200bf000 + 0xf50)
>>>>    into stack position [x29,#128].
>>>>
>>>> When arm64_in_kdump_text() searches LR from stack, it met
>>>> 0xffff2000200bff50 firstly, so got wrong bt->bptr.
>>>>
>>>> We know that the real LR is always greater than the start address
>>>
>>> Seems true.
>>>
>>> One question, do you see which kernel code stores that value?
>>
>> Actually, there is no C code that stores that value.
>>
>> The source code is like this:
>>
>> void machine_kexec(struct kimage *kimage)
>> {
>>         phys_addr_t reboot_code_buffer_phys;
>>         void *reboot_code_buffer;
>>         bool in_kexec_crash = (kimage == kexec_crash_image);
>>         bool stuck_cpus = cpus_are_stuck_in_kernel();
>>
>>         BUG_ON(!in_kexec_crash && (stuck_cpus || (num_online_cpus() > 1)));
>>         WARN(in_kexec_crash && (stuck_cpus || smp_crash_stop_failed()),
>>                 "Some CPUs may be stale, kdump will be unreliable.\n");
>>         ...
>>
>> The point is CONFIG_KASAN=y
>>
>> I compared the gcc args when compiling machine_kexec.o between
>> kasan enable [1] and kasan enable but set
>> KASAN_SANITIZE_machine_kexec.o := n [2], the difference is:
>>
>> [1]: -fsanitize=kernel-address -fasan-shadow-offset=0xdfff200000000000
>>      --param asan-globals=1
>>      --param asan-instrumentation-with-call-threshold=10000
>>      --param asan-stack=1
>> [2]: -fno-builtin
>>
>> If I remove `--param asan-stack=1` but keep other asan args to compile
>> machine_kexec.o, those assembly statements disappear.
>
> I see, thanks.
>
> I can see the similar pattern with CONFIG_KASAN=y also on x86_64, which
> stores the function start address and uses 0xf1f1f1f1
> (ASAN_STACK_MAGIC_LEFT in gcc) and etc.
>
> (gdb) disas machine_kexec
> Dump of assembler code for function machine_kexec:
>    0xffffffff8109b1c0 <+0>:     callq  0xffffffff81099e60 <__fentry__>
>    ...
>    0xffffffff8109b208 <+72>:    movq   $0xffffffff8109b1c0,0x20(%rsp)
>    0xffffffff8109b211 <+81>:    add    %r12,%rax
>    0xffffffff8109b214 <+84>:    movl   $0xf1f1f1f1,(%rax)
>
> (gdb) disas crash_save_cpu
> Dump of assembler code for function crash_save_cpu:
>    0xffffffff8126e7e0 <+0>:     callq  0xffffffff81099e60 <__fentry__>
>    ...
>    0xffffffff8126e817 <+55>:    movq   $0xffffffff8126e7e0,0x10(%rsp)
>    0xffffffff8126e820 <+64>:    add    %rbp,%rax
>    0xffffffff8126e823 <+67>:    movl   $0xf1f1f1f1,(%rax)
>
> I wondered whether excluding only their start address was enough to fix
> the issue, but now it seems ok to me.

I found a description of asan-stack here:
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/asan.cc;h=dc7b7f4bcf1803dd2ffbbaad782cf1b515d61ed8;hb=HEAD#l156

139 The 32 bytes of LEFT red zone at the bottom of the stack can be
140 decomposed as such:
...
156 3/ The following 8 bytes contain the PC of the current function which
157 will be used by the run-time library to print an error message.

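
To make the reasoning concrete, here is a minimal sketch of the stack scan discussed in this thread. It is not the crash code itself: find_lr_slot(), MACHINE_KEXEC_END and the contents of example_stack are hypothetical and only for illustration. The start address, the 0x41b58ab3 word and the 0xffff2000224b0cb0 pointer are the values visible in the arm64 disassembly quoted in this thread; the "real LR" value and the end address are made up. The sketch scans from the highest slot downwards and compares an inclusive lower bound with one that excludes the function start address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Start address of machine_kexec() from the disassembly in this thread. */
#define MACHINE_KEXEC_START 0xffff2000200bff50ULL
/* ASSUMPTION: the end address is not shown in the thread; made up here. */
#define MACHINE_KEXEC_END   (MACHINE_KEXEC_START + 0x400)

/*
 * Hypothetical helper (not a crash function): walk the saved stack words
 * from the highest slot down and return the index of the first word that
 * lies inside [func_start, func_end). With exclude_start != 0 the function
 * start address itself is rejected, because a return address always points
 * past the first instruction of the function.
 * Returns -1 when no slot matches.
 */
static ptrdiff_t find_lr_slot(const uint64_t *stack, size_t nwords,
                              uint64_t func_start, uint64_t func_end,
                              int exclude_start)
{
    assert(func_start < func_end);

    for (size_t i = nwords; i-- > 0; ) {
        uint64_t v = stack[i];
        int above_start = exclude_start ? (v > func_start)
                                        : (v >= func_start);
        if (above_start && v < func_end)
            return (ptrdiff_t)i;
    }
    return -1;
}

/*
 * Illustrative stack contents, lowest address first. Slots 2..4 mirror the
 * three stores at [x29,#112], [x29,#120] and [x29,#128] in the disassembly;
 * slot 1 is a made-up return address inside machine_kexec().
 */
static const uint64_t example_stack[] = {
    0x0,                         /* unrelated data                        */
    MACHINE_KEXEC_START + 0x1a4, /* hypothetical real LR                  */
    0x41b58ab3ULL,               /* word stored at [x29,#112]             */
    0xffff2000224b0cb0ULL,       /* pointer stored at [x29,#120]          */
    MACHINE_KEXEC_START,         /* function start stored at [x29,#128]   */
};
```

With the inclusive check, the scan stops at slot 4, the function start address written by the asan-stack prologue. With the start address excluded, it skips that slot and stops at slot 1, the hypothetical real LR.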
> Acked-by: Kazuhito Hagio <k-hagio-ab@xxxxxxx>
>
> Let's wait for Lianbo's test and review.
>
> Thanks,
> Kazu

--
Thanks,
- Ding Hui

--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/crash-utility
Contribution Guidelines: https://github.com/crash-utility/crash/wiki