czw., 28 sty 2021 o 15:32 Marcin Ślusarz <marcin.slusarz@xxxxxxxxx> napisał(a): > > czw., 28 sty 2021 o 13:39 Rafael J. Wysocki <rafael@xxxxxxxxxx> napisał(a): > > The only explanation for that I can think about (and which does not > > involve supernatural intervention so to speak) is a stack corruption > > occurring between these two calls in sdw_intel_acpi_cb(). IOW, > > something scribbles on the handle in the meantime, but ATM I have no > > idea what that can be. > > I tried KASAN but it didn't find anything and kernel actually booted > successfully. I investigated this and it looks like a compiler bug (or something nastier), but I can't find where exactly registers get corrupted because if I add printks the corruption seems on the printk side, but if I don't add them it seems the value gets corrupted earlier. Here's what I tried: 1) If I add printk(KERN_INFO "%s handle: %p\n", __func__, handle); to acpi_ns_validate_handle before and after call to ACPI_GET_DESCRIPTOR_TYPE I get this: acpi_ns_validate_handle handle: 00000000433f39ec BUG: kernel NULL pointer dereference, address: 0000000000000050 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP NOPTI CPU: 3 PID: 473 Comm: systemd-udevd Tainted: G W E 5.10.11+ #13 Hardware name: HP HP Pavilion Laptop 15-cs3xxx/86E2, BIOS F.05 01/01/2020 RIP: 0010:acpi_ns_validate_handle+0x39/0x62 Code: 49 8d 44 24 ff 48 83 f8 fd 76 09 4c 8b 25 4a b4 89 01 eb 39 4c 89 e2 48 c7 c6 c0 a4 6c 8f 48 c7 c7 46 6d 94 8f e8 eb 87 31 00 <41> 80 7c 24 08 0f 75 18 4c 89 e2 48 c7 c6 c0 a4 6c 8f 48 c7 c7 46 RSP: 0018:ffff9f488060bb00 EFLAGS: 00010246 RAX: 0000000000000030 RBX: ffff9f488060bb68 RCX: ffff889d5fad8a08 RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff889d5fad8a00 RBP: 0000000000000048 R08: 0000000000000000 R09: ffff9f488060b928 R10: ffff9f488060b920 R11: ffffffff8fcdc3f8 R12: 0000000000000048 R13: 0000000000000000 R14: ffffffff8ed400f0 R15: 0000000000000000 FS: 00007f811fdbe8c0(0000) GS:ffff889d5fac0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000050 CR3: 000000026877c001 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: acpi_get_data_full+0x81/0xdf acpi_bus_get_device+0x32/0xa0 sdw_intel_acpi_scan.cold+0x23/0x21c [soundwire_intel] snd_intel_dsp_driver_probe.cold+0x187/0x1b2 [snd_intel_dspcfg] azx_probe+0x7a/0x970 [snd_hda_intel] local_pci_probe+0x42/0x80 ? _cond_resched+0x16/0x40 pci_device_probe+0xfd/0x1b0 really_probe+0xf2/0x440 driver_probe_device+0xe1/0x150 device_driver_attach+0xa1/0xb0 __driver_attach+0x8a/0x150 ? device_driver_attach+0xb0/0xb0 ? device_driver_attach+0xb0/0xb0 bus_for_each_dev+0x78/0xc0 bus_add_driver+0x12b/0x1e0 driver_register+0x8b/0xe0 ? 0xffffffffc128a000 do_one_initcall+0x44/0x1d0 ? do_init_module+0x23/0x260 ? kmem_cache_alloc_trace+0xf5/0x200 do_init_module+0x5c/0x260 Generic FE-GE Realtek PHY r8169-1000:00: attached PHY driver [Generic FE-GE Realtek PHY] (mii_bus:phy_addr=r8169-1000:00, irq=IGNORE) __do_sys_finit_module+0xb1/0x110 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 $ gdb vmlinux (...) (gdb) disassemble acpi_ns_validate_handle Dump of assembler code for function acpi_ns_validate_handle: 0xffffffff815781cd <+0>: call 0xffffffff81062c40 <__fentry__> 0xffffffff815781d2 <+5>: push %r12 0xffffffff815781d4 <+7>: mov %rdi,%r12 0xffffffff815781d7 <+10>: call 0xffffffff81584a44 <acpi_ut_track_stack_ptr> 0xffffffff815781dc <+15>: lea -0x1(%r12),%rax 0xffffffff815781e1 <+20>: cmp $0xfffffffffffffffd,%rax 0xffffffff815781e5 <+24>: jbe 0xffffffff815781f0 <acpi_ns_validate_handle+35> 0xffffffff815781e7 <+26>: mov 0x189b44a(%rip),%r12 # 0xffffffff82e13638 <acpi_gbl_root_node> 0xffffffff815781ee <+33>: jmp 0xffffffff81578229 <acpi_ns_validate_handle+92> 0xffffffff815781f0 <+35>: mov %r12,%rdx 0xffffffff815781f3 <+38>: mov $0xffffffff81eca4c0,%rsi 0xffffffff815781fa <+45>: mov $0xffffffff82146d46,%rdi 0xffffffff81578201 <+52>: call 0xffffffff818909f1 <printk> 0xffffffff81578206 <+57>: cmpb $0xf,0x8(%r12) 0xffffffff8157820c <+63>: jne 0xffffffff81578226 <acpi_ns_validate_handle+89> 0xffffffff8157820e <+65>: mov %r12,%rdx 0xffffffff81578211 <+68>: mov $0xffffffff81eca4c0,%rsi 0xffffffff81578218 <+75>: mov $0xffffffff82146d46,%rdi 0xffffffff8157821f <+82>: call 0xffffffff818909f1 <printk> 0xffffffff81578224 <+87>: jmp 0xffffffff81578229 <acpi_ns_validate_handle+92> 0xffffffff81578226 <+89>: xor %r12d,%r12d 0xffffffff81578229 <+92>: mov %r12,%rax 0xffffffff8157822c <+95>: pop %r12 0xffffffff8157822e <+97>: ret End of assembler dump. (...) (gdb) print (char *)0xffffffff82146d46 $2 = 0xffffffff82146d46 "\001\066%s handle: %p\n" (gdb) print (char *)0xffffffff81eca4c0 $3 = 0xffffffff81eca4c0 <__func__.3> "acpi_ns_validate_handle" The crash is on cmpb $0xf,0x8(%r12). %r12 is 0x48, so 0x48 + 0x8 == 0x50 and that's the address we are crashing on. However, how can %r12 become 0x48 when few instructions above we called printk and the value we see in the kernel log is different? This probably means that printk is corrupting it... (or it's a CPU bug, yeah right, but with my luck for such issues who knows ;) I don't dare to debug printk. 2) Without printks I get this: BUG: kernel NULL pointer dereference, address: 0000000000000050 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP NOPTI CPU: 0 PID: 500 Comm: systemd-udevd Tainted: G W E 5.10.11+ #14 Hardware name: HP HP Pavilion Laptop 15-cs3xxx/86E2, BIOS F.05 01/01/2020 RIP: 0010:acpi_ns_validate_handle+0x23/0x34 Code: 41 5d 41 5e 41 5f c3 0f 1f 44 00 00 41 54 49 89 fc e8 8d c7 00 00 49 8d 44 24 ff 48 83 f8 fd 76 09 4c 8b 25 4a b4 89 01 eb 0b <41> 80 7c 24 08 0f 74 03 45 31 e4 4c 89 e0 41 5c c3 0f 1f 44 00 00 RSP: 0018:ffffbfc0c08dfb08 EFLAGS: 00010213 RAX: 0000000000000047 RBX: ffffbfc0c08dfb70 RCX: 0000000000000000 RDX: ffffffffc11204b1 RSI: 0000000000000246 RDI: 0000000000000048 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: ffffffffc11204b1 R11: 0000000000000001 R12: 0000000000000048 R13: ffffffffa37400f0 R14: 0000000000000048 R15: 0000000000000000 FS: 00007fbfdfd478c0(0000) GS:ffff9cd01fa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000050 CR3: 00000001401de003 CR4: 0000000000770ef0 PKRU: 55555554 Call Trace: acpi_get_data_full+0x4d/0x92 acpi_bus_get_device+0x32/0xa0 sdw_intel_acpi_scan+0x59/0x230 [soundwire_intel] ? strstr+0x22/0x60 snd_intel_dsp_driver_probe.cold+0xaf/0x163 [snd_intel_dspcfg] azx_probe+0x7a/0x970 [snd_hda_intel] local_pci_probe+0x42/0x80 ? _cond_resched+0x16/0x40 pci_device_probe+0xfd/0x1b0 really_probe+0xf2/0x440 driver_probe_device+0xe1/0x150 device_driver_attach+0xa1/0xb0 __driver_attach+0x8a/0x150 ? device_driver_attach+0xb0/0xb0 ? device_driver_attach+0xb0/0xb0 bus_for_each_dev+0x78/0xc0 bus_add_driver+0x12b/0x1e0 driver_register+0x8b/0xe0 ? 0xffffffffc1112000 do_one_initcall+0x44/0x1d0 ? do_init_module+0x23/0x260 ? kmem_cache_alloc_trace+0xf5/0x200 do_init_module+0x5c/0x260 __do_sys_finit_module+0xb1/0x110 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 $ gdb vmlinux (...) (gdb) disassemble acpi_ns_validate_handle Dump of assembler code for function acpi_ns_validate_handle: 0xffffffff815781cd <+0>: call 0xffffffff81062c40 <__fentry__> 0xffffffff815781d2 <+5>: push %r12 0xffffffff815781d4 <+7>: mov %rdi,%r12 0xffffffff815781d7 <+10>: call 0xffffffff81584969 <acpi_ut_track_stack_ptr> 0xffffffff815781dc <+15>: lea -0x1(%r12),%rax 0xffffffff815781e1 <+20>: cmp $0xfffffffffffffffd,%rax 0xffffffff815781e5 <+24>: jbe 0xffffffff815781f0 <acpi_ns_validate_handle+35> 0xffffffff815781e7 <+26>: mov 0x189b44a(%rip),%r12 # 0xffffffff82e13638 <acpi_gbl_root_node> 0xffffffff815781ee <+33>: jmp 0xffffffff815781fb <acpi_ns_validate_handle+46> 0xffffffff815781f0 <+35>: cmpb $0xf,0x8(%r12) 0xffffffff815781f6 <+41>: je 0xffffffff815781fb <acpi_ns_validate_handle+46> 0xffffffff815781f8 <+43>: xor %r12d,%r12d 0xffffffff815781fb <+46>: mov %r12,%rax 0xffffffff815781fe <+49>: pop %r12 0xffffffff81578200 <+51>: ret End of assembler dump. Again, we crash on cmpb $0xf,0x8(%r12). %r12 == 0x48, so again we crash on address 0x50. Why is %r12 == 0x48? It was either corrupted by acpi_ut_track_stack_ptr or before we even entered this function. It's weird that the value is exactly the same when corruption has a different origin... 3) Without printks and with ACPI_DEBUG=n BUG: kernel NULL pointer dereference, address: 0000000000000050 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP NOPTI CPU: 7 PID: 473 Comm: systemd-udevd Tainted: G W E 5.10.11+ #15 Hardware name: HP HP Pavilion Laptop 15-cs3xxx/86E2, BIOS F.05 01/01/2020 RIP: 0010:acpi_ns_validate_handle+0x1a/0x23 Code: 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 44 00 00 48 8d 57 ff 48 89 f8 48 83 fa fd 76 08 48 8b 05 8c 28 6a 01 c3 <80> 7f 08 0f 74 02 31 c0 c3 0f 1f 44 00 00 48 8b 3d 76 28 6a 01 e8 RSP: 0018:ffffa96ac0847b20 EFLAGS: 00010213 RAX: 0000000000000048 RBX: ffffa96ac0847b70 RCX: 0000000000000000 RDX: 0000000000000047 RSI: 0000000000000246 RDI: 0000000000000048 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: ffffffffc10ee4b1 R11: ffffffffa14e3268 R12: 0000000000001001 R13: ffffffffa073f150 R14: 0000000000000048 R15: 0000000000000000 FS: 00007f5a27c588c0(0000) GS:ffff91555fbc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000050 CR3: 000000010ab64004 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: acpi_get_data_full+0x4d/0x92 acpi_bus_get_device+0x26/0x50 sdw_intel_acpi_scan+0x59/0x230 [soundwire_intel] ? strstr+0x22/0x60 snd_intel_dsp_driver_probe.cold+0xaf/0x163 [snd_intel_dspcfg] azx_probe+0x7a/0x970 [snd_hda_intel] local_pci_probe+0x42/0x80 ? _cond_resched+0x16/0x40 pci_device_probe+0xfd/0x1b0 really_probe+0xf2/0x440 driver_probe_device+0xe1/0x150 device_driver_attach+0xa1/0xb0 __driver_attach+0x8a/0x150 ? device_driver_attach+0xb0/0xb0 ? device_driver_attach+0xb0/0xb0 bus_for_each_dev+0x78/0xc0 bus_add_driver+0x12b/0x1e0 driver_register+0x8b/0xe0 ? 0xffffffffc1217000 do_one_initcall+0x44/0x1d0 ? do_init_module+0x23/0x260 ? kmem_cache_alloc_trace+0xf5/0x200 do_init_module+0x5c/0x260 __do_sys_finit_module+0xb1/0x110 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 $ gdb vmlinux (...) (gdb) disassemble acpi_ns_validate_handle Dump of assembler code for function acpi_ns_validate_handle: 0xffffffff8156518b <+0>: call 0xffffffff81062c40 <__fentry__> 0xffffffff81565190 <+5>: lea -0x1(%rdi),%rdx 0xffffffff81565194 <+9>: mov %rdi,%rax 0xffffffff81565197 <+12>: cmp $0xfffffffffffffffd,%rdx 0xffffffff8156519b <+16>: jbe 0xffffffff815651a5 <acpi_ns_validate_handle+26> 0xffffffff8156519d <+18>: mov 0x16a288c(%rip),%rax # 0xffffffff82c07a30 <acpi_gbl_root_node> 0xffffffff815651a4 <+25>: ret 0xffffffff815651a5 <+26>: cmpb $0xf,0x8(%rdi) 0xffffffff815651a9 <+30>: je 0xffffffff815651ad <acpi_ns_validate_handle+34> 0xffffffff815651ab <+32>: xor %eax,%eax 0xffffffff815651ad <+34>: ret End of assembler dump. Again, we crash on cmpb $0xf,0x8(%rdi). %rdi == 0x48, it can only come from outside. All logs (+result of some gdb commands I didn't include here) can be found here: https://people.freedesktop.org/~mslusarz/tmp/ I'm using gcc 10.2.1 from Debian testing. I hope I didn't make a fatal mistake in my analysis...^W^W^W^W^W^W^W^W^W^W^W^W I hope I made a mistake in my analysis and the bug can be explained easily ;) Marcin