6.7/regression/KASAN: null-ptr-deref in amdgpu_ras_reset_error_count+0x2d6

Hi,
another release cycle, and another regression.
After the latest kernel update in Fedora Rawhide, the GPU no longer
enters graphics mode on my laptop, an ASUS ROG Strix G15 Advantage
Edition G513QY-HQ007.
The following bug trace appears in the kernel log:
[   22.574698] ==================================================================
[   22.574704] BUG: KASAN: null-ptr-deref in amdgpu_ras_reset_error_count+0x2d6/0x3e0 [amdgpu]
[   22.575115] Read of size 4 at addr 0000000000000180 by task (udev-worker)/504

[   22.575125] CPU: 2 PID: 504 Comm: (udev-worker) Tainted: G        W    L     6.6.0-last-d2f51b3516dade79269ff45eae2a7668ae711b25+ #163
[   22.575135] Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.331 02/24/2023
[   22.575143] Call Trace:
[   22.575147]  <TASK>
[   22.575151]  dump_stack_lvl+0x76/0xd0
[   22.575158]  kasan_report+0xa6/0xe0
[   22.575165]  ? amdgpu_ras_reset_error_count+0x2d6/0x3e0 [amdgpu]
[   22.575320]  kasan_check_range+0x105/0x1b0
[   22.575320]  amdgpu_ras_reset_error_count+0x2d6/0x3e0 [amdgpu]
[   22.575320]  gmc_v9_0_late_init+0xcf/0x1b0 [amdgpu]
[   22.575320]  amdgpu_device_ip_late_init+0x103/0x7b0 [amdgpu]
[   22.575320]  amdgpu_device_init+0x7b33/0x8a90 [amdgpu]
[   22.575320]  ? __pfx_amdgpu_device_init+0x10/0x10 [amdgpu]
[   22.575320]  ? __pfx_pci_bus_read_config_word+0x10/0x10
[   22.575320]  ? do_pci_enable_device+0x22d/0x2a0
[   22.575320]  ? __pfx_pci_request_acs+0x1/0x10
[   22.575320]  ? _raw_spin_unlock_irqrestore+0x66/0x80
[   22.575320]  ? lockdep_hardirqs_on+0x81/0x110
[   22.575320]  ? __kasan_check_byte+0x13/0x50
[   22.575320]  amdgpu_driver_load_kms+0x1d/0x4b0 [amdgpu]
[   22.575320]  amdgpu_pci_probe+0x282/0xac0 [amdgpu]
[   22.575320]  ? __pfx_amdgpu_pci_probe+0x10/0x10 [amdgpu]
[   22.575320]  local_pci_probe+0xdd/0x190
[   22.575320]  pci_device_probe+0x23a/0x780
[   22.575320]  ? kernfs_add_one+0x326/0x490
[   22.575320]  ? kernfs_get.part.0+0x4c/0x70
[   22.575320]  ? __pfx_pci_device_probe+0x10/0x10
[   22.575320]  ? kernfs_create_link+0x16b/0x230
[   22.575320]  ? kernfs_put+0x1c/0x40
[   22.575320]  ? sysfs_do_create_link_sd+0x8e/0x100
[   22.575320]  really_probe+0x3e2/0xb80
[   22.575320]  __driver_probe_device+0x18c/0x450
[   22.575320]  driver_probe_device+0x4a/0x120
[   22.575320]  __driver_attach+0x1e5/0x4a0
[   22.575320]  ? __pfx___driver_attach+0x10/0x10
[   22.575320]  bus_for_each_dev+0x109/0x190
[   22.575320]  ? __pfx_bus_for_each_dev+0x10/0x10
[   22.575320]  bus_add_driver+0x2a1/0x570
[   22.575320]  driver_register+0x134/0x460
[   22.575320]  ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
[   22.575320]  do_one_initcall+0xd6/0x430
[   22.575320]  ? __pfx_do_one_initcall+0x10/0x10
[   22.575320]  ? kasan_unpoison+0x44/0x70
[   22.575320]  do_init_module+0x238/0x770
[   22.575320]  load_module+0x5581/0x6f10
[   22.575320]  ? __pfx_load_module+0x10/0x10
[   22.575320]  ? ima_post_read_file+0x189/0x1b0
[   22.575320]  ? __pfx_ima_post_read_file+0x10/0x10
[   22.575320]  ? __pfx_bpf_lsm_kernel_post_read_file+0x10/0x10
[   22.575320]  ? kernel_read_file+0x243/0x820
[   22.575320]  ? __pfx_kernel_read_file+0x10/0x10
[   22.575320]  ? init_module_from_file+0xd1/0x130
[   22.575320]  init_module_from_file+0xd1/0x130
[   22.575320]  ? __pfx_init_module_from_file+0x10/0x10
[   22.575320]  ? local_clock_noinstr+0x45/0xc0
[   22.575320]  ? do_raw_spin_unlock+0x58/0x1f0
[   22.575320]  idempotent_init_module+0x235/0x650
[   22.575320]  ? __pfx_idempotent_init_module+0x10/0x10
[   22.575320]  ? __pfx_bpf_lsm_capable+0x10/0x10
[   22.575320]  ? security_capable+0x74/0xb0
[   22.575320]  __x64_sys_finit_module+0xbe/0x130
[   22.575320]  do_syscall_64+0x64/0xe0
[   22.575320]  ? do_syscall_64+0x70/0xe0
[   22.575320]  ? lockdep_hardirqs_on+0x81/0x110
[   22.575320]  ? do_syscall_64+0x70/0xe0
[   22.575320]  ? do_syscall_64+0x70/0xe0
[   22.575320]  ? do_syscall_64+0x70/0xe0
[   22.575320]  ? do_syscall_64+0x70/0xe0
[   22.575320]  ? do_syscall_64+0x70/0xe0
[   22.575320]  ? do_syscall_64+0x70/0xe0
[   22.575320]  ? lockdep_hardirqs_on+0x81/0x110
[   22.575320]  ? do_syscall_64+0x70/0xe0
[   22.575320]  ? lockdep_hardirqs_on+0x81/0x110
[   22.575320]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[   22.575320] RIP: 0033:0x7f8ab56bbf8d
[   22.575320] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 4e 0c 00 f7 d8 64 89 01 48
[   22.575320] RSP: 002b:00007ffe2e836608 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[   22.575320] RAX: ffffffffffffffda RBX: 000055f55ef37f30 RCX: 00007f8ab56bbf8d
[   22.575320] RDX: 0000000000000000 RSI: 000055f55ef10950 RDI: 0000000000000015
[   22.575320] RBP: 00007ffe2e8366c0 R08: 0000000000000000 R09: 00007ffe2e836650
[   22.575320] R10: 0000000000000015 R11: 0000000000000246 R12: 000055f55ef10950
[   22.575320] R13: 0000000000020000 R14: 000055f55ef37240 R15: 000055f55ef393d0
[   22.575320]  </TASK>
[   22.575320] ==================================================================
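
A note on reading the report above: "Read of size 4 at addr
0000000000000180" is what one would expect if a 4-byte member at offset
0x180 is read through a structure pointer that is NULL. The trace alone
does not say which pointer or which member is involved, so the sketch
below is only an illustration in user-space C. The struct
fake_ras_ctx, its layout, and the helpers faulting_address() and
ctx_in_recovery() are all hypothetical and are not taken from the
amdgpu sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a driver context structure.
 * This is NOT the real layout of any amdgpu structure; the padding is
 * chosen only so that the 4-byte field lands at offset 0x180.
 */
struct fake_ras_ctx {
	unsigned char padding[0x180];
	int in_recovery;
};

/* Compile-time check that the illustration matches the reported address. */
static_assert(offsetof(struct fake_ras_ctx, in_recovery) == 0x180,
	      "field must sit at offset 0x180 for this illustration");

/* Address that a read of ->in_recovery touches for a given base pointer. */
static uintptr_t faulting_address(uintptr_t base)
{
	return base + offsetof(struct fake_ras_ctx, in_recovery);
}

/*
 * Guarded access: treat a missing context as "not in recovery"
 * instead of dereferencing a NULL pointer.
 */
static int ctx_in_recovery(const struct fake_ras_ctx *ctx)
{
	if (!ctx)
		return 0;
	return ctx->in_recovery;
}
```

With a base of 0, faulting_address() returns 0x180, the address from
the KASAN line. ctx_in_recovery(NULL) returns 0 instead of faulting,
which is the general shape of a NULL guard. Whether a guard like this
is the right fix for the driver is for the maintainers to decide.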


Using git bisect, I found that this commit is to blame:
❯ git bisect good
73582be11ac8f6d6765e185bf48f22efb9d28c3b is the first bad commit
commit 73582be11ac8f6d6765e185bf48f22efb9d28c3b
Author: Tao Zhou <tao.zhou1@xxxxxxx>
Date:   Thu Oct 12 14:33:37 2023 +0800

    drm/amdgpu: bypass RAS error reset in some conditions

    PMFW is responsible for RAS error reset in some conditions, driver can
    skip the operation.

    v2: add check for ras->in_recovery, it's set earlier than
    amdgpu_in_reset.

    v3: fix error in gpu reset check.

    Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx>
    Reviewed-by: Hawking Zhang <Hawking.Zhang@xxxxxxx>
    Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>

 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

I rebuilt the kernel from master with
73582be11ac8f6d6765e185bf48f22efb9d28c3b reverted, and my laptop started
working again.

All kernel logs and the build config are attached below.
The laptop's hardware probe is here: https://linux-hardware.org/?probe=85a38e7906

-- 
Best Regards,
Mike Gavrilov.

<<attachment: bisect-all-steps-dmesg.zip>>

<<attachment: dmesg-6.6.0-last-d2f51b3516dade79269ff45eae2a7668ae711b25.zip>>

Attachment: .config.zip
Description: Zip archive

