[Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode.

Mikhail Gavrilov <mikhail.v.gavrilov@xxxxxxxxx> · Tue, 28 Jun 2022 14:21:02 +0500

Hi guys.
Somewhere between commits fdaf9a5840ac and babf0bb978e3 the GPU stopped
entering graphics mode. Instead I see a black screen with a constantly
glowing cursor. Demonstration: https://youtu.be/rGL4LsHMae4
The kernel log contains reports of hung tasks:
[  149.363465] rfkill: input handler disabled
[  249.072478] INFO: task (brt-dbus):1645 blocked for more than 122 seconds.
[  249.072515]       Tainted: G        W    L   --------  ---
5.19.0-0.rc0.20220526gitbabf0bb978e3.4.fc37.x86_64 #1
[  249.072520] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  249.072524] task:(brt-dbus)      state:D stack:14384 pid: 1645
ppid:     1 flags:0x00000002
[  249.072536] Call Trace:
[  249.072540]  <TASK>
[  249.072551]  __schedule+0x492/0x1640
[  249.072560]  ? lock_is_held_type+0xe8/0x140
[  249.072569]  ? find_held_lock+0x32/0x80
[  249.072584]  schedule+0x4e/0xb0
[  249.072591]  schedule_preempt_disabled+0x14/0x20
[  249.072597]  __mutex_lock+0x423/0x890
[  249.072608]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.072818]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.073010]  amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.073207]  amdgpu_flush+0x25/0x40 [amdgpu]
[  249.074088]  filp_close+0x31/0x70
[  249.074097]  __close_range+0x130/0x320
[  249.074108]  __x64_sys_close_range+0x13/0x20
[  249.074113]  do_syscall_64+0x5b/0x80
[  249.074120]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.074127]  ? do_syscall_64+0x67/0x80
[  249.074135]  ? do_syscall_64+0x67/0x80
[  249.074140]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.074147]  ? do_syscall_64+0x67/0x80
[  249.074154]  ? lock_is_held_type+0xe8/0x140
[  249.074164]  ? asm_exc_page_fault+0x27/0x30
[  249.074171]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.074178]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  249.074184] RIP: 0033:0x7fd71f54f97b
[  249.074208] RSP: 002b:00007fffc8e752a8 EFLAGS: 00000246 ORIG_RAX:
00000000000001b4
[  249.074215] RAX: ffffffffffffffda RBX: 00007fffc8e752b0 RCX: 00007fd71f54f97b
[  249.074220] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: 0000000000000027
[  249.074224] RBP: 00007fffc8e75330 R08: 0000000000000000 R09: 00007fffc8e75380
[  249.074228] R10: 00007fffc8e751f0 R11: 0000000000000246 R12: 0000000000000002
[  249.074232] R13: 00007fffc8e75340 R14: 0000000000000000 R15: 0000000000000002
[  249.074252]  </TASK>
[  249.074261] INFO: task (ostnamed):1718 blocked for more than 122 seconds.
[  249.074266]       Tainted: G        W    L   --------  ---
5.19.0-0.rc0.20220526gitbabf0bb978e3.4.fc37.x86_64 #1
[  249.074285] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  249.074289] task:(ostnamed)      state:D stack:14552 pid: 1718
ppid:     1 flags:0x00000006
[  249.074299] Call Trace:
[  249.074302]  <TASK>
[  249.074310]  __schedule+0x492/0x1640
[  249.074316]  ? lock_is_held_type+0xe8/0x140
[  249.074324]  ? find_held_lock+0x32/0x80
[  249.074339]  schedule+0x4e/0xb0
[  249.074346]  schedule_preempt_disabled+0x14/0x20
[  249.074352]  __mutex_lock+0x423/0x890
[  249.074361]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.074564]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.074754]  amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.074950]  amdgpu_flush+0x25/0x40 [amdgpu]
[  249.075133]  filp_close+0x31/0x70
[  249.075140]  __close_range+0x130/0x320
[  249.075150]  __x64_sys_close_range+0x13/0x20
[  249.075154]  do_syscall_64+0x5b/0x80
[  249.075164]  ? lock_is_held_type+0xe8/0x140
[  249.075175]  ? do_syscall_64+0x67/0x80
[  249.075180]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.075187]  ? do_syscall_64+0x67/0x80
[  249.075194]  ? lock_is_held_type+0xe8/0x140
[  249.075204]  ? asm_exc_page_fault+0x27/0x30
[  249.075210]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.075217]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  249.075222] RIP: 0033:0x7fd71f54f97b
[  249.075231] RSP: 002b:00007fffc8e752a8 EFLAGS: 00000246 ORIG_RAX:
00000000000001b4
[  249.075237] RAX: ffffffffffffffda RBX: 00007fffc8e752b0 RCX: 00007fd71f54f97b
[  249.075241] RDX: 0000000000000000 RSI: 00000000000000b9 RDI: 0000000000000027
[  249.075245] RBP: 00007fffc8e75330 R08: 0000000000000000 R09: 00007fffc8e75380
[  249.075249] R10: 00007fffc8e751f0 R11: 0000000000000246 R12: 0000000000000004
[  249.075253] R13: 00007fffc8e75340 R14: 0000000000000000 R15: 0000000000000003
[  249.075289]  </TASK>
[  249.075294] INFO: task (pcscd):1749 blocked for more than 122 seconds.
[  249.075298]       Tainted: G        W    L   --------  ---
5.19.0-0.rc0.20220526gitbabf0bb978e3.4.fc37.x86_64 #1
[  249.075302] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  249.075306] task:(pcscd)         state:D stack:14256 pid: 1749
ppid:     1 flags:0x00000002
[  249.075314] Call Trace:
[  249.075318]  <TASK>
[  249.075325]  __schedule+0x492/0x1640
[  249.075331]  ? lock_is_held_type+0xe8/0x140
[  249.075339]  ? find_held_lock+0x32/0x80
[  249.075353]  schedule+0x4e/0xb0
[  249.075360]  schedule_preempt_disabled+0x14/0x20
[  249.075365]  __mutex_lock+0x423/0x890
[  249.075375]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.075574]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.075764]  amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.075960]  amdgpu_flush+0x25/0x40 [amdgpu]
[  249.076143]  filp_close+0x31/0x70
[  249.076150]  __close_range+0x130/0x320
[  249.076160]  __x64_sys_close_range+0x13/0x20
[  249.076164]  do_syscall_64+0x5b/0x80
[  249.076169]  ? do_syscall_64+0x67/0x80
[  249.076175]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.076182]  ? do_syscall_64+0x67/0x80
[  249.076188]  ? do_syscall_64+0x67/0x80
[  249.076194]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.076201]  ? do_syscall_64+0x67/0x80
[  249.076206]  ? do_syscall_64+0x67/0x80
[  249.076211]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.076218]  ? do_syscall_64+0x67/0x80
[  249.076223]  ? lock_is_held_type+0xe8/0x140
[  249.076233]  ? asm_exc_page_fault+0x27/0x30
[  249.076239]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.076246]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  249.076251] RIP: 0033:0x7fd71f54f97b
[  249.076259] RSP: 002b:00007fffc8e752a8 EFLAGS: 00000246 ORIG_RAX:
00000000000001b4
[  249.076265] RAX: ffffffffffffffda RBX: 00007fffc8e752b0 RCX: 00007fd71f54f97b
[  249.076287] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: 000000000000004c
[  249.076291] RBP: 00007fffc8e75330 R08: 0000000000000000 R09: 00007fffc8e75380
[  249.076295] R10: 00007fffc8e751f0 R11: 0000000000000246 R12: 0000000000000003
[  249.076300] R13: 00007fffc8e75340 R14: 0000000000000000 R15: 0000000000000003
[  249.076319]  </TASK>
[  249.076323]
               Showing all locks held in the system:
[  249.076335] 1 lock held by khungtaskd/183:
[  249.076340]  #0: ffffffff84169060 (rcu_read_lock){....}-{1:2}, at:
debug_show_all_locks+0x15/0x16b
[  249.076364] 3 locks held by systemd-journal/868:
[  249.076376] 3 locks held by gnome-shell/1626:
[  249.076380]  #0: ffff9f2b248e4680
(&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0x3c/0x880
[  249.076394]  #1: ffff9f2b248e4728
(&sig->exec_update_lock){++++}-{3:3}, at: begin_new_exec+0x384/0xcc0
[  249.076407]  #2: ffff9f2b3a95ec58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.076609] 1 lock held by (brt-dbus)/1645:
[  249.076613]  #0: ffff9f2b3a95ec58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.076814] 1 lock held by (ostnamed)/1718:
[  249.076818]  #0: ffff9f2b3a95ec58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.077018] 1 lock held by (pcscd)/1749:
[  249.077022]  #0: ffff9f2b3a95ec58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]

[  249.077226] =============================================

[  335.093113] kworker/dying (297) used greatest stack depth: 11608 bytes left
[  335.093254] kworker/dying (241) used greatest stack depth: 11360 bytes left
Full kernel log is here: https://pastebin.com/0YHs6wyB
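
The lock dump above is easier to read when the entries are grouped by
lock address. The following is a minimal sketch, not part of the original
report: it assumes the log lines have been unwrapped (one entry per line,
as dmesg prints them), and the function names are made up for
illustration. Note that lockdep typically lists a lock under a task as
soon as the task starts acquiring it, so a blocked waiter can also show
up as "holding" it; the grouping alone does not prove which task owns
the mutex.

```python
import re
from collections import defaultdict

# Excerpt of the lock dump above, with the mail client's line wrapping undone.
SAMPLE = """\
[  249.076376] 3 locks held by gnome-shell/1626:
[  249.076380]  #0: ffff9f2b248e4680 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0x3c/0x880
[  249.076394]  #1: ffff9f2b248e4728 (&sig->exec_update_lock){++++}-{3:3}, at: begin_new_exec+0x384/0xcc0
[  249.076407]  #2: ffff9f2b3a95ec58 (&mgr->lock#3){+.+.}-{3:3}, at: amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.076609] 1 lock held by (brt-dbus)/1645:
[  249.076613]  #0: ffff9f2b3a95ec58 (&mgr->lock#3){+.+.}-{3:3}, at: amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.076814] 1 lock held by (ostnamed)/1718:
[  249.076818]  #0: ffff9f2b3a95ec58 (&mgr->lock#3){+.+.}-{3:3}, at: amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.077018] 1 lock held by (pcscd)/1749:
[  249.077022]  #0: ffff9f2b3a95ec58 (&mgr->lock#3){+.+.}-{3:3}, at: amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
"""

# "3 locks held by gnome-shell/1626:" -> task name and pid
TASK_RE = re.compile(r"\d+ locks? held by (.+)/(\d+):")
# "#2: ffff9f2b3a95ec58 (&mgr->lock#3){" -> lock address and class name
LOCK_RE = re.compile(r"#\d+: ([0-9a-f]{16}) \((.+?)\)\{")


def tasks_by_lock(log_text):
    """Map (lock address, lock class name) -> list of 'task/pid' entries."""
    owners = defaultdict(list)
    current = None
    for line in log_text.splitlines():
        task = TASK_RE.search(line)
        if task:
            current = f"{task.group(1)}/{task.group(2)}"
            continue
        lock = LOCK_RE.search(line)
        if lock and current:
            owners[(lock.group(1), lock.group(2))].append(current)
    return dict(owners)


def contended(log_text):
    """Return only the locks that are listed under more than one task."""
    return {k: v for k, v in tasks_by_lock(log_text).items() if len(v) > 1}


if __name__ == "__main__":
    for (addr, name), tasks in contended(SAMPLE).items():
        print(f"{addr} {name}: {', '.join(tasks)}")
```

On this excerpt the only lock shared between tasks is &mgr->lock#3, and
gnome-shell/1626 is the only task in the group that also has two exec
locks listed (cred_guard_mutex and exec_update_lock).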

Naturally, I tried to find the problematic commit via git bisect. It
was the longest bisect of my life: I had to build the kernel 565
times, and it took three weeks. That is why I am only writing now
rather than immediately. The most annoying part is that the three
weeks appear to have been wasted, because the exact commit was never
found. My bisect log can be found here: https://pastebin.com/AhLMNfyv
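
For a bisect log of this size, a quick tally of the good/bad/skip marks
shows how much of the range could not be tested. The following is a
minimal sketch, not taken from the original report: it assumes the text
is in the format printed by `git bisect log`, the function name is
hypothetical, and the sample hashes are placeholders rather than commits
from the real log.

```python
from collections import Counter

# Hypothetical excerpt in the format printed by `git bisect log`;
# the hashes are placeholders, not commits from the real bisect.
SAMPLE_LOG = """\
git bisect start
# good: [1111111] placeholder subject
git bisect good 1111111
# bad: [2222222] placeholder subject
git bisect bad 2222222
# skip: [3333333] placeholder subject
git bisect skip 3333333
# skip: [4444444] placeholder subject
git bisect skip 4444444
# good: [5555555] placeholder subject
git bisect good 5555555
"""


def summarize_bisect_log(text):
    """Count revisions marked good, bad, or skip in a `git bisect log` dump."""
    counts = Counter()
    for line in text.splitlines():
        parts = line.split()
        # Command lines look like: git bisect <verb> <rev> [<rev>...]
        # Comment lines (starting with "#") are ignored.
        if (
            len(parts) >= 4
            and parts[:2] == ["git", "bisect"]
            and parts[2] in ("good", "bad", "skip")
        ):
            counts[parts[2]] += len(parts) - 3
    return counts


if __name__ == "__main__":
    counts = summarize_bisect_log(SAMPLE_LOG)
    total = sum(counts.values())
    for verb in ("good", "bad", "skip"):
        print(f"{verb}: {counts[verb]}")
    if total:
        print(f"skipped share: {counts['skip'] / total:.0%}")
```

To run it on a real log, read the saved file into a string and pass it
to summarize_bisect_log() in place of SAMPLE_LOG.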

If you open it you will see a lot of skipped steps. At those steps I
hit a different problem: the kernel hangs during boot with the
following messages on screen:
[drm] amdgpu kernel modesetting enabled.
amdgpu: Ignoring ACPI CRAT on non-APU system
amdgpu: Virtual CRAT table created for CPU
amdgpu: Topology: Add CPU node
Here is a photo of the boot screen:
https://i.postimg.cc/DwVbYP4b/IMG-20220525-130140.jpg

And the following trace is written to the log:
[    8.173558] [drm] amdgpu kernel modesetting enabled.
[    8.196766] amdgpu: Ignoring ACPI CRAT on non-APU system
[    8.196846] amdgpu: Virtual CRAT table created for CPU
[    8.197015] amdgpu: Topology: Add CPU node
[    8.201791] Console: switching to colour dummy device 80x25
[    8.215200] page:00000000b17305fd refcount:0 mapcount:0
mapping:0000000000000000 index:0x0 pfn:0x1029c00
[    8.215224] head:00000000b17305fd order:0 compound_mapcount:-6459
compound_pincount:0
[    8.215243] flags: 0x17ffffc0010000(head|node=0|zone=2|lastcpupid=0x1fffff)
[    8.215261] raw: 0017ffffc0010000 ffffe6c480a70008 ffffe6c480a70008
0000000000000000
[    8.215279] raw: 0000000000000000 0000000000000000 00000000ffffffff
0000000000000000
[    8.215296] page dumped because: VM_BUG_ON_PAGE(compound &&
compound_order(page) != order)
[    8.215324] ------------[ cut here ]------------
[    8.215340] kernel BUG at mm/page_alloc.c:1329!
[    8.215358] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[    8.215363] CPU: 20 PID: 584 Comm: systemd-udevd Tainted: G
W         5.18.0-rc1-004-c6ed9f66eb70aeaac9998bd3552ada740d90e20c+
#357
[    8.215370] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[    8.215375] RIP: 0010:free_pcp_prepare+0x455/0x650
[    8.215381] Code: ff ff 48 8b 43 48 a8 01 0f 84 48 ff ff ff 48 83
e8 01 48 39 c3 0f 84 3b ff ff ff 48 c7 c6 08 f0 85 aa 48 89 df e8 5b
cb fc ff <0f> 0b 4c 89 ef 48 89 14 24 41 83 c6 01 e8 b9 ed ff ff 48 8b
14 24
[    8.215390] RSP: 0018:ffffbb7dc23779d8 EFLAGS: 00010296
[    8.215394] RAX: 000000000000004e RBX: ffffe6c480a70000 RCX: 0000000000000000
[    8.215399] RDX: 0000000000000001 RSI: ffffffffaa89db77 RDI: 00000000ffffffff
[    8.215402] RBP: 0000000000000009 R08: 0000000000000000 R09: ffffbb7dc23777c0
[    8.215406] R10: 0000000000000003 R11: ffffa08bae1fefe8 R12: 0000000000000000
[    8.215410] R13: ffffa07c817eadc0 R14: 00000000fffffe00 R15: ffffe6c480a70000
[    8.215414] FS:  00007f35b2f1ab40(0000) GS:ffffa08b5d200000(0000)
knlGS:0000000000000000
[    8.215419] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.215422] CR2: 00005631caec1878 CR3: 000000017d09c000 CR4: 0000000000350ee0
[    8.215427] Call Trace:
[    8.215429]  <TASK>
[    8.215431]  ? find_held_lock+0x32/0x80
[    8.215436]  free_unref_page+0x25/0x280
[    8.215440]  __vunmap+0x261/0x3d0
[    8.215444]  drm_fbdev_cleanup+0x6b/0xc0
[    8.215449]  drm_fbdev_fb_destroy+0x15/0x30
[    8.215453]  unregister_framebuffer+0x2e/0x40
[    8.215458]  drm_client_dev_unregister+0x6e/0xe0
[    8.215464]  drm_dev_unregister+0x34/0x90
[    8.215467]  drm_dev_unplug+0x24/0x40
[    8.215471]  simpledrm_remove+0x11/0x20
[    8.215475]  platform_remove+0x1f/0x40
[    8.215479]  device_release_driver_internal+0x1b8/0x220
[    8.215484]  bus_remove_device+0xef/0x160
[    8.215488]  device_del+0x18c/0x3f0
[    8.215492]  platform_device_del.part.0+0x13/0x70
[    8.215496]  platform_device_unregister+0x1c/0x30
[    8.215500]  drm_aperture_detach_drivers+0xa3/0xd0
[    8.215505]  drm_aperture_remove_conflicting_pci_framebuffers+0x3f/0x70
[    8.215511]  amdgpu_pci_probe+0x126/0x3c0 [amdgpu]
[    8.215672]  local_pci_probe+0x41/0x80
[    8.215677]  pci_device_probe+0xaa/0x200
[    8.215681]  really_probe+0x1a0/0x370
[    8.215685]  __driver_probe_device+0xfb/0x170
[    8.215689]  driver_probe_device+0x1f/0x90
[    8.215693]  __driver_attach+0xbe/0x1a0
[    8.215697]  ? __device_attach_driver+0xe0/0xe0
[    8.215701]  bus_for_each_dev+0x65/0x90
[    8.215705]  bus_add_driver+0x150/0x1f0
[    8.215709]  driver_register+0x89/0xd0
[    8.215713]  ? 0xffffffffc044e000
[    8.215719]  do_one_initcall+0x69/0x350
[    8.215724]  ? do_init_module+0x22/0x260
[    8.215728]  ? rcu_read_lock_sched_held+0x3b/0x70
[    8.215732]  ? trace_kmalloc+0x3b/0x100
[    8.215737]  ? kmem_cache_alloc_trace+0x1eb/0x3a0
[    8.215742]  do_init_module+0x4a/0x260
[    8.215745]  __do_sys_finit_module+0x93/0xf0
[    8.215751]  do_syscall_64+0x3a/0x80
[    8.215756]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[    8.215761] RIP: 0033:0x7f35b3acb62d
[    8.215765] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e
fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 c7 0c 00 f7 d8 64 89
01 48
[    8.215773] RSP: 002b:00007ffc39f6ef68 EFLAGS: 00000246 ORIG_RAX:
0000000000000139
[    8.215778] RAX: ffffffffffffffda RBX: 00005631cae55830 RCX: 00007f35b3acb62d
[    8.215782] RDX: 0000000000000000 RSI: 00005631cae6ceb0 RDI: 0000000000000011
[    8.215786] RBP: 00005631cae6ceb0 R08: 0000000000000000 R09: 00007f35b3b98c80
[    8.215790] R10: 0000000000000011 R11: 0000000000000246 R12: 0000000000020000
[    8.215794] R13: 00005631cae74660 R14: 0000000000000000 R15: 00005631cae805d0
[    8.215800]  </TASK>
[    8.215801] Modules linked in: amdgpu(+) drm_ttm_helper ttm
crct10dif_pclmul crc32_pclmul iommu_v2 crc32c_intel gpu_sched ucsi_ccg
nvme drm_buddy typec_ucsi ghash_clmulni_intel igb ccp drm_dp_helper
typec sp5100_tco nvme_core dca wmi ip6_tables ip_tables ipmi_devintf
ipmi_msghandler fuse
[    8.215825] ---[ end trace 0000000000000000 ]---
[    8.215828] RIP: 0010:free_pcp_prepare+0x455/0x650
[    8.215832] Code: ff ff 48 8b 43 48 a8 01 0f 84 48 ff ff ff 48 83
e8 01 48 39 c3 0f 84 3b ff ff ff 48 c7 c6 08 f0 85 aa 48 89 df e8 5b
cb fc ff <0f> 0b 4c 89 ef 48 89 14 24 41 83 c6 01 e8 b9 ed ff ff 48 8b
14 24
[    8.215841] RSP: 0018:ffffbb7dc23779d8 EFLAGS: 00010296
[    8.215844] RAX: 000000000000004e RBX: ffffe6c480a70000 RCX: 0000000000000000
[    8.215848] RDX: 0000000000000001 RSI: ffffffffaa89db77 RDI: 00000000ffffffff
[    8.215852] RBP: 0000000000000009 R08: 0000000000000000 R09: ffffbb7dc23777c0
[    8.215856] R10: 0000000000000003 R11: ffffa08bae1fefe8 R12: 0000000000000000
[    8.215860] R13: ffffa07c817eadc0 R14: 00000000fffffe00 R15: ffffe6c480a70000
[    8.215864] FS:  00007f35b2f1ab40(0000) GS:ffffa08b5d200000(0000)
knlGS:0000000000000000
[    8.215875] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.215879] CR2: 00005631caec1878 CR3: 000000017d09c000 CR4: 0000000000350ee0
[    8.216344] systemd-udevd (584) used greatest stack depth: 12776 bytes left
Full kernel log is here: https://pastebin.com/rDAjKpSg

Please help me get rid of the bug that crashes systemd-udevd so I can
find the exact commit that caused the GPU hang.

Or, based on the trace of the hung process, help fix the problem.

Thank you all in advance.

UPD:
I am still observing the issue on rc1 through rc4 :(

My hardware specs:
GPU: 6900XT
CPU: 3950X
M/B: ROG Strix X570-I Gaming
RAM: 64GB
SSD: Intel Optane 905P

-- 
Best Regards,
Mike Gavrilov.