On Thu, Jul 7, 2022 at 2:50 PM Christian König <ckoenig.leichtzumerken@xxxxxxxxx> wrote: > > Am 07.07.22 um 02:20 schrieb Mikhail Gavrilov: > > On Tue, Jun 28, 2022 at 2:21 PM Mikhail Gavrilov > > <mikhail.v.gavrilov@xxxxxxxxx> wrote: > > Christian can you look why > > drm_aperture_remove_conflicting_pci_framebuffers cause this kernel bug > > on my machine? > > That looks like a problem outside of the amdgpu driver. > > What happens is that during load amdgpu requests whatever driver > (vesafb,vgafb or efifb) is currently handling the framebuffer to unload. > This unload in turn now crashes for some reason. > > My best suggestion is to try to bisect this. Hi Christian, if you read my initial post. You should see that I tried to bisect the issue. But it is very problematic because on each step I see different symptomes. And if mark different symptoms with skip step we got at end lot of possible commits: Here is my bisect from initial post: https://pastebin.com/AhLMNfyv If you want that I ended bisection successfully please help how to fix this oops: [ 8.291177] page:00000000af2b6334 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x102a000 [ 8.291202] head:00000000af2b6334 order:0 compound_mapcount:-1226 compound_pincount:0 [ 8.291221] flags: 0x17ffffc0010000(head|node=0|zone=2|lastcpupid=0x1fffff) [ 8.291239] raw: 0017ffffc0010000 fffffb35c0a80008 fffffb35c0a80008 0000000000000000 [ 8.291257] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 [ 8.291275] page dumped because: VM_BUG_ON_PAGE(compound && compound_order(page) != order) [ 8.291298] ------------[ cut here ]------------ [ 8.291309] kernel BUG at mm/page_alloc.c:1329! [ 8.291324] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 8.291328] CPU: 8 PID: 599 Comm: systemd-udevd Not tainted 5.18.0-rc2-003-790b45f1bc6736a8dd48ba5731b6871e0217311e+ #361 [ 8.291333] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022 [ 8.291338] RIP: 0010:free_pcp_prepare+0x58d/0x5a0 [ 8.291343] Code: c6 18 a2 85 a7 e8 d3 b7 fc ff 0f 0b 31 f6 48 89 df e8 97 cf 06 00 e9 29 ff ff ff 48 c7 c6 00 f1 85 a7 48 89 df e8 b3 b7 fc ff <0f> 0b 48 c7 c6 58 92 85 a7 e8 a5 b7 fc ff 0f 0b 0f 1f 00 0f 1f 44 [ 8.291351] RSP: 0018:ffffb07c023ab9d8 EFLAGS: 00010296 [ 8.291354] RAX: 000000000000004e RBX: fffffb35c0a80000 RCX: 0000000000000000 [ 8.291358] RDX: 0000000000000001 RSI: ffffffffa789dbaf RDI: 00000000ffffffff [ 8.291361] RBP: 0000000000000009 R08: 0000000000000000 R09: ffffb07c023ab7c0 [ 8.291365] R10: 0000000000000003 R11: ffff92ee2e2fffe8 R12: 0000000000000000 [ 8.291368] R13: ffff92ee2a55d180 R14: 00000000fffffe00 R15: fffffb35c0a80000 [ 8.291371] FS: 00007f80aa398680(0000) GS:ffff92edda200000(0000) knlGS:0000000000000000 [ 8.291376] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8.291379] CR2: 00007f80aa38e616 CR3: 000000017d726000 CR4: 0000000000350ee0 [ 8.291382] Call Trace: [ 8.291384] <TASK> [ 8.291386] ? find_held_lock+0x32/0x80 [ 8.291391] free_unref_page+0x25/0x2a0 [ 8.291395] __vunmap+0x261/0x3d0 [ 8.291399] drm_fbdev_cleanup+0x6b/0xc0 [ 8.291403] drm_fbdev_fb_destroy+0x15/0x30 [ 8.291407] unregister_framebuffer+0x2e/0x40 [ 8.291411] drm_client_dev_unregister+0x6e/0xe0 [ 8.291416] drm_dev_unregister+0x34/0x90 [ 8.291419] drm_dev_unplug+0x24/0x40 [ 8.291422] simpledrm_remove+0x11/0x20 [ 8.291426] platform_remove+0x1f/0x40 [ 8.291429] device_release_driver_internal+0x1b8/0x220 [ 8.291433] bus_remove_device+0xef/0x160 [ 8.291437] device_del+0x18c/0x3f0 [ 8.291440] platform_device_del.part.0+0x13/0x70 [ 8.291444] platform_device_unregister+0x1c/0x30 [ 8.291447] drm_aperture_detach_drivers+0xa3/0xd0 [ 8.291452] drm_aperture_remove_conflicting_pci_framebuffers+0x3f/0x70 [ 8.291457] amdgpu_pci_probe+0x126/0x3c0 [amdgpu] [ 8.291599] local_pci_probe+0x41/0x80 [ 8.291604] pci_device_probe+0xaa/0x200 [ 8.291607] really_probe+0x1a0/0x370 [ 8.291611] __driver_probe_device+0xfb/0x170 [ 8.291615] driver_probe_device+0x1f/0x90 [ 8.291618] __driver_attach+0xbe/0x1a0 [ 8.291622] ? __device_attach_driver+0xe0/0xe0 [ 8.291625] bus_for_each_dev+0x65/0x90 [ 8.291629] bus_add_driver+0x150/0x1f0 [ 8.291632] driver_register+0x89/0xd0 [ 8.291636] ? 0xffffffffc067b000 [ 8.291641] do_one_initcall+0x69/0x350 [ 8.291645] ? do_init_module+0x22/0x260 [ 8.291650] ? rcu_read_lock_sched_held+0x3b/0x70 [ 8.291654] ? trace_kmalloc+0x3b/0x100 [ 8.291658] ? kmem_cache_alloc_trace+0x1eb/0x3a0 [ 8.291662] do_init_module+0x4a/0x260 [ 8.291666] __do_sys_finit_module+0x93/0xf0 [ 8.291673] do_syscall_64+0x3a/0x80 [ 8.291677] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 8.291681] RIP: 0033:0x7f80aaf4507d [ 8.291685] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 dd 0c 00 f7 d8 64 89 01 48 [ 8.291694] RSP: 002b:00007ffe43973ce8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 [ 8.291699] RAX: ffffffffffffffda RBX: 000055ce603a3fe0 RCX: 00007f80aaf4507d [ 8.291702] RDX: 0000000000000000 RSI: 000055ce60395ac0 RDI: 0000000000000011 [ 8.291706] RBP: 000055ce60395ac0 R08: 0000000000000000 R09: 00007f80ab013c80 [ 8.291709] R10: 0000000000000011 R11: 0000000000000246 R12: 0000000000020000 [ 8.291713] R13: 000055ce60387c30 R14: 0000000000000000 R15: 000055ce6038ede0 [ 8.291718] </TASK> [ 8.291719] Modules linked in: amdgpu(+) drm_ttm_helper ttm crct10dif_pclmul crc32_pclmul iommu_v2 crc32c_intel gpu_sched ucsi_ccg typec_ucsi nvme drm_buddy igb ccp ghash_clmulni_intel typec drm_dp_helper sp5100_tco nvme_core dca wmi ip6_tables ip_tables ipmi_devintf ipmi_msghandler fuse [ 8.291740] ---[ end trace 0000000000000000 ]--- > Thanks for reporting. This bug has been fixed in > > > https://cgit.freedesktop.org/drm/drm/commit/?h=drm-fixes&id=ee7a69aa38d87a3bbced7b8245c732c05ed0c6ec > > The patch should reach mainline next week or so. Hi Thomas, thanks for the patch, this patch fixes oops But this patch does not fix the initial issue when a lot of processes blocked by mutex which applied by amdgpu_ctx_mgr_entity_flush [ 249.491425] INFO: task (brt-dbus):1634 blocked for more than 122 seconds. [ 249.491520] Tainted: G W L -------- --- 5.19.0-0.rc5.20220707git9f09069cde34.43.fc37.x86_64 #1 [ 249.491526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 249.491529] task:(brt-dbus) state:D stack:14504 pid: 1634 ppid: 1 flags:0x00000002 [ 249.491541] Call Trace: [ 249.491545] <TASK> [ 249.491556] __schedule+0x492/0x1620 [ 249.491565] ? lock_is_held_type+0xe8/0x140 [ 249.491575] ? find_held_lock+0x32/0x80 [ 249.491590] schedule+0x4e/0xb0 [ 249.491597] schedule_preempt_disabled+0x14/0x20 [ 249.491603] __mutex_lock+0x423/0x890 [ 249.491609] ? __lock_acquire+0x387/0x1ee0 [ 249.491618] ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu] [ 249.491849] ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu] [ 249.492040] amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu] [ 249.492237] amdgpu_flush+0x25/0x40 [amdgpu] [ 249.492420] filp_close+0x31/0x70 [ 249.492429] __close_range+0x1f3/0x490 [ 249.492441] __x64_sys_close_range+0x13/0x20 [ 249.492446] do_syscall_64+0x5b/0x80 [ 249.492452] ? lockdep_hardirqs_on+0x7d/0x100 [ 249.492459] ? do_syscall_64+0x67/0x80 [ 249.492467] ? asm_exc_page_fault+0x27/0x30 [ 249.492473] ? lockdep_hardirqs_on+0x7d/0x100 [ 249.492480] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 249.492486] RIP: 0033:0x7f789d4c23cb [ 249.492512] RSP: 002b:00007ffe10773198 EFLAGS: 00000246 ORIG_RAX: 00000000000001b4 [ 249.492518] RAX: ffffffffffffffda RBX: 00007ffe107731a0 RCX: 00007f789d4c23cb [ 249.492523] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: 0000000000000027 [ 249.492527] RBP: 00007ffe10773220 R08: 0000000000000000 R09: 00007ffe10773270 [ 249.492531] R10: 00007ffe107730e0 R11: 0000000000000246 R12: 0000000000000002 [ 249.492534] R13: 00007ffe10773230 R14: 0000000000000000 R15: 0000000000000002 [ 249.492555] </TASK> [ 249.492559] INFO: task (time-dir):1640 blocked for more than 122 seconds. [ 249.492564] Tainted: G W L -------- --- 5.19.0-0.rc5.20220707git9f09069cde34.43.fc37.x86_64 #1 [ 249.492568] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 249.492571] task:(time-dir) state:D stack:14504 pid: 1640 ppid: 1 flags:0x00000002 [ 249.492580] Call Trace: [ 249.492584] <TASK> [ 249.492592] __schedule+0x492/0x1620 [ 249.492597] ? lock_is_held_type+0xe8/0x140 [ 249.492605] ? find_held_lock+0x32/0x80 [ 249.492620] schedule+0x4e/0xb0 [ 249.492627] schedule_preempt_disabled+0x14/0x20 [ 249.492632] __mutex_lock+0x423/0x890 [ 249.492638] ? __lock_acquire+0x387/0x1ee0 [ 249.492646] ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu] [ 249.492859] ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu] [ 249.493049] amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu] [ 249.493245] amdgpu_flush+0x25/0x40 [amdgpu] [ 249.493428] filp_close+0x31/0x70 [ 249.493436] __close_range+0x1f3/0x490 [ 249.493447] __x64_sys_close_range+0x13/0x20 [ 249.493452] do_syscall_64+0x5b/0x80 [ 249.493457] ? lockdep_hardirqs_on+0x7d/0x100 [ 249.493465] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 249.493470] RIP: 0033:0x7f789d4c23cb [ 249.493478] RSP: 002b:00007ffe10773198 EFLAGS: 00000246 ORIG_RAX: 00000000000001b4 [ 249.493484] RAX: ffffffffffffffda RBX: 00007ffe107731a0 RCX: 00007f789d4c23cb [ 249.493488] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: 0000000000000027 [ 249.493492] RBP: 00007ffe10773220 R08: 0000000000000000 R09: 00007ffe10773270 [ 249.493496] R10: 00007ffe107730e0 R11: 0000000000000246 R12: 0000000000000002 [ 249.493499] R13: 00007ffe10773230 R14: 0000000000000000 R15: 0000000000000002 [ 249.493519] </TASK> [ 249.493528] Showing all locks held in the system: [ 249.493537] 1 lock held by khungtaskd/182: [ 249.493542] #0: ffffffffba168e20 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x16b [ 249.493565] 1 lock held by systemd-journal/879: [ 249.493575] 3 locks held by gnome-shell/1633: [ 249.493579] #0: ffff9bd4be4f8c00 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0x3c/0x880 [ 249.493593] #1: ffff9bd4be4f8ca8 (&sig->exec_update_lock){++++}-{3:3}, at: begin_new_exec+0x384/0xca0 [ 249.493607] #2: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at: amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu] [ 249.493807] 1 lock held by (brt-dbus)/1634: [ 249.493811] #0: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at: amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu] [ 249.494025] 1 lock held by (time-dir)/1640: [ 249.494029] #0: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at: amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu] [ 249.494229] 1 lock held by (ostnamed)/1723: [ 249.494233] #0: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at: amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu] [ 249.494432] 1 lock held by (pcscd)/1748: [ 249.494436] #0: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at: amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu] [ 249.494639] ============================================= Here is pastebin from initial post: https://pastebin.com/0YHs6wyB