Managed to extract more information from one of the lock ups:

[53798.406792] [drm:amdgpu_ctx_mgr_entity_fini [amdgpu]] *ERROR* ctx 00000000865ddb9b is still alive
[53798.406862] [drm:amdgpu_ctx_mgr_fini [amdgpu]] *ERROR* ctx 00000000865ddb9b is still alive

Apparently the messages above are unrelated, since there's a jump in the timestamps.

[135442.607022] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=834761, emitted seq=834763
[135442.607111] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[135442.607117] amdgpu 0000:01:00.0: GPU reset begin!
[135442.617041] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=11762398, emitted seq=11762400
[135442.617133] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 737 thread Xorg:cs0 pid 768
[135442.617139] amdgpu 0000:01:00.0: GPU reset begin!

Why does the GPU try to reset twice?

[135443.319047] amdgpu: [powerplay] failed to send message 5e ret is 0
[135443.806941] amdgpu: [powerplay] failed to send message 145 ret is 0
[135444.294821] amdgpu: [powerplay] failed to send message 146 ret is 0
[135444.782666] amdgpu: [powerplay] failed to send message 148 ret is 0
[135445.270683] amdgpu: [powerplay] failed to send message 145 ret is 0
[135445.759529] amdgpu: [powerplay] failed to send message 146 ret is 0
[135445.839244] [drm] REG_WAIT timeout 10us * 3500 tries - dce_mi_free_dmif line:636

Here it starts a trace. Should I report a bug even if I'm unable to reproduce the conditions that led to it?

[135445.839340] WARNING: CPU: 1 PID: 24740 at drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:277 generic_reg_wait.cold.3+0x2a/0x31 [amdgpu]
[135445.839341] Modules linked in: uinput nft_chain_route_ipv4 xt_CHECKSUM nft_chain_nat ipt_MASQUERADE nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 nft_counter xt_tcpudp nft_compat tun bridge stp llc nf_tables nfnetlink cpufreq_userspace cpufreq_powersave cpufreq_conservative edac_mce_amd kvm_amd ccp rng_core kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel binfmt_misc nls_ascii nls_cp437 vfat fat efi_pstore aesni_intel eeepc_wmi asus_wmi aes_x86_64 sparse_keymap crypto_simd rfkill cryptd snd_hda_codec_realtek glue_helper snd_hda_codec_generic pcspkr wmi_bmof joydev efivars k10temp ledtrig_audio snd_hda_codec_hdmi sp5100_tco amdgpu snd_hda_intel chash snd_hda_codec gpu_sched snd_hda_core ttm snd_hwdep snd_pcm sg snd_timer drm_kms_helper snd soundcore drm evdev button pcc_cpufreq acpi_cpufreq parport_pc ppdev lp parport efivarfs ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress raid6_pq libcrc32c crc32c_generic sd_mod
[135445.839368]  hid_generic usbhid hid ahci libahci igb xhci_pci crc32c_intel i2c_algo_bit dca libata xhci_hcd i2c_piix4 scsi_mod usbcore wmi video gpio_amdpt gpio_generic
[135445.839377] CPU: 1 PID: 24740 Comm: kworker/1:0 Not tainted 5.1.0-rc3 #1
[135445.839378] Hardware name: System manufacturer System Product Name/ROG STRIX B350-F GAMING, BIOS 4207 12/07/2018
[135445.839383] Workqueue: events drm_sched_job_timedout [gpu_sched]
[135445.839451] RIP: 0010:generic_reg_wait.cold.3+0x2a/0x31 [amdgpu]
[135445.839453] Code: 44 8b 44 24 68 48 8b 4c 24 60 48 c7 c7 a8 a2 8b c0 8b 54 24 58 8b 74 24 04 e8 3a 88 8c e4 41 83 7c 24 20 01 0f 84 51 8d fe ff <0f> 0b e9 4a 8d fe ff e8 84 61 ed ff 48 c7 c7 00 80 95 c0 e8 a8 0f
[135445.839454] RSP: 0018:ffffb8d7099777b0 EFLAGS: 00010297
[135445.839455] RAX: 0000000000000044 RBX: 0000000000000dad RCX: 0000000000000000
[135445.839456] RDX: 0000000000000000 RSI: ffff8db08ea56638 RDI: ffff8db08ea56638
[135445.839457] RBP: 0000000000000328 R08: 00000000000003fd R09: 0000000000000004
[135445.839458] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8db085256600
[135445.839459] R13: ffff8db086c4f000 R14: 0000000000000002 R15: 0000000000000010
[135445.839460] FS: 0000000000000000(0000) GS:ffff8db08ea40000(0000) knlGS:0000000000000000
[135445.839461] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[135445.839462] CR2: 00007fe3feab6000 CR3: 00000001c0df4000 CR4: 00000000003406e0
[135445.839463] Call Trace:
[135445.839539]  dce_mi_free_dmif+0xff/0x1e0 [amdgpu]
[135445.839607]  dce110_reset_hw_ctx_wrap+0x149/0x1e0 [amdgpu]
[135445.839674]  dce110_apply_ctx_to_hw+0x4b/0x490 [amdgpu]
[135445.839728]  ? amdgpu_pm_compute_clocks+0xb9/0x5d0 [amdgpu]
[135445.839800]  ? dm_pp_apply_display_requirements+0x196/0x1a0 [amdgpu]
[135445.839864]  dc_commit_state+0x2ae/0x550 [amdgpu]
[135445.839936]  ? dm_determine_update_type_for_commit+0x5a/0x410 [amdgpu]
[135445.840006]  amdgpu_dm_atomic_commit_tail+0x1fb/0x1930 [amdgpu]
[135445.840011]  ? kmalloc_order+0x14/0x30
[135445.840085]  ? dm_determine_update_type_for_commit+0x5a/0x410 [amdgpu]
[135445.840088]  ? _cond_resched+0x15/0x30
[135445.840090]  ? _cond_resched+0x15/0x30
[135445.840092]  ? __ww_mutex_lock.isra.9+0x3a/0x710
[135445.840101]  ? drm_atomic_helper_setup_commit+0x380/0x480 [drm_kms_helper]
[135445.840104]  ? __wake_up_common_lock+0x89/0xc0
[135445.840105]  ? _cond_resched+0x15/0x30
[135445.840107]  ? wait_for_completion_timeout+0x3b/0x1a0
[135445.840181]  ? amdgpu_dm_atomic_commit_tail+0x1930/0x1930 [amdgpu]
[135445.840254]  ? dm_plane_helper_prepare_fb+0x257/0x360 [amdgpu]
[135445.840263]  ? commit_tail+0x3d/0x70 [drm_kms_helper]
[135445.840271]  commit_tail+0x3d/0x70 [drm_kms_helper]
[135445.840279]  drm_atomic_helper_commit+0xb4/0x120 [drm_kms_helper]
[135445.840285]  drm_atomic_helper_disable_all+0x14d/0x160 [drm_kms_helper]
[135445.840292]  drm_atomic_helper_suspend+0xc4/0xf0 [drm_kms_helper]
[135445.840356]  dm_suspend+0x1c/0x60 [amdgpu]
[135445.840402]  amdgpu_device_ip_suspend_phase1+0x9a/0xd0 [amdgpu]
[135445.840448]  amdgpu_device_ip_suspend+0x1b/0x60 [amdgpu]
[135445.840508]  amdgpu_device_pre_asic_reset+0x1ef/0x204 [amdgpu]
[135445.840567]  amdgpu_device_gpu_recover+0x7b/0x743 [amdgpu]
[135445.840627]  amdgpu_job_timedout+0xfc/0x120 [amdgpu]
[135445.840631]  drm_sched_job_timedout+0x39/0x60 [gpu_sched]
[135445.840634]  process_one_work+0x1a7/0x3b0
[135445.840636]  worker_thread+0x30/0x390
[135445.840638]  ? create_worker+0x1a0/0x1a0
[135445.840639]  kthread+0x112/0x130
[135445.840640]  ? __kthread_parkme+0x70/0x70
[135445.840642]  ret_from_fork+0x22/0x40
[135445.840645] ---[ end trace 660b6e2adb0f7476 ]---

Later, powerplay seems to fail a bunch of times, with message IDs ranging from 5e to 186:

[135446.364392] amdgpu: [powerplay] failed to send message 148 ret is 0
[135446.887086] amdgpu: [powerplay] failed to send message 145 ret is 0
[135447.409654] amdgpu: [powerplay] failed to send message 146 ret is 0
[135447.933340] amdgpu: [powerplay] failed to send message 133 ret is 0
[135447.933342] amdgpu: [powerplay] VI should always have 2 performance levels
[135448.456153] amdgpu: [powerplay] failed to send message 5e ret is 0
[135448.978944] amdgpu: [powerplay] failed to send message 145 ret is 0
[135449.501800] amdgpu: [powerplay] failed to send message 146 ret is 0
[135450.024313] amdgpu: [powerplay] failed to send message 148 ret is 0
[135450.546863] amdgpu: [powerplay] failed to send message 145 ret is 0
[135451.069831] amdgpu: [powerplay] failed to send message 146 ret is 0
[135451.592819] amdgpu: [powerplay] failed to send message 16a ret is 0
[135452.115338] amdgpu: [powerplay] failed to send message 186 ret is 0
[135452.637920] amdgpu: [powerplay] failed to send message 54 ret is 0
[135453.160464] amdgpu: [powerplay] failed to send message 13d ret is 0
[135453.682876] amdgpu: [powerplay] failed to send message 14f ret is 0
[135454.205768] amdgpu: [powerplay] failed to send message 151 ret is 0
[135454.728890] amdgpu: [powerplay] failed to send message 135 ret is 0
[135455.252057] amdgpu: [powerplay] failed to send message 190 ret is 0
[135455.774989] amdgpu: [powerplay] failed to send message 63 ret is 0
[135456.297489] amdgpu: [powerplay] failed to send message 84 ret is 0

There seems to be a stray newline somewhere, since the powerplay messages didn't land on the same line as their timestamps. After that, the GPU tries to recover, but fails miserably:

[135456.299369] amdgpu 0000:01:00.0: GPU pci config reset
[135456.306844] amdgpu 0000:01:00.0: GPU reset succeeded, trying to resume
[135456.306964] [drm] PCIE gen 3 link speeds already enabled
[135456.310683] [drm] PCIE GART of 1024M enabled (table at 0x000000F400300000).
[135456.310768] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
[135456.590097] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[135456.590145] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v7_0> failed -110
[135456.590169] amdgpu 0000:01:00.0: GPU reset(2) failed

This line gets repeated about 10 times:

[135456.590183] [drm] Skip scheduling IBs!

And then the kernel tries to reset the GPU again:

[135456.590362] amdgpu 0000:01:00.0: GPU reset end with ret = -110
[135466.655925] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=834763, emitted seq=834763
[135466.656025] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[135466.656032] amdgpu 0000:01:00.0: GPU reset begin!
[135578.700184] INFO: task kworker/0:2:24837 blocked for more than 120 seconds.
[135578.700189] Tainted: G W 5.1.0-rc3 #1
[135578.700191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[135578.700194] kworker/0:2 D 0 24837 2 0x80000000
[135578.700206] Workqueue: events drm_sched_job_timedout [gpu_sched]
[135578.700208] Call Trace:
[135578.700217]  ? __schedule+0x2d4/0x870
[135578.700220]  ? __switch_to_asm+0x34/0x70
[135578.700223]  ? __switch_to_asm+0x40/0x70
[135578.700225]  schedule+0x28/0x70
[135578.700228]  schedule_timeout+0x26d/0x390
[135578.700230]  ? __schedule+0x2dc/0x870
[135578.700234]  dma_fence_default_wait+0x238/0x2a0
[135578.700236]  ? dma_fence_release+0x90/0x90
[135578.700239]  dma_fence_wait_timeout+0xdd/0x100
[135578.700243]  drm_sched_stop+0xf2/0x130 [gpu_sched]
[135578.700340]  amdgpu_device_pre_asic_reset+0x3f/0x204 [amdgpu]
[135578.700428]  amdgpu_device_gpu_recover+0x7b/0x743 [amdgpu]
[135578.700514]  amdgpu_job_timedout+0xfc/0x120 [amdgpu]
[135578.700571]  drm_sched_job_timedout+0x39/0x60 [gpu_sched]
[135578.700576]  process_one_work+0x1a7/0x3b0
[135578.700578]  worker_thread+0x30/0x390
[135578.700581]  ? create_worker+0x1a0/0x1a0
[135578.700583]  kthread+0x112/0x130
[135578.700585]  ? __kthread_parkme+0x70/0x70
[135578.700587]  ret_from_fork+0x22/0x40
[135578.700594] INFO: task kworker/0:1:29946 blocked for more than 120 seconds.
[135578.700596] Tainted: G W 5.1.0-rc3 #1
[135578.700598] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[135578.700599] kworker/0:1 D 0 29946 2 0x80000000
[135578.700605] Workqueue: events drm_sched_job_finish [gpu_sched]
[135578.700606] Call Trace:
[135578.700609]  ? __schedule+0x2d4/0x870
[135578.700611]  schedule+0x28/0x70
[135578.700613]  schedule_timeout+0x26d/0x390
[135578.700616]  wait_for_completion+0x11f/0x190
[135578.700619]  ? wake_up_q+0x80/0x80
[135578.700622]  __flush_work+0x130/0x1f0
[135578.700624]  ? worker_attach_to_pool+0x90/0x90
[135578.700626]  ? get_work_pool+0x40/0x40
[135578.700629]  __cancel_work_timer+0x10a/0x190
[135578.700632]  ? __switch_to+0x15b/0x460
[135578.700634]  ? __switch_to_asm+0x34/0x70
[135578.700636]  ? __switch_to_asm+0x40/0x70
[135578.700639]  drm_sched_job_finish+0x25/0x60 [gpu_sched]
[135578.700642]  process_one_work+0x1a7/0x3b0
[135578.700644]  worker_thread+0x30/0x390
[135578.700647]  ? create_worker+0x1a0/0x1a0
[135578.700648]  kthread+0x112/0x130
[135578.700650]  ? __kthread_parkme+0x70/0x70
[135578.700652]  ret_from_fork+0x22/0x40
[135699.405742] INFO: task kworker/0:2:24837 blocked for more than 241 seconds.
[135699.405748] Tainted: G W 5.1.0-rc3 #1
[135699.405750] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[135699.405752] kworker/0:2 D 0 24837 2 0x80000000
[135699.405764] Workqueue: events drm_sched_job_timedout [gpu_sched]
[135699.405765] Call Trace:
[135699.405773]  ? __schedule+0x2d4/0x870
[135699.405776]  ? __switch_to_asm+0x34/0x70
[135699.405779]  ? __switch_to_asm+0x40/0x70
[135699.405782]  schedule+0x28/0x70
[135699.405785]  schedule_timeout+0x26d/0x390
[135699.405787]  ? __schedule+0x2dc/0x870
[135699.405791]  dma_fence_default_wait+0x238/0x2a0
[135699.405794]  ? dma_fence_release+0x90/0x90
[135699.405796]  dma_fence_wait_timeout+0xdd/0x100
[135699.405800]  drm_sched_stop+0xf2/0x130 [gpu_sched]
[135699.405900]  amdgpu_device_pre_asic_reset+0x3f/0x204 [amdgpu]
[135699.405991]  amdgpu_device_gpu_recover+0x7b/0x743 [amdgpu]
[135699.406081]  amdgpu_job_timedout+0xfc/0x120 [amdgpu]
[135699.406086]  drm_sched_job_timedout+0x39/0x60 [gpu_sched]
[135699.406089]  process_one_work+0x1a7/0x3b0
[135699.406092]  worker_thread+0x30/0x390
[135699.406095]  ? create_worker+0x1a0/0x1a0
[135699.406096]  kthread+0x112/0x130
[135699.406098]  ? __kthread_parkme+0x70/0x70
[135699.406101]  ret_from_fork+0x22/0x40
[135699.406108] INFO: task kworker/0:1:29946 blocked for more than 241 seconds.
[135699.406110] Tainted: G W 5.1.0-rc3 #1
[135699.406111] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[135699.406113] kworker/0:1 D 0 29946 2 0x80000000
[135699.406118] Workqueue: events drm_sched_job_finish [gpu_sched]
[135699.406119] Call Trace:
[135699.406122]  ? __schedule+0x2d4/0x870
[135699.406125]  schedule+0x28/0x70
[135699.406127]  schedule_timeout+0x26d/0x390
[135699.406130]  wait_for_completion+0x11f/0x190
[135699.406133]  ? wake_up_q+0x80/0x80
[135699.406136]  __flush_work+0x130/0x1f0
[135699.406138]  ? worker_attach_to_pool+0x90/0x90
[135699.406141]  ? get_work_pool+0x40/0x40
[135699.406143]  __cancel_work_timer+0x10a/0x190
[135699.406145]  ? __switch_to+0x15b/0x460
[135699.406148]  ? __switch_to_asm+0x34/0x70
[135699.406150]  ? __switch_to_asm+0x40/0x70
[135699.406153]  drm_sched_job_finish+0x25/0x60 [gpu_sched]
[135699.406156]  process_one_work+0x1a7/0x3b0
[135699.406158]  worker_thread+0x30/0x390
[135699.406161]  ? create_worker+0x1a0/0x1a0
[135699.406162]  kthread+0x112/0x130
[135699.406164]  ? __kthread_parkme+0x70/0x70
[135699.406167]  ret_from_fork+0x22/0x40
[135699.406170] INFO: task kworker/2:0:29950 blocked for more than 120 seconds.
[135699.406171] Tainted: G W 5.1.0-rc3 #1
[135699.406173] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[135699.406174] kworker/2:0 D 0 29950 2 0x80000000
[135699.406179] Workqueue: events drm_sched_job_timedout [gpu_sched]
[135699.406180] Call Trace:
[135699.406183]  ? __schedule+0x2d4/0x870
[135699.406185]  schedule+0x28/0x70
[135699.406187]  schedule_preempt_disabled+0xa/0x10
[135699.406190]  __mutex_lock.isra.8+0x2d0/0x4a0
[135699.406277]  amdgpu_device_lock_adev+0x15/0x37 [amdgpu]
[135699.406364]  amdgpu_device_gpu_recover+0x69/0x743 [amdgpu]
[135699.406453]  amdgpu_job_timedout+0xfc/0x120 [amdgpu]
[135699.406457]  drm_sched_job_timedout+0x39/0x60 [gpu_sched]
[135699.406460]  process_one_work+0x1a7/0x3b0
[135699.406463]  worker_thread+0x30/0x390
[135699.406465]  ? create_worker+0x1a0/0x1a0
[135699.406467]  kthread+0x112/0x130
[135699.406468]  ? __kthread_parkme+0x70/0x70
[135699.406471]  ret_from_fork+0x22/0x40

At this point I decided that the system wasn't going to recover, so I tried to soft-reboot it, but it only ended up locked and I had to hard reboot.
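For the next lock up I'll probably try the kernel's SysRq trigger from the ssh session instead of hard rebooting. This is just a sketch of the usual procedure, assuming a root shell still responds even though the display is gone; the paths are the standard procfs ones, nothing specific to my setup:

  # sync dirty data to disk first
  echo s | sudo tee /proc/sysrq-trigger
  # remount all filesystems read-only
  echo u | sudo tee /proc/sysrq-trigger
  # force an immediate reboot, bypassing the (stuck) normal shutdown path
  echo b | sudo tee /proc/sysrq-trigger

Writing to /proc/sysrq-trigger only needs root and doesn't depend on the kernel.sysrq keyboard setting; if the kernel itself is wedged it won't help either, but it's worth trying before pulling the plug.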
I've found this bug[1] that seems related (another user watching videos). I will try removing vaapi and check whether that improves anything.

[1]: https://bugs.freedesktop.org/show_bug.cgi?id=106547
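A less drastic test than removing the va packages might be to just force software decoding in mpv for a while. A rough sketch, assuming hwdec is currently enabled in my mpv.conf and that the config lives in the default user location (the video file name is only an example):

  # see what the config currently asks for
  grep -i hwdec ~/.config/mpv/mpv.conf
  # play with hardware decoding explicitly disabled, overriding the config
  mpv --hwdec=no some-local-video.mkv

As far as I know Firefox doesn't use VA-API for decoding on this setup anyway, so mpv is the main thing to test.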
On Wed, Apr 3, 2019 at 6:38 PM Braiam <braiamp@xxxxxxxxx> wrote:
>
> On Wed, Apr 3, 2019 at 2:02 PM Alex Deucher <alexdeucher@xxxxxxxxx> wrote:
> >
> > On Wed, Apr 3, 2019 at 2:58 AM Braiam <braiamp@xxxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > I have a Sapphire Technology Hawaii XT (R9 290X) using amdgpu driver
> > > with kernel 5.1.0-rc3.
> > > The issue happens with current 4.19.0 debian testing, 4.20-trunk,
> > > 5.0.0-trunk and rc2 and 3.
> > >
> > > It usually happens when I'm reproducing video, but I haven't figured
> > > out a way to reproduce it. It happened once without reproducing. I'm
> > > aware that the support is experimental, but radeon driver doesn't
> > > seems capable of direct rendering on this card dropping to llvmepipe.
> >
> > Radeon should work out of the box. Maybe something is messed up with
> > your install?
>
> Doubtful, since I used 2200G without issues. Only change was using amdgpu
> due games being particularly slow.
>
> > >
> > > I had a ssh server installed in case I could log in while it crashes,
> > > and the only relevant line I found was:
> > >
> > > drm:amdgpu job timeout [amdgpu]] **ERROR** ring gfx timeout, signaled
> > > seq=399919, emitted seq=399921
> > >
> > > But that turned several bug reports which seems to have been fixed and
> > > the context and symptoms are too different to mine.
> >
> > You appear to be experiencing a GPU lockup. Unfortunately, there can
> > be many things that cause it, so it really helps to have a good
> > reproducer case. You might try a newer version of mesa or llvm. What
> > does your "reproducing video" work flow use? What apps, APIs are
> > involved?
>
> By "reproducing video" I mean watching videos from either Youtube/Netflix
> with Firefox or mpv for local files. But note that I've experienced at
> least one lock up without any video on screen (I don't remember if there
> was something paused on the background).
>
> Using mesa 18.3.4, and according to glxinfo DRM 3.30.0, LLVM 7.0.1. With
> vulkan, vdpau and va drivers on the same version. I gave up and removed
> any and every instance of the rocm package, since the module couldn't
> compile on the kernel.
>
> Another detail about the lock up: the screen doesn't freeze but no
> signal is emitted, so my monitor goes to sleep.
>
> Upgraded to mesa 19.0.1 and llvm 8.0.0, although I didn't see anything
> relevant on the changelogs that could affect my experience.
>
> >
> > Alex
> >
> > > I have tried forcing the amdgpu xorg driver with same results (was
> > > using radeon).
> > >
> > > --
> > > Braiam
> > > _______________________________________________
> > > amd-gfx mailing list
> > > amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> >
>
> --
> Braiam

--
Braiam
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx