Hi Alex,

I had one issue while testing this patch on an RX 5700. After a gfx hang, a reset is executed. Switching to a VT and restarting gdm works fine, but the clocks seem messed up:
- lots of graphical artifacts (underflows?)
- pp_dpm_sclk and pp_dpm_socclk have strange values (see attached files)

dmesg output (from the gfx hang):

[ 169.755071] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[ 169.755173] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[ 174.874847] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=10034, emitted seq=10036
[ 174.874925] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 1724 thread gnome-shel:cs0 pid 1741
[ 174.874933] amdgpu 0000:0b:00.0: GPU reset begin!
[ 174.875192] ------------[ cut here ]------------
[ 174.875282] WARNING: CPU: 0 PID: 7 at drivers/gpu/drm/amd/amdgpu/../display/dc/dcn20/dcn20_resource.c:2969 dcn20_validate_bandwidth+0x87/0xe0 [amdgpu]
[ 174.875283] Modules linked in: binfmt_misc(E) nls_ascii(E) nls_cp437(E) vfat(E) fat(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) ledtrig_audio(E) snd_hda_codec_hdmi(E) snd_hda_intel(E) snd_intel_nhlt(E) snd_hda_codec(E) snd_hwdep(E) efi_pstore(E) snd_hda_core(E) snd_pcm(E) snd_timer(E) snd(E) ccp(E) xpad(E) wmi_bmof(E) mxm_wmi(E) evdev(E) joydev(E) ff_memless(E) efivars(E) soundcore(E) pcspkr(E) k10temp(E) sp5100_tco(E) rng_core(E) sg(E) wmi(E) button(E) acpi_cpufreq(E) uinput(E) parport_pc(E) ppdev(E) lp(E) parport(E) efivarfs(E) ip_tables(E) autofs4(E) ext4(E) crc32c_generic(E) crc16(E) mbcache(E) jbd2(E) dm_crypt(E) dm_mod(E) hid_generic(E) usbhid(E) hid(E) sd_mod(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) ghash_clmulni_intel(E) amdgpu(E) gpu_sched(E) ttm(E) drm_kms_helper(E) aesni_intel(E) glue_helper(E) xhci_pci(E) ahci(E) crypto_simd(E) libahci(E) cryptd(E) drm(E) xhci_hcd(E) i2c_piix4(E) libata(E) igb(E) usbcore(E) scsi_mod(E) dca(E) i2c_algo_bit(E) gpio_amdpt(E)
[ 174.875310] gpio_generic(E)
[ 174.875313] CPU: 0 PID: 7 Comm: kworker/0:1 Tainted: G E 5.4.0-rc7-02679-g9d664d914f0e #10
[ 174.875314] Hardware name: Gigabyte Technology Co., Ltd. X470 AORUS ULTRA GAMING/X470 AORUS ULTRA GAMING-CF, BIOS F6 01/25/2019
[ 174.875318] Workqueue: events drm_sched_job_timedout [gpu_sched]
[ 174.875404] RIP: 0010:dcn20_validate_bandwidth+0x87/0xe0 [amdgpu]
[ 174.875406] Code: 2d 44 22 a5 e8 1d 00 00 75 26 f2 0f 11 85 a8 21 00 00 31 d2 48 89 ee 4c 89 ef e8 d4 f5 ff ff 41 89 c4 22 85 e8 1d 00 00 75 4a <0f> 0b eb 02 75 d1 f2 0f 10 14 24 f2 0f 11 95 a8 21 00 00 e8 f1 4b
[ 174.875407] RSP: 0018:ffffa87880067a90 EFLAGS: 00010246
[ 174.875408] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000001e61
[ 174.875409] RDX: 0000000000001e60 RSI: ffff981f3ec2d880 RDI: 000000000002d880
[ 174.875409] RBP: ffff981f25a60000 R08: 0000000000000006 R09: 0000000000000000
[ 174.875410] R10: 0000000100000000 R11: 0000000100000001 R12: 0000000000000001
[ 174.875411] R13: ffff981f1af40000 R14: 0000000000000000 R15: ffff981f25a60000
[ 174.875412] FS: 0000000000000000(0000) GS:ffff981f3ec00000(0000) knlGS:0000000000000000
[ 174.875413] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 174.875414] CR2: 00007f42858f3000 CR3: 00000007f02ae000 CR4: 00000000003406f0
[ 174.875414] Call Trace:
[ 174.875498] dc_validate_global_state+0x25f/0x2d0 [amdgpu]
[ 174.875583] amdgpu_dm_atomic_check+0x8ff/0xf20 [amdgpu]
[ 174.875587] ? __ww_mutex_lock.isra.0+0x3a/0x760
[ 174.875590] ? _cond_resched+0x15/0x30
[ 174.875591] ? __ww_mutex_lock.isra.0+0x3a/0x760
[ 174.875606] drm_atomic_check_only+0x554/0x7e0 [drm]
[ 174.875620] ? drm_connector_list_iter_next+0x7d/0x90 [drm]
[ 174.875632] drm_atomic_commit+0x13/0x50 [drm]
[ 174.875640] drm_atomic_helper_disable_all+0x14c/0x160 [drm_kms_helper]
[ 174.875647] drm_atomic_helper_suspend+0x60/0xf0 [drm_kms_helper]
[ 174.875730] dm_suspend+0x1c/0x60 [amdgpu]
[ 174.875782] amdgpu_device_ip_suspend_phase1+0x81/0xe0 [amdgpu]
[ 174.875836] amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
[ 174.875923] amdgpu_device_pre_asic_reset+0x191/0x1a4 [amdgpu]
[ 174.876010] amdgpu_device_gpu_recover.cold+0x43a/0xbca [amdgpu]
[ 174.876084] amdgpu_job_timedout+0x103/0x130 [amdgpu]
[ 174.876088] drm_sched_job_timedout+0x7f/0xe0 [gpu_sched]
[ 174.876092] process_one_work+0x1b5/0x360
[ 174.876094] worker_thread+0x50/0x3c0
[ 174.876096] kthread+0xf9/0x130
[ 174.876097] ? process_one_work+0x360/0x360
[ 174.876099] ? kthread_park+0x90/0x90
[ 174.876100] ret_from_fork+0x22/0x40
[ 174.876103] ---[ end trace af4365804bf318ce ]---
[ 175.346937] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[ 175.600179] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[ 175.853418] [drm:gfx_v10_0_cp_gfx_enable [amdgpu]] *ERROR* failed to halt cp gfx
[ 175.874639] [drm] psp command (0x2) failed and response status is (0x117)
[ 178.963249] amdgpu 0000:0b:00.0: GPU reset succeeded, trying to resume
[ 178.963364] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[ 178.964227] [drm] PSP is resuming...
[ 179.130413] [drm:mod_hdcp_add_display_topology [amdgpu]] *ERROR* Failed to add display topology, DTM TA is not initialized.
[ 179.130492] [drm:mod_hdcp_add_display_topology [amdgpu]] *ERROR* Failed to add display topology, DTM TA is not initialized.
[ 179.138589] [drm] reserve 0xa00000 from 0x81fe400000 for PSP TMR
[ 179.330602] amdgpu 0000:0b:00.0: RAS: ras ta ucode is not available
[ 179.354974] amdgpu: [powerplay] SMU is resuming...
[ 179.357614] amdgpu: [powerplay] dpm has been disabled
[ 179.357617] amdgpu: [powerplay] SMU is resumed successfully!
[ 179.649168] [drm] kiq ring mec 2 pipe 1 q 0
[ 179.663833] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 179.663872] [drm] JPEG decode initialized successfully.
[ 179.663877] amdgpu 0000:0b:00.0: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 179.663878] amdgpu 0000:0b:00.0: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
[ 179.663880] amdgpu 0000:0b:00.0: ring comp_1.0.0 uses VM inv eng 4 on hub 0
[ 179.663881] amdgpu 0000:0b:00.0: ring comp_1.1.0 uses VM inv eng 5 on hub 0
[ 179.663882] amdgpu 0000:0b:00.0: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 179.663883] amdgpu 0000:0b:00.0: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 179.663884] amdgpu 0000:0b:00.0: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 179.663886] amdgpu 0000:0b:00.0: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 179.663887] amdgpu 0000:0b:00.0: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 179.663888] amdgpu 0000:0b:00.0: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 179.663889] amdgpu 0000:0b:00.0: ring kiq_2.1.0 uses VM inv eng 12 on hub 0
[ 179.663890] amdgpu 0000:0b:00.0: ring sdma0 uses VM inv eng 13 on hub 0
[ 179.663891] amdgpu 0000:0b:00.0: ring sdma1 uses VM inv eng 14 on hub 0
[ 179.663892] amdgpu 0000:0b:00.0: ring vcn_dec uses VM inv eng 0 on hub 1
[ 179.663893] amdgpu 0000:0b:00.0: ring vcn_enc0 uses VM inv eng 1 on hub 1
[ 179.663894] amdgpu 0000:0b:00.0: ring vcn_enc1 uses VM inv eng 4 on hub 1
[ 179.663895] amdgpu 0000:0b:00.0: ring jpeg_dec uses VM inv eng 5 on hub 1
[ 179.675804] [drm] recover vram bo from shadow start
[ 179.677342] [drm] recover vram bo from shadow done
[ 179.677344] [drm] Skip scheduling IBs!
[ 179.677345] [drm] Skip scheduling IBs!
[ 179.677357] amdgpu 0000:0b:00.0: GPU reset(1) succeeded!
[ 179.677367] [drm] Skip scheduling IBs!
[ 179.677372] [drm] Skip scheduling IBs!
[ 179.677375] [drm] Skip scheduling IBs!
[ 179.677378] [drm] Skip scheduling IBs!
[ 179.677380] [drm] Skip scheduling IBs!
[ 179.677382] [drm] Skip scheduling IBs!
[ 179.677383] [drm] Skip scheduling IBs!

Pierre-Eric

On 27/01/2020 20:37, Alex Deucher wrote:
> Has been working fine for a while.
>
> Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 1da03658891c..69248d1b2417 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3762,6 +3762,9 @@ bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev)
>  		case CHIP_VEGA12:
>  		case CHIP_RAVEN:
>  		case CHIP_ARCTURUS:
> +		case CHIP_NAVI10:
> +		case CHIP_NAVI14:
> +		case CHIP_NAVI12:
>  			break;
>  		default:
>  			goto disabled;
>
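For reference, the effect of the patch can be modelled with a small standalone sketch. This is a simplification, not the driver's code: `enum chip` and `should_recover()` are hypothetical names, the enum is a trimmed-down stand-in for the driver's chip list, and the real amdgpu_device_should_recover_gpu() has additional checks that are omitted here. It assumes the semantics of the amdgpu.gpu_recovery module parameter (-1 = auto, 0 = disable, 1 = enable): in auto mode only allowlisted chips recover, so adding CHIP_NAVI10 (the RX 5700 is a Navi10 part) is what makes the reset path in the log above run by default.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-in for the driver's chip list. */
enum chip {
	CHIP_VEGA12,
	CHIP_RAVEN,
	CHIP_ARCTURUS,
	CHIP_NAVI10,
	CHIP_NAVI14,
	CHIP_NAVI12,
	CHIP_OTHER,	/* any chip not on the allowlist */
};

/*
 * Simplified model of the recovery decision (an assumption, not driver code).
 * gpu_recovery is modelled on amdgpu.gpu_recovery:
 * -1 = auto, 0 = disable, any other value = enable.
 */
static bool should_recover(enum chip chip, int gpu_recovery)
{
	if (gpu_recovery == 0)
		return false;	/* recovery explicitly disabled */
	if (gpu_recovery != -1)
		return true;	/* recovery explicitly enabled on every chip */

	/* Auto mode: only allowlisted chips recover by default. */
	switch (chip) {
	case CHIP_VEGA12:
	case CHIP_RAVEN:
	case CHIP_ARCTURUS:
	case CHIP_NAVI10:	/* added by the patch */
	case CHIP_NAVI14:	/* added by the patch */
	case CHIP_NAVI12:	/* added by the patch */
		return true;
	default:
		return false;
	}
}
```

In this model, booting with amdgpu.gpu_recovery=0 keeps recovery off on Navi10 even with the patch applied, while the default of -1 now takes the recovery path.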
BEFORE reset
------------
# head pp_dpm_*
==> pp_dpm_dcefclk <==
0: 506Mhz *
1: 886Mhz
2: 1266Mhz

==> pp_dpm_fclk <==
0: 506Mhz
1: 950Mhz *
2: 1266Mhz

==> pp_dpm_mclk <==
0: 100Mhz
1: 500Mhz
2: 625Mhz
3: 875Mhz *

==> pp_dpm_pcie <==
0: 2.5GT/s, x16 619Mhz
1: 8.0GT/s, x16 619Mhz *

==> pp_dpm_sclk <==
0: 300Mhz
1: 800Mhz *
2: 1750Mhz

==> pp_dpm_socclk <==
0: 506Mhz
1: 950Mhz *
2: 1266Mhz

AFTER reset
-----------
# head pp_dpm_*
==> pp_dpm_dcefclk <==
0: 506Mhz *
1: 886Mhz
2: 1266Mhz

==> pp_dpm_fclk <==
0: 506Mhz *
1: 886Mhz
2: 1266Mhz

==> pp_dpm_mclk <==

==> pp_dpm_pcie <==
0: 2.5GT/s, x16 619Mhz
1: 8.0GT/s, x16 619Mhz *

==> pp_dpm_sclk <==
0: 0Mhz
1: 800Mhz *
2: 0Mhz

==> pp_dpm_socclk <==
0: 0Mhz
1: 506Mhz *
2: 0Mhz
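To spot the symptom in the AFTER listing programmatically, a small parser can summarise one clock file. This is a hypothetical helper (`parse_pp_dpm` and `struct dpm_summary` are illustrative names, not part of the driver or any tool). It assumes the "N: <freq>Mhz" line format with '*' marking the active level, as in the output above, and it is not meant for pp_dpm_pcie, whose lines have a different layout. Reading the file from sysfs is left out; the function takes the file contents as a string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct dpm_summary {
	int levels;		/* number of "N: <freq>Mhz" lines parsed */
	int zero_levels;	/* levels reporting 0 MHz */
	int current_level;	/* index of the line marked with '*', or -1 */
	unsigned current_mhz;	/* frequency of the active level */
};

/* Summarise the contents of a clock file such as pp_dpm_sclk. */
static struct dpm_summary parse_pp_dpm(const char *text)
{
	struct dpm_summary s = { 0, 0, -1, 0 };

	assert(text != NULL);
	while (*text != '\0') {
		const char *eol = strchr(text, '\n');
		size_t full = eol ? (size_t)(eol - text) : strlen(text);
		size_t len = full;
		char line[128];
		unsigned idx, mhz;

		if (len >= sizeof(line))
			len = sizeof(line) - 1;	/* truncate overlong lines */
		memcpy(line, text, len);
		line[len] = '\0';

		/* Lines like "1: 800Mhz *"; headers and blank lines fail to match. */
		if (sscanf(line, "%u: %uMhz", &idx, &mhz) == 2) {
			s.levels++;
			if (mhz == 0)
				s.zero_levels++;
			if (strchr(line, '*') != NULL) {
				s.current_level = (int)idx;
				s.current_mhz = mhz;
			}
		}
		text = eol ? eol + 1 : text + full;	/* advance to the next line */
	}
	return s;
}
```

Fed the AFTER contents of pp_dpm_sclk, it reports three levels, two of them at 0 MHz, with level 1 (800 MHz) active; the BEFORE contents give no zero levels.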
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx