On Mon, Jul 22, 2024 at 4:16 AM Jesse Zhang <jesse.zhang@xxxxxxx> wrote:
>
> Fix warning about kiq ring.
> Unlock kiq ring when queue reset fails.
>
> [ 285.999224] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
> [ 312.018425] watchdog: BUG: soft lockup - CPU#11 stuck for 26s! [kworker/u64:2:878]
> [ 312.018428] Modules linked in: amdgpu(E) amdxcp drm_exec gpu_sched drm_buddy drm_suballoc_helper drm_ttm_helper ttm drm_display_helper cec rc_core drm_kms_helper i2c_algo_bit rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs xt_conntrack nft_chain_nat r8153_ecm cdc_ether usbnet xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_multiport xt_addrtype nft_compat nf_tables br_netfilter libcrc32c nfnetlink bridge stp llc r8152 mii joydev input_leds overlay snd_hda_codec_hdmi edac_mce_amd snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi kvm_amd snd_hda_codec snd_hda_core snd_hwdep kvm hid_generic snd_pcm crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel snd_seq_midi snd_seq_midi_event snd_rawmidi crypto_simd usbhid cryptd hid snd_seq snd_pci_acp5x snd_seq_device snd_timer snd_rn_pci_acp3x rapl snd_acp_config snd_soc_acpi snd ccp snd_pci_acp3x wmi_bmof soundcore k10temp mac_hid sunrpc binfmt_misc sch_fq_codel msr parport_pc
> [ 312.018466] ppdev lp drm parport efi_pstore ip_tables x_tables autofs4 ucsi_ccg typec_ucsi typec nvme crc32_pclmul nvme_core xhci_pci i2c_designware_pci i2c_piix4 xhci_pci_renesas i2c_ccgx_ucsi video wmi
> [ 312.018475] CPU: 11 PID: 878 Comm: kworker/u64:2 Tainted: G E 6.8.0+ #171
> [ 312.018477] Hardware name: AMD Splinter/Splinter-GNR, BIOS WS54117N_140 01/16/2024
> [ 312.018478] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
> [ 312.018485] RIP: 0010:native_queued_spin_lock_slowpath+0x88/0x300
> [ 312.018490] Code: 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 41 8b 04 24 30 e4 09 d0 a9 00 01 ff ff 75 5e 85 c0 74 14 41 0f b6 04 24 84 c0 74 0b f3 90 <41> 0f b6 04 24 84 c0 75 f5 b8 01 00 00 00 66 41 89 04 24 5b 41 5c
> [ 312.018492] RSP: 0018:ffffa327c0de7b80 EFLAGS: 00000202
> [ 312.018493] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
> [ 312.018494] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ab913e16cf8
> [ 312.018495] RBP: ffffa327c0de7ba8 R08: 0000000000000000 R09: fffffa4007040000
> [ 312.018495] R10: ffffa327c0de7bb8 R11: 0000000000000040 R12: ffff8ab913e16cf8
> [ 312.018496] R13: ffff8ab913e00000 R14: ffff8ab913e00000 R15: ffff8ab913e00000
> [ 312.018497] FS: 0000000000000000(0000) GS:ffff8ab9956c0000(0000) knlGS:0000000000000000
> [ 312.018498] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 312.018498] CR2: 00007f44b24d319c CR3: 000000023b83c000 CR4: 0000000000750ef0
> [ 312.018499] PKRU: 55555554
> [ 312.018500] Call Trace:
> [ 312.018501] <IRQ>
> [ 312.018504] ? show_regs+0x6c/0x80
> [ 312.018508] ? watchdog_timer_fn+0x206/0x290
> [ 312.018511] ? __pfx_watchdog_timer_fn+0x10/0x10
> [ 312.018513] ? __hrtimer_run_queues+0xc8/0x220
> [ 312.018517] ? hrtimer_interrupt+0x10d/0x250
> [ 312.018519] ? __sysvec_apic_timer_interrupt+0x51/0x130
> [ 312.018522] ? sysvec_apic_timer_interrupt+0x7f/0x90
> [ 312.018525] </IRQ>
> [ 312.018525] <TASK>
> [ 312.018526] ? asm_sysvec_apic_timer_interrupt+0x1f/0x30
> [ 312.018529] ? native_queued_spin_lock_slowpath+0x88/0x300
> [ 312.018530] _raw_spin_lock+0x2d/0x40
> [ 312.018532] amdgpu_gfx_disable_kgq+0x6f/0x1d0 [amdgpu]
> [ 312.018646] gfx_v10_0_hw_fini+0x111/0x130 [amdgpu]
> [ 312.018742] gfx_v10_0_suspend+0x12/0x20 [amdgpu]
> [ 312.018832] amdgpu_device_ip_suspend_phase2+0x244/0x470 [amdgpu]
> [ 312.018909] amdgpu_device_ip_suspend+0x4b/0x90 [amdgpu]
> [ 312.018989] amdgpu_device_pre_asic_reset+0xda/0x4b0 [amdgpu]
> [ 312.019068] amdgpu_device_gpu_recover+0x319/0xe20 [amdgpu]
> [ 312.019147] amdgpu_job_timedout+0x177/0x280 [amdgpu]
> [ 312.019266] drm_sched_job_timedout+0x7c/0x100 [gpu_sched]
> [ 312.019269] process_scheduled_works+0x9a/0x3a0
> [ 312.019272] ? __pfx_worker_thread+0x10/0x10
> [ 312.019273] worker_thread+0x15f/0x2d0
> [ 312.019275] ? __pfx_worker_thread+0x10/0x10
> [ 312.019276] kthread+0xfb/0x130
> [ 312.019277] ? __pfx_kthread+0x10/0x10
> [ 312.019278] ret_from_fork+0x3d/0x60
> [ 312.019280] ? __pfx_kthread+0x10/0x10
> [ 312.019281] ret_from_fork_asm+0x1b/0x30
> [ 312.019284] </TASK>
>
> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@xxxxxxx>
> Signed-off-by: Jesse Zhang <Jesse.Zhang@xxxxxxx>

Good catch. I've squashed this into the appropriate patch and pushed an
updated branch here:
https://gitlab.freedesktop.org/agd5f/linux/-/commits/amd-staging-drm-next-queue-reset

Alex

> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index fde11159270c..59024fbf8c22 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -9478,6 +9478,7 @@ static int gfx_v10_0_reset_compute_ring(struct amdgpu_ring *ring,
>                                    0, 0);
>         amdgpu_ring_commit(kiq_ring);
>
> +       spin_unlock_irqrestore(&kiq->ring_lock, flags);
>         r = amdgpu_ring_test_ring(kiq_ring);
>         if (r)
>                 return r;
> @@ -9530,8 +9531,6 @@ static int gfx_v10_0_reset_compute_ring(struct amdgpu_ring *ring,
>         if (r)
>                 return r;
>
> -       spin_unlock_irqrestore(&kiq->ring_lock, flags);
> -
>         return amdgpu_ring_test_ring(ring);
>  }
>
> --
> 2.25.1
>
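
For anyone reading along, here is a rough userspace sketch of the pattern the patch addresses: an early return on a failed ring test that skips the unlock, so the next caller spins on the lock forever. This is only an illustration under stated assumptions. It uses a pthread mutex in place of the kernel spinlock, and `struct fake_ring`, `fake_ring_test()`, `reset_buggy()`, `reset_fixed()` and `lock_is_free()` are all hypothetical names, not amdgpu code.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the KIQ ring; not the amdgpu structure. */
struct fake_ring {
	pthread_mutex_t lock;	/* plays the role of kiq->ring_lock */
	bool test_fails;	/* simulate a failing ring test */
};

/* Hypothetical ring test: returns 0 on success, -1 on failure. */
static int fake_ring_test(struct fake_ring *ring)
{
	return ring->test_fails ? -1 : 0;
}

/* Pre-patch shape: the error path returns with the lock still held. */
static int reset_buggy(struct fake_ring *ring)
{
	int r;

	pthread_mutex_lock(&ring->lock);
	/* ... queue reset packets and commit the ring here ... */
	r = fake_ring_test(ring);
	if (r)
		return r;	/* lock is leaked on this path */
	pthread_mutex_unlock(&ring->lock);
	return 0;
}

/* Post-patch shape: drop the lock right after the commit, then test. */
static int reset_fixed(struct fake_ring *ring)
{
	pthread_mutex_lock(&ring->lock);
	/* ... queue reset packets and commit the ring here ... */
	pthread_mutex_unlock(&ring->lock);
	return fake_ring_test(ring);
}

/* Returns true if the lock can be taken, i.e. nobody leaked it. */
static bool lock_is_free(struct fake_ring *ring)
{
	if (pthread_mutex_trylock(&ring->lock) != 0)
		return false;
	pthread_mutex_unlock(&ring->lock);
	return true;
}
```

With `test_fails` set, `reset_buggy()` leaves the mutex held and `lock_is_free()` reports false afterwards, while `reset_fixed()` returns the same error with the lock released. The leaked lock is the userspace analogue of the soft lockup in `amdgpu_gfx_disable_kgq()` shown in the trace above.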