Re: [PATCH] drm/amdgpu: Fix gfx10 kiq ring_lock warning on full reset

Alex Deucher <alexdeucher@xxxxxxxxx> · Mon, 22 Jul 2024 15:36:31 -0400

On Mon, Jul 22, 2024 at 4:16 AM Jesse Zhang <jesse.zhang@xxxxxxx> wrote:
>
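> Fix a soft lockup on the kiq ring_lock during a full GPU reset.
> Release kiq->ring_lock right after committing the kiq ring, so the
> lock is not left held when the queue reset fails and returns early.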
>
> [  285.999224] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
> [  312.018425] watchdog: BUG: soft lockup - CPU#11 stuck for 26s! [kworker/u64:2:878]
> [  312.018428] Modules linked in: amdgpu(E) amdxcp drm_exec gpu_sched drm_buddy drm_suballoc_helper drm_ttm_helper ttm drm_display_helper cec rc_core drm_kms_helper i2c_algo_bit rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs xt_conntrack nft_chain_nat r8153_ecm cdc_ether usbnet xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_multiport xt_addrtype nft_compat nf_tables br_netfilter libcrc32c nfnetlink bridge stp llc r8152 mii joydev input_leds overlay snd_hda_codec_hdmi edac_mce_amd snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi kvm_amd snd_hda_codec snd_hda_core snd_hwdep kvm hid_generic snd_pcm crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel snd_seq_midi snd_seq_midi_event snd_rawmidi crypto_simd usbhid cryptd hid snd_seq snd_pci_acp5x snd_seq_device snd_timer snd_rn_pci_acp3x rapl snd_acp_config snd_soc_acpi snd ccp snd_pci_acp3x wmi_bmof soundcore k10temp mac_hid sunrpc binfmt_misc sch_fq_codel msr parport_pc
> [  312.018466]  ppdev lp drm parport efi_pstore ip_tables x_tables autofs4 ucsi_ccg typec_ucsi typec nvme crc32_pclmul nvme_core xhci_pci i2c_designware_pci i2c_piix4 xhci_pci_renesas i2c_ccgx_ucsi video wmi
> [  312.018475] CPU: 11 PID: 878 Comm: kworker/u64:2 Tainted: G            E      6.8.0+ #171
> [  312.018477] Hardware name: AMD Splinter/Splinter-GNR, BIOS WS54117N_140 01/16/2024
> [  312.018478] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
> [  312.018485] RIP: 0010:native_queued_spin_lock_slowpath+0x88/0x300
> [  312.018490] Code: 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 41 8b 04 24 30 e4 09 d0 a9 00 01 ff ff 75 5e 85 c0 74 14 41 0f b6 04 24 84 c0 74 0b f3 90 <41> 0f b6 04 24 84 c0 75 f5 b8 01 00 00 00 66 41 89 04 24 5b 41 5c
> [  312.018492] RSP: 0018:ffffa327c0de7b80 EFLAGS: 00000202
> [  312.018493] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
> [  312.018494] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ab913e16cf8
> [  312.018495] RBP: ffffa327c0de7ba8 R08: 0000000000000000 R09: fffffa4007040000
> [  312.018495] R10: ffffa327c0de7bb8 R11: 0000000000000040 R12: ffff8ab913e16cf8
> [  312.018496] R13: ffff8ab913e00000 R14: ffff8ab913e00000 R15: ffff8ab913e00000
> [  312.018497] FS:  0000000000000000(0000) GS:ffff8ab9956c0000(0000) knlGS:0000000000000000
> [  312.018498] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  312.018498] CR2: 00007f44b24d319c CR3: 000000023b83c000 CR4: 0000000000750ef0
> [  312.018499] PKRU: 55555554
> [  312.018500] Call Trace:
> [  312.018501]  <IRQ>
> [  312.018504]  ? show_regs+0x6c/0x80
> [  312.018508]  ? watchdog_timer_fn+0x206/0x290
> [  312.018511]  ? __pfx_watchdog_timer_fn+0x10/0x10
> [  312.018513]  ? __hrtimer_run_queues+0xc8/0x220
> [  312.018517]  ? hrtimer_interrupt+0x10d/0x250
> [  312.018519]  ? __sysvec_apic_timer_interrupt+0x51/0x130
> [  312.018522]  ? sysvec_apic_timer_interrupt+0x7f/0x90
> [  312.018525]  </IRQ>
> [  312.018525]  <TASK>
> [  312.018526]  ? asm_sysvec_apic_timer_interrupt+0x1f/0x30
> [  312.018529]  ? native_queued_spin_lock_slowpath+0x88/0x300
> [  312.018530]  _raw_spin_lock+0x2d/0x40
> [  312.018532]  amdgpu_gfx_disable_kgq+0x6f/0x1d0 [amdgpu]
> [  312.018646]  gfx_v10_0_hw_fini+0x111/0x130 [amdgpu]
> [  312.018742]  gfx_v10_0_suspend+0x12/0x20 [amdgpu]
> [  312.018832]  amdgpu_device_ip_suspend_phase2+0x244/0x470 [amdgpu]
> [  312.018909]  amdgpu_device_ip_suspend+0x4b/0x90 [amdgpu]
> [  312.018989]  amdgpu_device_pre_asic_reset+0xda/0x4b0 [amdgpu]
> [  312.019068]  amdgpu_device_gpu_recover+0x319/0xe20 [amdgpu]
> [  312.019147]  amdgpu_job_timedout+0x177/0x280 [amdgpu]
> [  312.019266]  drm_sched_job_timedout+0x7c/0x100 [gpu_sched]
> [  312.019269]  process_scheduled_works+0x9a/0x3a0
> [  312.019272]  ? __pfx_worker_thread+0x10/0x10
> [  312.019273]  worker_thread+0x15f/0x2d0
> [  312.019275]  ? __pfx_worker_thread+0x10/0x10
> [  312.019276]  kthread+0xfb/0x130
> [  312.019277]  ? __pfx_kthread+0x10/0x10
> [  312.019278]  ret_from_fork+0x3d/0x60
> [  312.019280]  ? __pfx_kthread+0x10/0x10
> [  312.019281]  ret_from_fork_asm+0x1b/0x30
> [  312.019284]  </TASK>
>
> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@xxxxxxx>
> Signed-off-by: Jesse Zhang <Jesse.Zhang@xxxxxxx>

Good catch.  I've squashed this into the appropriate patch and pushed
an updated branch here:
https://gitlab.freedesktop.org/agd5f/linux/-/commits/amd-staging-drm-next-queue-reset

Alex

> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index fde11159270c..59024fbf8c22 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -9478,6 +9478,7 @@ static int gfx_v10_0_reset_compute_ring(struct amdgpu_ring *ring,
>                                    0, 0);
>         amdgpu_ring_commit(kiq_ring);
>
> +       spin_unlock_irqrestore(&kiq->ring_lock, flags);
>         r = amdgpu_ring_test_ring(kiq_ring);
>         if (r)
>                 return r;
> @@ -9530,8 +9531,6 @@ static int gfx_v10_0_reset_compute_ring(struct amdgpu_ring *ring,
>         if (r)
>                 return r;
>
> -       spin_unlock_irqrestore(&kiq->ring_lock, flags);
> -
>         return amdgpu_ring_test_ring(ring);
>  }
>
> --
> 2.25.1
>
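
For anyone reading along, here is a minimal userspace sketch of the
lock pattern the diff above changes. It is not driver code: the
atomic_flag stands in for kiq->ring_lock, ring_test() stands in for
amdgpu_ring_test_ring(), and all function names below are hypothetical.
It only illustrates why an early return before the unlock leaves the
lock held, and why unlocking right after the commit avoids that.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for kiq->ring_lock: a C11 atomic_flag, not a kernel spinlock. */
static atomic_flag ring_lock = ATOMIC_FLAG_INIT;

/* Try to take the lock once; returns true if we got it. */
static bool ring_trylock(void)
{
	return !atomic_flag_test_and_set(&ring_lock);
}

static void ring_unlock(void)
{
	atomic_flag_clear(&ring_lock);
}

/* Hypothetical stand-in for amdgpu_ring_test_ring(): 0 on success, -ETIMEDOUT on failure. */
static int ring_test(bool healthy)
{
	return healthy ? 0 : -ETIMEDOUT;
}

/* Pre-patch shape: the unlock sits at the end, so the early return skips it. */
static int reset_ring_leaky(bool healthy)
{
	int r;

	if (!ring_trylock())
		return -EBUSY;
	/* ... emit and commit the reset packets here ... */
	r = ring_test(healthy);
	if (r)
		return r;	/* bug: ring_lock is still held */
	ring_unlock();
	return 0;
}

/* Post-patch shape: unlock right after the commit, before anything can fail. */
static int reset_ring_fixed(bool healthy)
{
	if (!ring_trylock())
		return -EBUSY;
	/* ... emit and commit the reset packets here ... */
	ring_unlock();
	return ring_test(healthy);
}

/* Exercise both shapes with a failing ring test and inspect the lock afterwards. */
static void demo_lock_leak(void)
{
	bool got;

	assert(reset_ring_fixed(false) == -ETIMEDOUT);
	got = ring_trylock();
	assert(got);		/* lock was released, so we can take it */
	ring_unlock();

	assert(reset_ring_leaky(false) == -ETIMEDOUT);
	got = ring_trylock();
	assert(!got);		/* lock leaked: a spinning locker would hang here */
	ring_unlock();		/* clean up so later callers start unlocked */
}
```

Calling demo_lock_leak() from a test harness shows the difference: after
the leaky variant fails, the next attempt to take the lock does not
succeed, which is the userspace analogue of amdgpu_gfx_disable_kgq()
spinning in _raw_spin_lock in the trace above.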