Alex Deucher <alexdeucher@xxxxxxxxx> writes:

> On Thu, Jul 18, 2024 at 10:15 AM Alex Deucher <alexander.deucher@xxxxxxx> wrote:
>>
>> This adds preliminary support for GC per queue reset. In this
>> case, only the jobs currently in the queue are lost. If this
>> fails, we fall back to a full adapter reset.
>
> Also available here via git:
> https://gitlab.freedesktop.org/agd5f/linux/-/commits/amd-staging-drm-next-queue-reset

Just tested this, after encountering the double-add crash trying to
reset after a GPU hang. It doesn't seem to gracefully recover from this
particular GPU hang, but at least now it resets properly.

Still not going to attempt to run it against KDE / Plasma 6.1.3 on Arch,
as that loves to hang if there's any Xwayland involved in the GPU reset
event. However, under labwc-git with my own PR applied to it, it
recovers okay, though Xwayland eventually crashes and is restarted by
labwc.

Here's a dmesg log excerpt of the reset and recovery event:

[ 189.830630] amdgpu 0000:0a:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=52410, emitted seq=52412
[ 189.830642] amdgpu 0000:0a:00.0: amdgpu: Process information: process Stray-Win64-Shi pid 11560 thread vkd3d_queue pid 11719
[ 190.099191] amdgpu 0000:0a:00.0: amdgpu: GPU reset begin!
[ 190.457702] amdgpu 0000:0a:00.0: amdgpu: Dumping IP State
[ 190.459418] amdgpu 0000:0a:00.0: amdgpu: Dumping IP State Completed
[ 190.459420] amdgpu 0000:0a:00.0: amdgpu: MODE1 reset
[ 190.459423] amdgpu 0000:0a:00.0: amdgpu: GPU mode1 reset
[ 190.459483] amdgpu 0000:0a:00.0: amdgpu: GPU smu mode1 reset
[ 190.967464] amdgpu 0000:0a:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 190.967824] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[ 190.967912] [drm] VRAM is lost due to GPU reset!
[ 190.967914] amdgpu 0000:0a:00.0: amdgpu: PSP is resuming...
[ 191.042264] amdgpu 0000:0a:00.0: amdgpu: reserve 0xa00000 from 0x82fd000000 for PSP TMR
[ 191.143003] amdgpu 0000:0a:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 191.156566] amdgpu 0000:0a:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 191.156572] amdgpu 0000:0a:00.0: amdgpu: SMU is resuming...
[ 191.156576] amdgpu 0000:0a:00.0: amdgpu: smu driver if version = 0x0000000e, smu fw if version = 0x00000012, smu fw program = 0, version = 0x00413e00 (65.62.0)
[ 191.156580] amdgpu 0000:0a:00.0: amdgpu: SMU driver if version not matched
[ 191.156609] amdgpu 0000:0a:00.0: amdgpu: use vbios provided pptable
[ 191.215750] amdgpu 0000:0a:00.0: amdgpu: SMU is resumed successfully!
[ 191.217023] [drm] DMUB hardware initialized: version=0x02020020
[ 191.530005] [drm] kiq ring mec 2 pipe 1 q 0
[ 191.532863] amdgpu 0000:0a:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 191.532866] amdgpu 0000:0a:00.0: amdgpu: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
[ 191.532867] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 4 on hub 0
[ 191.532869] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 5 on hub 0
[ 191.532870] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 191.532871] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 191.532872] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 191.532874] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 191.532875] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 191.532876] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 191.532878] amdgpu 0000:0a:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 12 on hub 0
[ 191.532879] amdgpu 0000:0a:00.0: amdgpu: ring sdma0 uses VM inv eng 13 on hub 0
[ 191.532880] amdgpu 0000:0a:00.0: amdgpu: ring sdma1 uses VM inv eng 14 on hub 0
[ 191.532881] amdgpu 0000:0a:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[ 191.532883] amdgpu 0000:0a:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
[ 191.532884] amdgpu 0000:0a:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
[ 191.532885] amdgpu 0000:0a:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[ 191.536522] amdgpu 0000:0a:00.0: amdgpu: recover vram bo from shadow start
[ 191.555443] amdgpu 0000:0a:00.0: amdgpu: recover vram bo from shadow done
[ 191.555471] amdgpu 0000:0a:00.0: amdgpu: GPU reset(2) succeeded!
[ 191.555663] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

Yes, I can reliably hang my gfx ring if I run Stray with -dx12 switch
applied. In-game, though, not on the title screen.

> Alex
>
>>
>> Alex Deucher (19):
>>   drm/amdgpu/mes: add API for legacy queue reset
>>   drm/amdgpu/mes11: add API for legacy queue reset
>>   drm/amdgpu/mes12: add API for legacy queue reset
>>   drm/amdgpu/mes: add API for user queue reset
>>   drm/amdgpu/mes11: add API for user queue reset
>>   drm/amdgpu/mes12: add API for user queue reset
>>   drm/amdgpu: add new ring reset callback
>>   drm/amdgpu: add per ring reset support (v2)
>>   drm/amdgpu/gfx11: add ring reset callbacks
>>   drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
>>   drm/amdgpu/gfx10: add ring reset callbacks
>>   drm/amdgpu/gfx10: rework reset sequence
>>   drm/amdgpu/gfx9: add ring reset callback
>>   drm/amdgpu/gfx9.4.3: add ring reset callback
>>   drm/amdgpu/gfx12: add ring reset callbacks
>>   drm/amdgpu/gfx12: fallback to driver reset compute queue directly
>>   drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
>>   drm/amdgpu/gfx11: add a mutex for the gfx semaphore
>>   drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()
>>
>> Jiadong Zhu (13):
>>   drm/amdgpu/gfx11: wait for reset done before remap
>>   drm/amdgpu/gfx10: remap queue after reset successfully
>>   drm/amdgpu/gfx10: wait for reset done before remap
>>   drm/amdgpu/gfx9: remap queue after reset successfully
>>   drm/amdgpu/gfx9: wait for reset done before remap
>>   drm/amdgpu/gfx9.4.3: remap queue after reset successfully
>>   drm/amdgpu/gfx_9.4.3: wait for reset done before remap
>>   drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
>>   drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
>>   drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
>>   drm/amdgpu/mes: modify mes api for mmio queue reset
>>   drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
>>   drm/amdgpu/mes11: implement mmio queue reset for gfx11
>>
>> Prike Liang (2):
>>   drm/amdgpu: increase the reset counter for the queue reset
>>   drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)
>>
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |   6 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  18 +++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c    |  88 ++++++++++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h    |  37 +++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
>>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 158 ++++++++++++++++++++-
>>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 117 +++++++++++++--
>>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h     |   3 +
>>  drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     |  95 ++++++++++++-
>>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 126 +++++++++++++++-
>>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 125 +++++++++++++++-
>>  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c     | 132 +++++++++++++++++
>>  drivers/gpu/drm/amd/amdgpu/mes_v12_0.c     |  54 +++++++
>>  14 files changed, 930 insertions(+), 32 deletions(-)
>>
>> --
>> 2.45.2
>>
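
For anyone comparing reset timings across runs, the dmesg excerpt in this message can be summarized with a small script. The following is only a rough sketch, not part of the patch series: the function names (parse_dmesg, first_time, reset_timeline) are made up for illustration, and it assumes the three marker messages appear with the same wording as in the log quoted in this message. The SAMPLE text is copied from that log.

```python
import re

# Matches a dmesg line such as "[ 189.830630] message text".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Three lines taken verbatim from the log excerpt in this message.
SAMPLE = """\
[ 189.830630] amdgpu 0000:0a:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=52410, emitted seq=52412
[ 190.099191] amdgpu 0000:0a:00.0: amdgpu: GPU reset begin!
[ 191.555471] amdgpu 0000:0a:00.0: amdgpu: GPU reset(2) succeeded!
"""


def parse_dmesg(text):
    """Return a list of (timestamp_seconds, message) tuples."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            events.append((float(match.group(1)), match.group(2)))
    return events


def first_time(events, pattern):
    """Timestamp of the first message matching the regex, or None."""
    for timestamp, message in events:
        if re.search(pattern, message):
            return timestamp
    return None


def reset_timeline(text):
    """Return the durations between ring timeout, reset begin, and reset done.

    Returns None if any of the three markers is missing from the log.
    """
    events = parse_dmesg(text)
    timeout = first_time(events, r"ring \S+ timeout, signaled seq=")
    begin = first_time(events, r"GPU reset begin!")
    done = first_time(events, r"GPU reset\(\d+\) succeeded!")
    if timeout is None or begin is None or done is None:
        return None
    return {
        "timeout_to_begin": begin - timeout,
        "begin_to_done": done - begin,
        "total": done - timeout,
    }


if __name__ == "__main__":
    timeline = reset_timeline(SAMPLE)
    if timeline is None:
        print("no complete reset sequence found")
    else:
        for name, seconds in timeline.items():
            print(f"{name}: {seconds:.6f} s")
```

Feeding it the full output of dmesg instead of SAMPLE should work the same way, as long as the timestamps use the default bracketed seconds format.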