Hi,

On 18.07.24 16:06, Alex Deucher wrote:
This adds preliminary support for GC per queue reset. In this case, only the jobs currently in the queue are lost. If this fails, we fall back to a full adapter reset.
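The fallback described in the quoted paragraph can be sketched roughly as follows. This is a userspace toy and not driver code: `fake_gpu`, `fake_queue_reset`, `fake_adapter_reset` and `fake_recover` are hypothetical names made up for illustration, and the "reset" only clears counters. It is meant to show the control flow (try the per-queue reset first, escalate to a full adapter reset only if that fails), not how amdgpu implements it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_QUEUES 4

/* Which recovery path ended up being taken. */
enum reset_path { RESET_PER_QUEUE, RESET_FULL_ADAPTER };

/* Hypothetical stand-in for a GPU with several queues. */
struct fake_gpu {
    int  pending[NUM_QUEUES];  /* jobs outstanding on each queue */
    bool queue_reset_ok;       /* whether the simulated per-queue reset succeeds */
};

/* Try to reset a single queue; on success only that queue's jobs are lost. */
static bool fake_queue_reset(struct fake_gpu *gpu, size_t q)
{
    if (!gpu->queue_reset_ok)
        return false;
    gpu->pending[q] = 0;
    return true;
}

/* Full adapter reset: every queue loses its outstanding jobs. */
static void fake_adapter_reset(struct fake_gpu *gpu)
{
    for (size_t i = 0; i < NUM_QUEUES; i++)
        gpu->pending[i] = 0;
}

/* Per-queue reset first, full adapter reset as the fallback. */
static enum reset_path fake_recover(struct fake_gpu *gpu, size_t hung_queue)
{
    assert(gpu != NULL && hung_queue < NUM_QUEUES);

    if (fake_queue_reset(gpu, hung_queue))
        return RESET_PER_QUEUE;

    fake_adapter_reset(gpu);
    return RESET_FULL_ADAPTER;
}
```

In this toy model, a successful per-queue reset leaves the jobs on the other queues untouched, while the fallback path drops everything.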

First of all, thank you so much for working on this! It's great to finally see progress in making GPU resets better.

I've just taken this patchset (together with your other patchsets[1][2][3]) for a quick spin on my Navi21 with the GPU reset tests[4] I had written a while ago - the current patchset sadly seems to have some regressions WRT recovery there.

I ran the tests under my Plasma Wayland session once - this triggered a list double-add in drm_sched_stop (calltrace follows):

? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
? do_trap (arch/x86/kernel/traps.c:113 arch/x86/kernel/traps.c:154)
? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
? do_error_trap (./arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:175)
? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
? exc_invalid_op (arch/x86/kernel/traps.c:266)
? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:568)
? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
drm_sched_stop (./include/linux/list.h:151 ./include/linux/list.h:169 drivers/gpu/drm/scheduler/sched_main.c:617)
amdgpu_device_gpu_recover (drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:5808)
amdgpu_job_timedout (drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:103)
drm_sched_job_timedout (drivers/gpu/drm/scheduler/sched_main.c:569)
process_one_work (kernel/workqueue.c:2633)
worker_thread (kernel/workqueue.c:2700 (discriminator 2) kernel/workqueue.c:2787 (discriminator 2))
? __pfx_worker_thread (kernel/workqueue.c:2733)
kthread (kernel/kthread.c:388)
? __pfx_kthread (kernel/kthread.c:341)
ret_from_fork (arch/x86/kernel/process.c:147)
? __pfx_kthread (kernel/kthread.c:341)
ret_from_fork_asm (arch/x86/entry/entry_64.S:251)

When running the tests without a desktop environment active, the double-add disappeared, but the GPU reset still didn't go well - the TTY remained frozen and the kernel log contained a few messages like:

[drm] *ERROR* [CRTC:90:crtc-0] flip_done timed out

which I guess means at least the display subsystem is hung.

Hope this info is enough to repro/investigate.

Thanks,
Friedrich

[1] https://lore.kernel.org/amd-gfx/20240717203740.14059-1-alexander.deucher@xxxxxxx/T/#t
[2] https://lore.kernel.org/amd-gfx/20240717203847.14600-1-alexander.deucher@xxxxxxx/T/#t
[3] https://lore.kernel.org/amd-gfx/230ee72e-4f7f-4894-a789-2e1e5788344f@xxxxxxx/T/#t
[4] https://gitlab.steamos.cloud/holo/HangTestSuite

Alex Deucher (19):
  drm/amdgpu/mes: add API for legacy queue reset
  drm/amdgpu/mes11: add API for legacy queue reset
  drm/amdgpu/mes12: add API for legacy queue reset
  drm/amdgpu/mes: add API for user queue reset
  drm/amdgpu/mes11: add API for user queue reset
  drm/amdgpu/mes12: add API for user queue reset
  drm/amdgpu: add new ring reset callback
  drm/amdgpu: add per ring reset support (v2)
  drm/amdgpu/gfx11: add ring reset callbacks
  drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
  drm/amdgpu/gfx10: add ring reset callbacks
  drm/amdgpu/gfx10: rework reset sequence
  drm/amdgpu/gfx9: add ring reset callback
  drm/amdgpu/gfx9.4.3: add ring reset callback
  drm/amdgpu/gfx12: add ring reset callbacks
  drm/amdgpu/gfx12: fallback to driver reset compute queue directly
  drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
  drm/amdgpu/gfx11: add a mutex for the gfx semaphore
  drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()

Jiadong Zhu (13):
  drm/amdgpu/gfx11: wait for reset done before remap
  drm/amdgpu/gfx10: remap queue after reset successfully
  drm/amdgpu/gfx10: wait for reset done before remap
  drm/amdgpu/gfx9: remap queue after reset successfully
  drm/amdgpu/gfx9: wait for reset done before remap
  drm/amdgpu/gfx9.4.3: remap queue after reset successfully
  drm/amdgpu/gfx_9.4.3: wait for reset done before remap
  drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
  drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
  drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
  drm/amdgpu/mes: modify mes api for mmio queue reset
  drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
  drm/amdgpu/mes11: implement mmio queue reset for gfx11

Prike Liang (2):
  drm/amdgpu: increase the reset counter for the queue reset
  drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  18 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c    |  88 ++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h    |  37 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 158 ++++++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 117 +++++++++++++--
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h     |   3 +
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     |  95 ++++++++++++-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 126 +++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 125 +++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c     | 132 +++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/mes_v12_0.c     |  54 +++++++
 14 files changed, 930 insertions(+), 32 deletions(-)