Re: [PATCH 00/34] GC per queue reset

Hi,

On 18.07.24 16:06, Alex Deucher wrote:
> This adds preliminary support for GC per queue reset.  In this
> case, only the jobs currently in the queue are lost.  If this
> fails, we fall back to a full adapter reset.
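
The fallback order described in the quoted text can be sketched roughly as
follows. This is only an illustration under stated assumptions: struct
fake_gpu, try_queue_reset(), try_adapter_reset() and recover_from_timeout()
are hypothetical names invented here, not functions from the amdgpu driver
or from this patchset.

```c
#include <assert.h>
#include <stdbool.h>

/* All names below are hypothetical and for illustration only;
 * they are not the amdgpu driver's API. */
enum recovery_path {
	RECOVERY_QUEUE_RESET,   /* only the hung queue was reset */
	RECOVERY_ADAPTER_RESET, /* the whole adapter had to be reset */
	RECOVERY_FAILED,        /* neither reset succeeded */
};

struct fake_gpu {
	bool queue_reset_works;   /* simulated outcome of a queue reset */
	bool adapter_reset_works; /* simulated outcome of an adapter reset */
	int queue_resets;         /* number of queue reset attempts */
	int adapter_resets;       /* number of adapter reset attempts */
};

/* Simulated per-queue reset: cheap, loses only the jobs in that queue. */
static bool try_queue_reset(struct fake_gpu *gpu)
{
	gpu->queue_resets++;
	return gpu->queue_reset_works;
}

/* Simulated full adapter reset: the heavier fallback. */
static bool try_adapter_reset(struct fake_gpu *gpu)
{
	gpu->adapter_resets++;
	return gpu->adapter_reset_works;
}

/* Try the narrow reset first and escalate only if it fails. */
static enum recovery_path recover_from_timeout(struct fake_gpu *gpu)
{
	if (try_queue_reset(gpu))
		return RECOVERY_QUEUE_RESET;
	if (try_adapter_reset(gpu))
		return RECOVERY_ADAPTER_RESET;
	return RECOVERY_FAILED;
}
```

The point of this ordering is that the adapter reset is attempted only after
the per-queue reset has reported failure, so a successful queue reset never
touches the rest of the device.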

First of all, thank you so much for working on this! It's great to
finally see progress in making GPU resets better.

I've just taken this patchset (together with your other
patchsets[1][2][3]) for a quick spin on my Navi21 with the GPU reset
tests[4] I wrote a while ago. Sadly, the current patchset seems to have
some regressions with respect to recovery there.

I ran the tests under my Plasma Wayland session once - this triggered a
list double-add in drm_sched_stop (calltrace follows):

? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
? do_trap (arch/x86/kernel/traps.c:113 arch/x86/kernel/traps.c:154)
? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
? do_error_trap (./arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:175)
? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
? exc_invalid_op (arch/x86/kernel/traps.c:266)
? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:568)
? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
drm_sched_stop (./include/linux/list.h:151 ./include/linux/list.h:169 drivers/gpu/drm/scheduler/sched_main.c:617)
amdgpu_device_gpu_recover (drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:5808)
amdgpu_job_timedout (drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:103)
drm_sched_job_timedout (drivers/gpu/drm/scheduler/sched_main.c:569)
process_one_work (kernel/workqueue.c:2633)
worker_thread (kernel/workqueue.c:2700 (discriminator 2) kernel/workqueue.c:2787 (discriminator 2))
? __pfx_worker_thread (kernel/workqueue.c:2733)
kthread (kernel/kthread.c:388)
? __pfx_kthread (kernel/kthread.c:341)
ret_from_fork (arch/x86/kernel/process.c:147)
? __pfx_kthread (kernel/kthread.c:341)
ret_from_fork_asm (arch/x86/entry/entry_64.S:251)
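
For readers unfamiliar with this class of report, here is a simplified
userspace imitation of the kind of sanity check a debug-enabled list
insertion performs. It is a sketch, not the kernel's code: struct node,
list_init(), add_is_valid() and checked_list_add() are hypothetical names,
and where the kernel reports the corruption and traps, this version just
returns false.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list, modelled loosely on the kernel's
 * struct list_head. All names here are hypothetical. */
struct node {
	struct node *prev;
	struct node *next;
};

/* An empty list is a head that points at itself. */
static void list_init(struct node *head)
{
	head->prev = head;
	head->next = head;
}

/* Check that inserting entry between prev and next is sane. */
static bool add_is_valid(const struct node *entry,
			 const struct node *prev,
			 const struct node *next)
{
	if (next->prev != prev) /* neighbours no longer linked: corruption */
		return false;
	if (prev->next != next) /* same check from the other side */
		return false;
	if (entry == prev || entry == next) /* entry is already here: double add */
		return false;
	return true;
}

/* Insert entry right after head, refusing insertions that fail the check.
 * Note the check only catches a double add when the entry is adjacent to
 * the insertion point. */
static bool checked_list_add(struct node *entry, struct node *head)
{
	struct node *prev = head;
	struct node *next = head->next;

	if (!add_is_valid(entry, prev, next))
		return false;

	next->prev = entry;
	entry->next = next;
	entry->prev = prev;
	prev->next = entry;
	return true;
}
```

Adding the same node twice in a row to the same head is rejected by the
third check, which corresponds to the double-add condition reported above.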

When running the tests without a desktop environment active, the
double-add disappeared, but the GPU reset still didn't go well - the TTY
remained frozen and the kernel log contained a few messages like:

[drm] *ERROR* [CRTC:90:crtc-0] flip_done timed out

which I guess means at least the display subsystem is hung.

Hope this info is enough to repro/investigate.

Thanks,
Friedrich

[1] https://lore.kernel.org/amd-gfx/20240717203740.14059-1-alexander.deucher@xxxxxxx/T/#t
[2] https://lore.kernel.org/amd-gfx/20240717203847.14600-1-alexander.deucher@xxxxxxx/T/#t
[3] https://lore.kernel.org/amd-gfx/230ee72e-4f7f-4894-a789-2e1e5788344f@xxxxxxx/T/#t
[4] https://gitlab.steamos.cloud/holo/HangTestSuite



Alex Deucher (19):
   drm/amdgpu/mes: add API for legacy queue reset
   drm/amdgpu/mes11: add API for legacy queue reset
   drm/amdgpu/mes12: add API for legacy queue reset
   drm/amdgpu/mes: add API for user queue reset
   drm/amdgpu/mes11: add API for user queue reset
   drm/amdgpu/mes12: add API for user queue reset
   drm/amdgpu: add new ring reset callback
   drm/amdgpu: add per ring reset support (v2)
   drm/amdgpu/gfx11: add ring reset callbacks
   drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
   drm/amdgpu/gfx10: add ring reset callbacks
   drm/amdgpu/gfx10: rework reset sequence
   drm/amdgpu/gfx9: add ring reset callback
   drm/amdgpu/gfx9.4.3: add ring reset callback
   drm/amdgpu/gfx12: add ring reset callbacks
   drm/amdgpu/gfx12: fallback to driver reset compute queue directly
   drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
   drm/amdgpu/gfx11: add a mutex for the gfx semaphore
   drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()

Jiadong Zhu (13):
   drm/amdgpu/gfx11: wait for reset done before remap
   drm/amdgpu/gfx10: remap queue after reset successfully
   drm/amdgpu/gfx10: wait for reset done before remap
   drm/amdgpu/gfx9: remap queue after reset successfully
   drm/amdgpu/gfx9: wait for reset done before remap
   drm/amdgpu/gfx9.4.3: remap queue after reset successfully
   drm/amdgpu/gfx_9.4.3: wait for reset done before remap
   drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
   drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
   drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
   drm/amdgpu/mes: modify mes api for mmio queue reset
   drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
   drm/amdgpu/mes11: implement mmio queue reset for gfx11

Prike Liang (2):
   drm/amdgpu: increase the reset counter for the queue reset
   drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)

  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |   6 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  18 +++
  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c    |  88 ++++++++++++
  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h    |  37 +++++
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 158 ++++++++++++++++++++-
  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 117 +++++++++++++--
  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h     |   3 +
  drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     |  95 ++++++++++++-
  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 126 +++++++++++++++-
  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 125 +++++++++++++++-
  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c     | 132 +++++++++++++++++
  drivers/gpu/drm/amd/amdgpu/mes_v12_0.c     |  54 +++++++
  14 files changed, 930 insertions(+), 32 deletions(-)
