On 24.07.24 11:20, Zhu, Jiadong wrote:
-----Original Message-----
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Alex
Deucher
Sent: Friday, July 19, 2024 9:40 PM
To: Friedrich Vock <friedrich.vock@xxxxxx>
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; amd-
gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH 00/34] GC per queue reset
On Thu, Jul 18, 2024 at 1:00 PM Friedrich Vock <friedrich.vock@xxxxxx>
wrote:
Hi,
On 18.07.24 16:06, Alex Deucher wrote:
This adds preliminary support for GC per queue reset. In this case,
only the jobs currently in the queue are lost. If this fails, we
fall back to a full adapter reset.
First of all, thank you so much for working on this! It's great to
finally see progress in making GPU resets better.
I've just taken this patchset (together with your other
patchsets[1][2][3]) for a quick spin on my
Navi21 with the GPU reset tests[4] I had written a while ago - the
current patchset sadly seems to have some regressions WRT recovery
there.
I ran the tests under my Plasma Wayland session once - this triggered
a list double-add in drm_sched_stop (calltrace follows):
I think this should fix the double add:
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 7107c4d3a3b6..555d3b671bdb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -88,6 +88,8 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 			drm_sched_start(&ring->sched, true);
 			goto exit;
 		}
+		if (amdgpu_ring_sched_ready(ring))
+			drm_sched_start(&ring->sched, true);
 	}
 	if (amdgpu_device_should_recover_gpu(ring->adev)) {
 ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
 ? do_trap (arch/x86/kernel/traps.c:113 arch/x86/kernel/traps.c:154)
 ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
 ? do_error_trap (./arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:175)
 ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
 ? exc_invalid_op (arch/x86/kernel/traps.c:266)
 ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
 ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:568)
 ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
 ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
 drm_sched_stop (./include/linux/list.h:151 ./include/linux/list.h:169 drivers/gpu/drm/scheduler/sched_main.c:617)
 amdgpu_device_gpu_recover (drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:5808)
 amdgpu_job_timedout (drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:103)
 drm_sched_job_timedout (drivers/gpu/drm/scheduler/sched_main.c:569)
 process_one_work (kernel/workqueue.c:2633)
 worker_thread (kernel/workqueue.c:2700 (discriminator 2) kernel/workqueue.c:2787 (discriminator 2))
 ? __pfx_worker_thread (kernel/workqueue.c:2733)
 kthread (kernel/kthread.c:388)
 ? __pfx_kthread (kernel/kthread.c:341)
 ret_from_fork (arch/x86/kernel/process.c:147)
 ? __pfx_kthread (kernel/kthread.c:341)
 ret_from_fork_asm (arch/x86/entry/entry_64.S:251)
When running the tests without a desktop environment active, the
double-add disappeared, but the GPU reset still didn't go well - the
TTY remained frozen and the kernel log contained a few messages like:
[drm] *ERROR* [CRTC:90:crtc-0] flip_done timed out
Hi Friedrich, we cannot reproduce the flip_done timeout on a dGPU.
Could you check whether the hang test runs on the integrated GPU or the dGPU? If it runs on the iGPU, could you try disabling the iGPU in the BIOS to see if that helps? Thanks.
Hi,
I double-checked with the iGPU disabled in BIOS and can still reproduce.
In case it matters, note that I had a typo in my original message: I'm
testing on Navi22, not 21 - sorry about that.
Also, the issue seems to occur on plain amd-staging-drm-next without
the per-queue reset patches as well, so this is actually an earlier,
unrelated regression.
I'll try bisecting later and will open a separate GitLab issue for this.
Regards,
Friedrich
Thanks,
Jiadong
I don't think the display hardware is hung, I think it's a fence signalling issue
after the reset. We are investigating some limitations we are seeing in the
handling of fences.
which I guess means at least the display subsystem is hung.
Hope this info is enough to repro/investigate.
Thanks for testing!
Alex
Thanks,
Friedrich
[1] https://lore.kernel.org/amd-gfx/20240717203740.14059-1-alexander.deucher@xxxxxxx/T/#t
[2] https://lore.kernel.org/amd-gfx/20240717203847.14600-1-alexander.deucher@xxxxxxx/T/#t
[3] https://lore.kernel.org/amd-gfx/230ee72e-4f7f-4894-a789-2e1e5788344f@amd.com/T/#t
[4] https://gitlab.steamos.cloud/holo/HangTestSuite
Alex Deucher (19):
drm/amdgpu/mes: add API for legacy queue reset
drm/amdgpu/mes11: add API for legacy queue reset
drm/amdgpu/mes12: add API for legacy queue reset
drm/amdgpu/mes: add API for user queue reset
drm/amdgpu/mes11: add API for user queue reset
drm/amdgpu/mes12: add API for user queue reset
drm/amdgpu: add new ring reset callback
drm/amdgpu: add per ring reset support (v2)
drm/amdgpu/gfx11: add ring reset callbacks
drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
drm/amdgpu/gfx10: add ring reset callbacks
drm/amdgpu/gfx10: rework reset sequence
drm/amdgpu/gfx9: add ring reset callback
drm/amdgpu/gfx9.4.3: add ring reset callback
drm/amdgpu/gfx12: add ring reset callbacks
drm/amdgpu/gfx12: fallback to driver reset compute queue directly
drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
drm/amdgpu/gfx11: add a mutex for the gfx semaphore
drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()
Jiadong Zhu (13):
drm/amdgpu/gfx11: wait for reset done before remap
drm/amdgpu/gfx10: remap queue after reset successfully
drm/amdgpu/gfx10: wait for reset done before remap
drm/amdgpu/gfx9: remap queue after reset successfully
drm/amdgpu/gfx9: wait for reset done before remap
drm/amdgpu/gfx9.4.3: remap queue after reset successfully
drm/amdgpu/gfx_9.4.3: wait for reset done before remap
drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
drm/amdgpu/mes: modify mes api for mmio queue reset
drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
drm/amdgpu/mes11: implement mmio queue reset for gfx11
Prike Liang (2):
drm/amdgpu: increase the reset counter for the queue reset
drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  18 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c    |  88 ++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h    |  37 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 158 ++++++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 117 +++++++++++++--
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h     |   3 +
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     |  95 ++++++++++++-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 126 +++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 125 +++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c     | 132 +++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/mes_v12_0.c     |  54 +++++++
 14 files changed, 930 insertions(+), 32 deletions(-)