On 24.07.24 11:20, Zhu, Jiadong wrote:
-----Original Message-----
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Alex
Deucher
Sent: Friday, July 19, 2024 9:40 PM
To: Friedrich Vock <friedrich.vock@xxxxxx>
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; amd-
gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH 00/34] GC per queue reset
On Thu, Jul 18, 2024 at 1:00 PM Friedrich Vock <friedrich.vock@xxxxxx>
wrote:
Hi,
On 18.07.24 16:06, Alex Deucher wrote:
This adds preliminary support for GC per queue reset. In this case,
only the jobs currently in the queue are lost. If this fails, we
fall back to a full adapter reset.
First of all, thank you so much for working on this! It's great to
finally see progress in making GPU resets better.
I've just taken this patchset (together with your other
patchsets[1][2][3]) for a quick spin on my
Navi21 with the GPU reset tests[4] I had written a while ago - the
current patchset sadly seems to have some regressions WRT recovery
there.
I ran the tests under my Plasma Wayland session once - this triggered
a list double-add in drm_sched_stop (calltrace follows):
I think this should fix the double add:
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 7107c4d3a3b6..555d3b671bdb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -88,6 +88,8 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 			drm_sched_start(&ring->sched, true);
 			goto exit;
 		}
+		if (amdgpu_ring_sched_ready(ring))
+			drm_sched_start(&ring->sched, true);
 	}
 	if (amdgpu_device_should_recover_gpu(ring->adev)) {
 ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
 ? do_trap (arch/x86/kernel/traps.c:113 arch/x86/kernel/traps.c:154)
 ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
 ? do_error_trap (./arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:175)
 ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
 ? exc_invalid_op (arch/x86/kernel/traps.c:266)
 ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
 ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:568)
 ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
 ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
 drm_sched_stop (./include/linux/list.h:151 ./include/linux/list.h:169 drivers/gpu/drm/scheduler/sched_main.c:617)
 amdgpu_device_gpu_recover (drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:5808)
 amdgpu_job_timedout (drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:103)
 drm_sched_job_timedout (drivers/gpu/drm/scheduler/sched_main.c:569)
 process_one_work (kernel/workqueue.c:2633)
 worker_thread (kernel/workqueue.c:2700 (discriminator 2) kernel/workqueue.c:2787 (discriminator 2))
 ? __pfx_worker_thread (kernel/workqueue.c:2733)
 kthread (kernel/kthread.c:388)
 ? __pfx_kthread (kernel/kthread.c:341)
 ret_from_fork (arch/x86/kernel/process.c:147)
 ? __pfx_kthread (kernel/kthread.c:341)
 ret_from_fork_asm (arch/x86/entry/entry_64.S:251)
When running the tests without a desktop environment active, the
double-add disappeared, but the GPU reset still didn't go well - the
TTY remained frozen and the kernel log contained a few messages like:
[drm] *ERROR* [CRTC:90:crtc-0] flip_done timed out
Hi Friedrich, we cannot reproduce the flip_done timeout on a dGPU.
Could you check whether the hang test runs on the integrated GPU or the dGPU? If it runs on the iGPU, could you try disabling the iGPU in the BIOS to see if that helps? Thanks.
Hi,
I double-checked with the iGPU disabled in BIOS and can still reproduce.
In case it matters, note that I had a typo in my original message: I'm
testing on Navi22, not 21 - sorry about that.
Also, the issue seems to occur on plain amd-staging-drm-next without
the per-queue reset patches as well, so this is actually an earlier,
unrelated regression.
I'll try bisecting later and will open a separate GitLab issue for this.
Regards,
Friedrich
Thanks,
Jiadong
I don't think the display hardware is hung, I think it's a fence signalling issue
after the reset. We are investigating some limitations we are seeing in the
handling of fences.
which I guess means at least the display subsystem is hung.
Hope this info is enough to repro/investigate.
Thanks for testing!
Alex
Thanks,
Friedrich
[1] https://lore.kernel.org/amd-gfx/20240717203740.14059-1-alexander.deucher@xxxxxxx/T/#t
[2] https://lore.kernel.org/amd-gfx/20240717203847.14600-1-alexander.deucher@xxxxxxx/T/#t
[3] https://lore.kernel.org/amd-gfx/230ee72e-4f7f-4894-a789-2e1e5788344f@amd.com/T/#t
[4] https://gitlab.steamos.cloud/holo/HangTestSuite
Alex Deucher (19):
drm/amdgpu/mes: add API for legacy queue reset
drm/amdgpu/mes11: add API for legacy queue reset
drm/amdgpu/mes12: add API for legacy queue reset
drm/amdgpu/mes: add API for user queue reset
drm/amdgpu/mes11: add API for user queue reset
drm/amdgpu/mes12: add API for user queue reset
drm/amdgpu: add new ring reset callback
drm/amdgpu: add per ring reset support (v2)
drm/amdgpu/gfx11: add ring reset callbacks
drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
drm/amdgpu/gfx10: add ring reset callbacks
drm/amdgpu/gfx10: rework reset sequence
drm/amdgpu/gfx9: add ring reset callback
drm/amdgpu/gfx9.4.3: add ring reset callback
drm/amdgpu/gfx12: add ring reset callbacks
drm/amdgpu/gfx12: fallback to driver reset compute queue directly
drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
drm/amdgpu/gfx11: add a mutex for the gfx semaphore
drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()
Jiadong Zhu (13):
drm/amdgpu/gfx11: wait for reset done before remap
drm/amdgpu/gfx10: remap queue after reset successfully
drm/amdgpu/gfx10: wait for reset done before remap
drm/amdgpu/gfx9: remap queue after reset successfully
drm/amdgpu/gfx9: wait for reset done before remap
drm/amdgpu/gfx9.4.3: remap queue after reset successfully
drm/amdgpu/gfx_9.4.3: wait for reset done before remap
drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
drm/amdgpu/mes: modify mes api for mmio queue reset
drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
drm/amdgpu/mes11: implement mmio queue reset for gfx11
Prike Liang (2):
drm/amdgpu: increase the reset counter for the queue reset
drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  18 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c    |  88 ++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h    |  37 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 158 ++++++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 117 +++++++++++++--
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h     |   3 +
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     |  95 ++++++++++++-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 126 +++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 125 +++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c     | 132 +++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/mes_v12_0.c     |  54 +++++++
 14 files changed, 930 insertions(+), 32 deletions(-)