[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Alex Deucher
> Sent: Friday, July 19, 2024 9:40 PM
> To: Friedrich Vock <friedrich.vock@xxxxxx>
> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCH 00/34] GC per queue reset
>
> On Thu, Jul 18, 2024 at 1:00 PM Friedrich Vock <friedrich.vock@xxxxxx> wrote:
> >
> > Hi,
> >
> > On 18.07.24 16:06, Alex Deucher wrote:
> > > This adds preliminary support for GC per queue reset. In this case,
> > > only the jobs currently in the queue are lost. If this fails, we
> > > fall back to a full adapter reset.
> >
> > First of all, thank you so much for working on this! It's great to
> > finally see progress in making GPU resets better.
> >
> > I've just taken this patchset (together with your other
> > patchsets[1][2][3]) for a quick spin on my Navi21 with the GPU reset
> > tests[4] I had written a while ago - the current patchset sadly seems
> > to have some regressions WRT recovery there.
> >
> > I ran the tests under my Plasma Wayland session once - this triggered
> > a list double-add in drm_sched_stop (calltrace follows):
>
> I think this should fix the double add:
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 7107c4d3a3b6..555d3b671bdb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -88,6 +88,8 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>  			drm_sched_start(&ring->sched, true);
>  			goto exit;
>  		}
> +		if (amdgpu_ring_sched_ready(ring))
> +			drm_sched_start(&ring->sched, true);
>  	}
>
>  	if (amdgpu_device_should_recover_gpu(ring->adev)) {
>
> >
> > ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447) ?
> > do_trap (arch/x86/kernel/traps.c:113 arch/x86/kernel/traps.c:154)
> > ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> > ? do_error_trap (./arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:175)
> > ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> > ? exc_invalid_op (arch/x86/kernel/traps.c:266)
> > ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> > ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:568)
> > ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> > ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> > drm_sched_stop (./include/linux/list.h:151 ./include/linux/list.h:169 drivers/gpu/drm/scheduler/sched_main.c:617)
> > amdgpu_device_gpu_recover (drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:5808)
> > amdgpu_job_timedout (drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:103)
> > drm_sched_job_timedout (drivers/gpu/drm/scheduler/sched_main.c:569)
> > process_one_work (kernel/workqueue.c:2633)
> > worker_thread (kernel/workqueue.c:2700 (discriminator 2) kernel/workqueue.c:2787 (discriminator 2))
> > ? __pfx_worker_thread (kernel/workqueue.c:2733)
> > kthread (kernel/kthread.c:388)
> > ? __pfx_kthread (kernel/kthread.c:341)
> > ret_from_fork (arch/x86/kernel/process.c:147)
> > ? __pfx_kthread (kernel/kthread.c:341)
> > ret_from_fork_asm (arch/x86/entry/entry_64.S:251)
> >
> > When running the tests without a desktop environment active, the
> > double-add disappeared, but the GPU reset still didn't go well - the
> > TTY remained frozen and the kernel log contained a few messages like:
> >
> > [drm] *ERROR* [CRTC:90:crtc-0] flip_done timed out

Hi Friedrich,

We cannot reproduce the flip_done timeout on a dGPU. Could you check whether the hang test runs on the integrated GPU or the dGPU? If it runs on the iGPU, could you try disabling the iGPU in the BIOS to see whether that helps? Thanks.
Thanks,
Jiadong

> I don't think the display hardware is hung, I think it's a fence
> signalling issue after the reset. We are investigating some limitations
> we are seeing in the handling of fences.
>
> >
> > which I guess means at least the display subsystem is hung.
> >
> > Hope this info is enough to repro/investigate.
>
> Thanks for testing!
>
> Alex
>
> >
> > Thanks,
> > Friedrich
> >
> > [1] https://lore.kernel.org/amd-gfx/20240717203740.14059-1-alexander.deucher@xxxxxxx/T/#t
> > [2] https://lore.kernel.org/amd-gfx/20240717203847.14600-1-alexander.deucher@xxxxxxx/T/#t
> > [3] https://lore.kernel.org/amd-gfx/230ee72e-4f7f-4894-a789-2e1e5788344f@amd.com/T/#t
> > [4] https://gitlab.steamos.cloud/holo/HangTestSuite
> >
> > >
> > > Alex Deucher (19):
> > >   drm/amdgpu/mes: add API for legacy queue reset
> > >   drm/amdgpu/mes11: add API for legacy queue reset
> > >   drm/amdgpu/mes12: add API for legacy queue reset
> > >   drm/amdgpu/mes: add API for user queue reset
> > >   drm/amdgpu/mes11: add API for user queue reset
> > >   drm/amdgpu/mes12: add API for user queue reset
> > >   drm/amdgpu: add new ring reset callback
> > >   drm/amdgpu: add per ring reset support (v2)
> > >   drm/amdgpu/gfx11: add ring reset callbacks
> > >   drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
> > >   drm/amdgpu/gfx10: add ring reset callbacks
> > >   drm/amdgpu/gfx10: rework reset sequence
> > >   drm/amdgpu/gfx9: add ring reset callback
> > >   drm/amdgpu/gfx9.4.3: add ring reset callback
> > >   drm/amdgpu/gfx12: add ring reset callbacks
> > >   drm/amdgpu/gfx12: fallback to driver reset compute queue directly
> > >   drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
> > >   drm/amdgpu/gfx11: add a mutex for the gfx semaphore
> > >   drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()
> > >
> > > Jiadong Zhu (13):
> > >   drm/amdgpu/gfx11: wait for reset done before remap
> > >   drm/amdgpu/gfx10: remap queue after reset successfully
> > >   drm/amdgpu/gfx10: wait for
> > >     reset done before remap
> > >   drm/amdgpu/gfx9: remap queue after reset successfully
> > >   drm/amdgpu/gfx9: wait for reset done before remap
> > >   drm/amdgpu/gfx9.4.3: remap queue after reset successfully
> > >   drm/amdgpu/gfx_9.4.3: wait for reset done before remap
> > >   drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
> > >   drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
> > >   drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
> > >   drm/amdgpu/mes: modify mes api for mmio queue reset
> > >   drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
> > >   drm/amdgpu/mes11: implement mmio queue reset for gfx11
> > >
> > > Prike Liang (2):
> > >   drm/amdgpu: increase the reset counter for the queue reset
> > >   drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)
> > >
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |   6 +
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  18 +++
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c    |  88 ++++++++++++
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h    |  37 +++++
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
> > >  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 158 ++++++++++++++++++++-
> > >  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 117 +++++++++++++--
> > >  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h     |   3 +
> > >  drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     |  95 ++++++++++++-
> > >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 126 +++++++++++++++-
> > >  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 125 +++++++++++++++-
> > >  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c     | 132 +++++++++++++++++
> > >  drivers/gpu/drm/amd/amdgpu/mes_v12_0.c     |  54 +++++++
> > >  14 files changed, 930 insertions(+), 32 deletions(-)