Hi Matt,

> -----Original Message-----
> From: Roper, Matthew D <matthew.d.roper@xxxxxxxxx>
> Sent: Wednesday, October 16, 2024 5:23 AM
> To: Gote, Nitin R <nitin.r.gote@xxxxxxxxx>
> Cc: intel-gfx@xxxxxxxxxxxxxxxxxxxxx; Shyti, Andi <andi.shyti@xxxxxxxxx>; Wilson,
> Chris P <chris.p.wilson@xxxxxxxxx>; Das, Nirmoy <nirmoy.das@xxxxxxxxx>;
> Chris Wilson <chris.p.wilson@xxxxxxxxxxxxxxx>
> Subject: Re: [PATCH v4] drm/i915/gt: Retry RING_HEAD reset until it get sticks
>
> On Tue, Oct 15, 2024 at 08:27:10PM +0530, Nitin Gote wrote:
> > we see an issue where resets fails because the engine resumes from an
> > incorrect RING_HEAD. Since the RING_HEAD doesn't point to the
> > remaining requests to re-run, but may instead point into the
> > uninitialised portion of the ring, the GPU may be then fed invalid
> > instructions from a privileged context, oft pushing the GPU into an
> > unrecoverable hang.
> >
> > If at first the write doesn't succeed, try, try again.
> >
> > v2: Avoid unnecessary timeout macro (Andi)
> >
> > v3: Correct comment format (Andi)
> >
> > v4: Make it generic for all platform as it won't impact (Chris)
> >
> > Link: https://gitlab.freedesktop.org/drm/intel/-/issues/5432
> > Testcase: igt/i915_selftest/hangcheck
>
> The referenced HSW-specific gitlab issue was closed in 2022 and hadn't been
> active for a while before that. This patch from Chris was originally posted as an
> attachment on that gitlab issue asking if it helped, but nobody responded that it
> did/didn't improve the situation so it may or may not have been relevant to
> what was originally reported in that ticket.
>
> Looking in cibuglog, the most similar failures I see today are the ones getting
> associated with issue #12310. I.e.,
>
> <3> [220.415493] i915 0000:00:02.0: [drm] *ERROR* failed to set rcs0 head to zero ctl 00000000 head 00001db8 tail 00000000 start 7fffa000
>
> Are you trying to solve that CI issue or is there a different user-submitted report
> somewhere that this patch is trying to address?
>
>
> Matt
>

Yes. This patch is for https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12310
I will update the link.

- Nitin

> > Signed-off-by: Chris Wilson <chris.p.wilson@xxxxxxxxxxxxxxx>
> > Signed-off-by: Nitin Gote <nitin.r.gote@xxxxxxxxx>
> > ---
> >  .../gpu/drm/i915/gt/intel_ring_submission.c | 31 ++++++++++++++++---
> >  1 file changed, 27 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > index 72277bc8322e..b6b25fe22cb8 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > @@ -192,6 +192,7 @@ static bool stop_ring(struct intel_engine_cs *engine)
> >  static int xcs_resume(struct intel_engine_cs *engine)
> >  {
> >          struct intel_ring *ring = engine->legacy.ring;
> > +        ktime_t kt;
> >
> >          ENGINE_TRACE(engine, "ring:{HEAD:%04x, TAIL:%04x}\n",
> >                       ring->head, ring->tail);
> > @@ -230,9 +231,27 @@ static int xcs_resume(struct intel_engine_cs *engine)
> >          set_pp_dir(engine);
> >
> >          /* First wake the ring up to an empty/idle ring */
> > -        ENGINE_WRITE_FW(engine, RING_HEAD, ring->head);
> > +        for ((kt) = ktime_get() + (2 * NSEC_PER_MSEC);
> > +             ktime_before(ktime_get(), (kt)); cpu_relax()) {
> > +                /*
> > +                 * In case of resets fails because engine resumes from
> > +                 * incorrect RING_HEAD and then GPU may be then fed
> > +                 * to invalid instrcutions, which may lead to unrecoverable
> > +                 * hang. So at first write doesn't succeed then try again.
> > +                 */
> > +                ENGINE_WRITE_FW(engine, RING_HEAD, ring->head);
> > +                if (ENGINE_READ_FW(engine, RING_HEAD) == ring->head)
> > +                        break;
> > +        }
> > +
> >          ENGINE_WRITE_FW(engine, RING_TAIL, ring->head);
> > -        ENGINE_POSTING_READ(engine, RING_TAIL);
> > +        if (ENGINE_READ_FW(engine, RING_HEAD) != ENGINE_READ_FW(engine, RING_TAIL)) {
> > +                ENGINE_TRACE(engine, "failed to reset empty ring: [%x, %x]: %x\n",
> > +                             ENGINE_READ_FW(engine, RING_HEAD),
> > +                             ENGINE_READ_FW(engine, RING_TAIL),
> > +                             ring->head);
> > +                goto err;
> > +        }
> >
> >          ENGINE_WRITE_FW(engine, RING_CTL,
> >                          RING_CTL_SIZE(ring->size) | RING_VALID);
> > @@ -241,12 +260,16 @@ static int xcs_resume(struct intel_engine_cs *engine)
> >          if (__intel_wait_for_register_fw(engine->uncore,
> >                                           RING_CTL(engine->mmio_base),
> >                                           RING_VALID, RING_VALID,
> > -                                         5000, 0, NULL))
> > +                                         5000, 0, NULL)) {
> > +                ENGINE_TRACE(engine, "failed to restart\n");
> >                  goto err;
> > +        }
> >
> > -        if (GRAPHICS_VER(engine->i915) > 2)
> > +        if (GRAPHICS_VER(engine->i915) > 2) {
> >                  ENGINE_WRITE_FW(engine,
> >                                  RING_MI_MODE, _MASKED_BIT_DISABLE(STOP_RING));
> > +                ENGINE_POSTING_READ(engine, RING_MI_MODE);
> > +        }
> >
> >          /* Now awake, let it get started */
> >          if (ring->tail != ring->head) {
> > --
> > 2.25.1
> >
>
> --
> Matt Roper
> Graphics Software Engineer
> Linux GPU Platform Enablement
> Intel Corporation
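
For anyone following the thread without the driver source at hand, the retry loop in the quoted patch boils down to a write, read back, retry-until-deadline pattern. Below is a minimal user-space sketch of that pattern in C. It is an illustration only, not the i915 code: `struct fake_reg`, `fake_write()`, `fake_read()`, `now_ns()` and `write_until_sticks()` are hypothetical names invented here, the "register" is a plain struct that drops a configurable number of writes, and `clock_gettime(CLOCK_MONOTONIC)` stands in for the kernel's `ktime_get()`.

```c
#define _POSIX_C_SOURCE 199309L

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for a hardware register that ignores its
 * first few writes (not part of the i915 driver). */
struct fake_reg {
	uint32_t value;
	int writes_to_drop;
};

/* Write to the simulated register; the write is silently lost while
 * writes_to_drop is still positive. */
static void fake_write(struct fake_reg *reg, uint32_t v)
{
	if (reg->writes_to_drop > 0) {
		reg->writes_to_drop--;
		return;
	}
	reg->value = v;
}

static uint32_t fake_read(const struct fake_reg *reg)
{
	return reg->value;
}

/* Monotonic time in nanoseconds; user-space substitute for ktime_get(). */
static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Write v and read it back, repeating until the value sticks or
 * timeout_ns has elapsed. Returns true if the read-back matched. */
static bool write_until_sticks(struct fake_reg *reg, uint32_t v,
			       int64_t timeout_ns)
{
	const int64_t deadline = now_ns() + timeout_ns;

	do {
		fake_write(reg, v);
		if (fake_read(reg) == v)
			return true;
	} while (now_ns() < deadline);

	return false;
}
```

One deliberate difference from the quoted loop: the sketch uses `do { } while` so that at least one write is always attempted, and it returns whether the value stuck, so the caller can decide how to report a failure. The quoted patch instead falls through and relies on the later HEAD/TAIL comparison to catch a write that never took effect.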