Re: [PATCH v4] drm/i915/gt: Retry RING_HEAD reset until it get sticks

Andi Shyti <andi.shyti@xxxxxxxxxxxxxxx> · Thu, 17 Oct 2024 10:36:46 +0200

Hi Nitin,

> > > we see an issue where resets fail because the engine resumes from an
> > > incorrect RING_HEAD. Since the RING_HEAD doesn't point to the
> > > remaining requests to re-run, but may instead point into the
> > > uninitialised portion of the ring, the GPU may then be fed invalid
> > > instructions from a privileged context, often pushing the GPU into an
> > > unrecoverable hang.
> > >
> > > If at first the write doesn't succeed, try, try again.
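
For anyone following the thread, the retry idea described above can be
sketched roughly as follows. This is only an illustration under stated
assumptions, not the code from the patch: struct fake_ring,
fake_write_head(), fake_read_head() and set_head_with_retry() are
hypothetical names that simulate a register which may drop writes, and
the retry bound is an arbitrary parameter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a ring HEAD register; not the i915 API. */
struct fake_ring {
	uint32_t head;        /* current register value */
	unsigned int ignored; /* number of upcoming writes the "hardware" drops */
};

/* Simulated register write: silently dropped while ring->ignored > 0. */
static void fake_write_head(struct fake_ring *ring, uint32_t val)
{
	if (ring->ignored) {
		ring->ignored--;
		return;
	}
	ring->head = val;
}

/* Simulated register read-back. */
static uint32_t fake_read_head(const struct fake_ring *ring)
{
	return ring->head;
}

/*
 * Write val, read it back, and repeat until the value sticks or
 * max_tries attempts have been made. Returns true if the read-back
 * matched val, false if every attempt failed.
 */
static bool set_head_with_retry(struct fake_ring *ring, uint32_t val,
				unsigned int max_tries)
{
	for (unsigned int i = 0; i < max_tries; i++) {
		fake_write_head(ring, val);
		if (fake_read_head(ring) == val)
			return true;
	}
	return false;
}
```

The point of the read-back is that the caller can tell "the write
stuck" apart from "gave up", so a failure can still be reported rather
than resuming from a stale HEAD.
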
> > >
> > > v2: Avoid unnecessary timeout macro (Andi)
> > >
> > > v3: Correct comment format (Andi)
> > >
> > > v4: Make it generic for all platforms, as it won't impact them (Chris)
> > >
> > > Link: https://gitlab.freedesktop.org/drm/intel/-/issues/5432
> > > Testcase: igt/i915_selftest/hangcheck
> > 
> > The referenced HSW-specific gitlab issue was closed in 2022 and hadn't been
> > active for a while before that.  This patch from Chris was originally posted as an
> > attachment on that gitlab issue asking if it helped, but nobody responded that it
> > did/didn't improve the situation so it may or may not have been relevant to
> > what was originally reported in that ticket.
> > 
> > Looking in cibuglog, the most similar failures I see today are the ones getting
> > associated with issue #12310.  I.e.,
> > 
> >   <3> [220.415493] i915 0000:00:02.0: [drm] *ERROR* failed to set rcs0
> >   head to zero ctl 00000000 head 00001db8 tail 00000000 start 7fffa000
> > 
> > Are you trying to solve that CI issue or is there a different user-submitted report
> > somewhere that this patch is trying to address?
> > 
> > 
> > Matt
> > 
> 
> Yes. This patch is for https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12310
> I will update the link.

No worries, I can update the link here.

Reviewed-by: Andi Shyti <andi.shyti@xxxxxxxxxxxxxxx>

Thanks,
Andi