Re: [PATCH 3/5] drm/i915/execlists: Process interrupted context on reset

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Wed, 17 Jul 2019 14:40:26 +0100



Quoting Tvrtko Ursulin (2019-07-17 14:31:00)
> 
> On 16/07/2019 13:49, Chris Wilson wrote:
> > By stopping the rings, we may trigger an arbitration point resulting in
> > a premature context-switch (i.e. a completion event before the request
> > is actually complete). This clears the active context before the reset,
> > but we must remember to rewind the incomplete context for replay upon
> > resume.
> > 
> > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_lrc.c | 6 ++++--
> >   1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > index 9b87a2fc186c..7570a9256001 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > @@ -1419,7 +1419,8 @@ static void process_csb(struct intel_engine_cs *engine)
> >                        * coherent (visible from the CPU) before the
> >                        * user interrupt and CSB is processed.
> >                        */
> > -                     GEM_BUG_ON(!i915_request_completed(*execlists->active));
> > +                     GEM_BUG_ON(!i915_request_completed(*execlists->active) &&
> > +                                !reset_in_progress(execlists));
> >                       execlists_schedule_out(*execlists->active++);
> >   
> >                       GEM_BUG_ON(execlists->active - execlists->inflight >
> > @@ -2254,7 +2255,7 @@ static void __execlists_reset(struct intel_engine_cs *engine, bool stalled)
> >        */
> >       rq = execlists_active(execlists);
> >       if (!rq)
> > -             return;
> > +             goto unwind;
> >   
> >       ce = rq->hw_context;
> >       GEM_BUG_ON(i915_active_is_idle(&ce->active));
> > @@ -2331,6 +2332,7 @@ static void __execlists_reset(struct intel_engine_cs *engine, bool stalled)
> >       intel_ring_update_space(ce->ring);
> >       __execlists_update_reg_state(ce, engine);
> >   
> > +unwind:
> >       /* Push back any incomplete requests for replay after the reset. */
> >       __unwind_incomplete_requests(engine);
> >   }
> > 
> 
> Sounds plausible.
> 
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> 
> Shouldn't there be a Fixes: tag to go with it?

Yeah, it's rare even by our standards, I think there's a live_hangcheck
failure about once a month that could be the result of this. However,
the result would be an unrecoverable GPU hang as each attempt at
resetting would not see the missing request and so it would remain
perpetually in the engine->active.list until a set-wedged (i.e. suspend
in the user case).
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx