Re: [PATCH] drm/i915: Declare the driver wedged if hangcheck makes no progress

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Thu, 14 Jun 2018 19:39:23 +0100

Quoting Mika Kuoppala (2018-06-14 16:06:39)
> Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes:
> 
> > Hangcheck is our back up in case the GPU or the driver gets stuck. It
> > detects when the GPU is not making any progress and issues a GPU reset.
> > However, if the driver is failing to make any progress, we can get
> > ourselves into a situation where we continually try resetting the GPU to
> > no avail. Employ a second timeout such that if we continue to see the
> > same seqno (the stalled engine has made no progress at all) over the
> > course of several hangchecks, declare the driver wedged and attempt to
> > start afresh.
> >
> > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxx>
> > ---
> >  drivers/gpu/drm/i915/i915_debugfs.c     |  5 +++--
> >  drivers/gpu/drm/i915/i915_drv.h         |  2 ++
> >  drivers/gpu/drm/i915/intel_hangcheck.c  | 17 ++++++++++++++++-
> >  drivers/gpu/drm/i915/intel_ringbuffer.h |  3 ++-
> >  4 files changed, 23 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> > index 82d06a03b22f..7e68684be1e4 100644
> > --- a/drivers/gpu/drm/i915/i915_debugfs.c
> > +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> > @@ -1362,11 +1362,12 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
> >               seq_printf(m, "    seqno = %x [current %x, last %x]\n",
> >                          engine->hangcheck.seqno, seqno[id],
> >                          intel_engine_last_submit(engine));
> > -             seq_printf(m, "    waiters? %s, fake irq active? %s, stalled? %s\n",
> > +             seq_printf(m, "    waiters? %s, fake irq active? %s, stalled? %s, wedged? %s\n",
> >                          yesno(intel_engine_has_waiter(engine)),
> >                          yesno(test_bit(engine->id,
> >                                         &dev_priv->gpu_error.missed_irq_rings)),
> > -                        yesno(engine->hangcheck.stalled));
> > +                        yesno(engine->hangcheck.stalled),
> > +                        yesno(engine->hangcheck.wedged));
> >  
> >               spin_lock_irq(&b->rb_lock);
> >               for (rb = rb_first(&b->waiters); rb; rb = rb_next(rb)) {
> > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> > index 6649962e991a..d254a29a59a0 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.h
> > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > @@ -909,6 +909,8 @@ struct i915_gem_mm {
> >  #define I915_ENGINE_DEAD_TIMEOUT  (4 * HZ)  /* Seqno, head and subunits dead */
> >  #define I915_SEQNO_DEAD_TIMEOUT   (12 * HZ) /* Seqno dead with active head */
> >  
> > +#define I915_ENGINE_WEDGED_TIMEOUT  (60 * HZ)  /* Reset but no recovery? */
> > +
> >  enum modeset_restore {
> >       MODESET_ON_LID_OPEN,
> >       MODESET_DONE,
> > diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> > index d47e346bd49e..2fc7a0dd0df9 100644
> > --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> > +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> > @@ -294,6 +294,7 @@ static void hangcheck_store_sample(struct intel_engine_cs *engine,
> >       engine->hangcheck.seqno = hc->seqno;
> >       engine->hangcheck.action = hc->action;
> >       engine->hangcheck.stalled = hc->stalled;
> > +     engine->hangcheck.wedged = hc->wedged;
> >  }
> >  
> >  static enum intel_engine_hangcheck_action
> > @@ -368,6 +369,9 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
> >  
> >       hc->stalled = time_after(jiffies,
> >                                engine->hangcheck.action_timestamp + timeout);
> > +     hc->wedged = time_after(jiffies,
> > +                              engine->hangcheck.action_timestamp +
> > +                              I915_ENGINE_WEDGED_TIMEOUT);
> 
> I was concerned that some callpath does end up zeroing
> hangcheck->seqno through intel_engine_init_hangcheck.
> 
> It seems that seqno wrap and unparking are the
> only ones, with exception of setup paths that
> does this.
> 
> But as both wrap and unpark are done with idle
> precondition, this should only at worst bring
> one extra tick to reload the seqnos.
> 
> So I can't poke holes in this and it should
> prevent our driver mistakes ending up in
> endless resets loops, like the tin says.
> 
> Reviewed-by: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>

Thanks, it's just meant to be a safety net against our (driver) bugs, so
I don't mind it being coarse. Just loud and abrasive.

Pushed,
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx