Quoting Mika Kuoppala (2018-06-14 16:06:39) > Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes: > > > Hangcheck is our back up in case the GPU or the driver gets stuck. It > > detects when the GPU is not making any progress and issues a GPU reset. > > However, if the driver is failing to make any progress, we can get > > ourselves into a situation where we continually try resetting the GPU to > > no avail. Employ a second timeout such that if we continue to see the > > same seqno (the stalled engine has made no progress at all) over the > > course of several hangchecks, declare the driver wedged and attempt to > > start afresh. > > > > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > > Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxx> > > --- > > drivers/gpu/drm/i915/i915_debugfs.c | 5 +++-- > > drivers/gpu/drm/i915/i915_drv.h | 2 ++ > > drivers/gpu/drm/i915/intel_hangcheck.c | 17 ++++++++++++++++- > > drivers/gpu/drm/i915/intel_ringbuffer.h | 3 ++- > > 4 files changed, 23 insertions(+), 4 deletions(-) > > > > diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c > > index 82d06a03b22f..7e68684be1e4 100644 > > --- a/drivers/gpu/drm/i915/i915_debugfs.c > > +++ b/drivers/gpu/drm/i915/i915_debugfs.c > > @@ -1362,11 +1362,12 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused) > > seq_printf(m, " seqno = %x [current %x, last %x]\n", > > engine->hangcheck.seqno, seqno[id], > > intel_engine_last_submit(engine)); > > - seq_printf(m, " waiters? %s, fake irq active? %s, stalled? %s\n", > > + seq_printf(m, " waiters? %s, fake irq active? %s, stalled? %s, wedged? %s\n", > > yesno(intel_engine_has_waiter(engine)), > > yesno(test_bit(engine->id, > > &dev_priv->gpu_error.missed_irq_rings)), > > - yesno(engine->hangcheck.stalled)); > > + yesno(engine->hangcheck.stalled), > > + yesno(engine->hangcheck.wedged)); > > > > spin_lock_irq(&b->rb_lock); > > for (rb = rb_first(&b->waiters); rb; rb = rb_next(rb)) { > > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h > > index 6649962e991a..d254a29a59a0 100644 > > --- a/drivers/gpu/drm/i915/i915_drv.h > > +++ b/drivers/gpu/drm/i915/i915_drv.h > > @@ -909,6 +909,8 @@ struct i915_gem_mm { > > #define I915_ENGINE_DEAD_TIMEOUT (4 * HZ) /* Seqno, head and subunits dead */ > > #define I915_SEQNO_DEAD_TIMEOUT (12 * HZ) /* Seqno dead with active head */ > > > > +#define I915_ENGINE_WEDGED_TIMEOUT (60 * HZ) /* Reset but no recovery? */ > > + > > enum modeset_restore { > > MODESET_ON_LID_OPEN, > > MODESET_DONE, > > diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c > > index d47e346bd49e..2fc7a0dd0df9 100644 > > --- a/drivers/gpu/drm/i915/intel_hangcheck.c > > +++ b/drivers/gpu/drm/i915/intel_hangcheck.c > > @@ -294,6 +294,7 @@ static void hangcheck_store_sample(struct intel_engine_cs *engine, > > engine->hangcheck.seqno = hc->seqno; > > engine->hangcheck.action = hc->action; > > engine->hangcheck.stalled = hc->stalled; > > + engine->hangcheck.wedged = hc->wedged; > > } > > > > static enum intel_engine_hangcheck_action > > @@ -368,6 +369,9 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine, > > > > hc->stalled = time_after(jiffies, > > engine->hangcheck.action_timestamp + timeout); > > + hc->wedged = time_after(jiffies, > > + engine->hangcheck.action_timestamp + > > + I915_ENGINE_WEDGED_TIMEOUT); > > I was concerned that some callpath does end up zeroing > hangcheck->seqno through intel_engine_init_hangcheck. > > It seems that seqno wrap and unparking are the > only ones, with exception of setup paths that > does this. > > But as both wrap and unpark are done with idle > precondition, this should only at worst bring > one extra tick to reload the seqnos. > > So I can't poke holes in this and it should > prevent our driver mistakes ending up in > endless resets loops, like the tin says. > > Reviewed-by: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx> Thanks, it's just meant to be a safety net against our (driver) bugs, so I don't mind it being coarse. Just loud and abrasive. Pushed, -Chris _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx