On Fri, Dec 16, 2016 at 12:20:05PM -0800, Michel Thierry wrote:
> From: Arun Siluvery <arun.siluvery@xxxxxxxxxxxxxxx>
>
> This change implements support for per-engine reset as an initial, less
> intrusive hang recovery option to be attempted before falling back to the
> legacy full GPU reset recovery mode if necessary. This is only supported
> from Gen8 onwards.
>
> Hangchecker determines which engines are hung and invokes error handler to
> recover from it. Error handler schedules recovery for each of those engines
> that are hung. The recovery procedure is as follows,
>  - identifies the request that caused the hang and it is dropped
>  - force engine to idle: this is done by issuing a reset request
>  - reset and re-init engine
>  - restart submissions to the engine
>
> If engine reset fails then we fall back to heavy weight full gpu reset
> which resets all engines and reinitiazes complete state of HW and SW.
>
> v2: Rebase.
>
> Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>
> Signed-off-by: Tomas Elf <tomas.elf@xxxxxxxxx>
> Signed-off-by: Arun Siluvery <arun.siluvery@xxxxxxxxxxxxxxx>
> Signed-off-by: Michel Thierry <michel.thierry@xxxxxxxxx>
> ---
>  drivers/gpu/drm/i915/i915_drv.c     | 56 +++++++++++++++++++++++++++++++++++--
>  drivers/gpu/drm/i915/i915_drv.h     |  3 ++
>  drivers/gpu/drm/i915/i915_gem.c     |  2 +-
>  drivers/gpu/drm/i915/intel_lrc.c    | 12 ++++++++
>  drivers/gpu/drm/i915/intel_lrc.h    |  1 +
>  drivers/gpu/drm/i915/intel_uncore.c | 41 ++++++++++++++++++++++++---
>  6 files changed, 108 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index e5688edd62cd..a034793bc246 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1830,18 +1830,70 @@ void i915_reset(struct drm_i915_private *dev_priv)
>   *
>   * Reset a specific GPU engine. Useful if a hang is detected.
>   * Returns zero on successful reset or otherwise an error code.
> + *
> + * Procedure is fairly simple:
> + *  - identifies the request that caused the hang and it is dropped
> + *  - force engine to idle: this is done by issuing a reset request
> + *  - reset engine
> + *  - restart submissions to the engine
>   */
>  int i915_reset_engine(struct intel_engine_cs *engine)

What's the serialisation between potential callers of
i915_reset_engine()?

>  {
>  	int ret;
>  	struct drm_i915_private *dev_priv = engine->i915;
>
> -	/* FIXME: replace me with engine reset sequence */
> -	ret = -ENODEV;
> +	/*
> +	 * We need to first idle the engine by issuing a reset request,
> +	 * then perform soft reset and re-initialize hw state, for all of
> +	 * this GT power need to be awake so ensure it does throughout the
> +	 * process
> +	 */
> +	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
> +
> +	/*
> +	 * the request that caused the hang is stuck on elsp, identify the
> +	 * active request and drop it, adjust head to skip the offending
> +	 * request to resume executing remaining requests in the queue.
> +	 */
> +	i915_gem_reset_engine(engine);

The engine and its irqs must be frozen first, before calling
i915_gem_reset_engine() (e.g. something like disable_engines_irq and
cancelling the tasklet).

Eeek, note that the current i915_gem_reset_engine() is lacking a spinlock.

> +
> +	ret = intel_engine_reset_begin(engine);
> +	if (ret) {
> +		DRM_ERROR("Failed to disable %s\n", engine->name);
> +		goto error;
> +	}
> +
> +	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
> +	if (ret) {
> +		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
> +		intel_engine_reset_cancel(engine);
> +		goto error;
> +	}
> +
> +	ret = engine->init_hw(engine);
> +	if (ret)
> +		goto error;
>
> +	intel_engine_reset_cancel(engine);
> +	intel_execlists_restart_submission(engine);

engine->init_hw(engine) *is* intel_execlists_restart_submission.

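To make the serialisation and ordering concerns above concrete, here is a
minimal userspace sketch in C. It is not i915 code: every name prefixed
with model_ is hypothetical, and the C11 atomic flag only stands in for
whatever per-engine guard the driver would actually use. It shows two
things: a second caller bails out with -EBUSY while a reset is in flight,
and the hung request is only dropped once the (modelled) irq and tasklet
have been switched off.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an engine; none of these fields are i915's. */
struct model_engine {
	atomic_bool reset_busy;	/* true while a reset of this engine runs */
	bool irq_enabled;	/* models the engine's interrupt source */
	bool tasklet_enabled;	/* models the submission tasklet */
	int resets_done;
};

/* Stop everything that could touch the request queue behind our back. */
static void model_quiesce(struct model_engine *e)
{
	e->irq_enabled = false;
	e->tasklet_enabled = false;
}

static void model_resume(struct model_engine *e)
{
	e->tasklet_enabled = true;
	e->irq_enabled = true;
}

/* Dropping the hung request is only safe once the engine is frozen. */
static int model_drop_hung_request(struct model_engine *e)
{
	if (e->irq_enabled || e->tasklet_enabled)
		return -EINVAL;
	return 0;
}

static int model_reset_engine(struct model_engine *e)
{
	int ret;

	/* Serialise callers: only the one that flips the flag proceeds. */
	if (atomic_exchange(&e->reset_busy, true))
		return -EBUSY;

	model_quiesce(e);		/* freeze first ... */
	ret = model_drop_hung_request(e);	/* ... then touch the queue */
	if (ret == 0)
		e->resets_done++;
	model_resume(e);

	atomic_store(&e->reset_busy, false);
	return ret;
}

/* Small self-check that doubles as a usage example. */
void model_selftest(void)
{
	struct model_engine e = { .irq_enabled = true, .tasklet_enabled = true };

	atomic_init(&e.reset_busy, false);

	assert(model_reset_engine(&e) == 0);
	assert(e.resets_done == 1 && e.irq_enabled && e.tasklet_enabled);

	/* Simulate a second caller arriving while a reset is in flight. */
	atomic_store(&e.reset_busy, true);
	assert(model_reset_engine(&e) == -EBUSY);
	assert(e.resets_done == 1);
}
```

Calling model_selftest() from a main() exercises both paths. Whether the
real function should return -EBUSY, wait, or defer to the full reset when
it loses the race is a policy choice this sketch does not try to settle.
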
> +
> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
> +	return 0;
> +
> +error:
>  	/* use full gpu reset to recover on error */
>  	set_bit(I915_RESET_IN_PROGRESS, &dev_priv->gpu_error.flags);
>
> +	/* Engine reset is performed without taking struct_mutex, since it
> +	 * failed we now fallback to full gpu reset. Wakeup any waiters
> +	 * which should now see the reset_in_progress and release
> +	 * struct_mutex for us to continue recovery.
> +	 */
> +	rcu_read_lock();
> +	intel_engine_wakeup(engine);
> +	rcu_read_unlock();
> +
> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>  	return ret;
>  }

-- 
Chris Wilson, Intel Open Source Technology Centre