On Sun, Dec 18, 2016 at 09:02:33PM -0800, Michel Thierry wrote: > > > On 12/16/2016 12:45 PM, Chris Wilson wrote: > >On Fri, Dec 16, 2016 at 12:20:05PM -0800, Michel Thierry wrote: > >>From: Arun Siluvery <arun.siluvery@xxxxxxxxxxxxxxx> > >> > >>This change implements support for per-engine reset as an initial, less > >>intrusive hang recovery option to be attempted before falling back to the > >>legacy full GPU reset recovery mode if necessary. This is only supported > >>from Gen8 onwards. > >> > >>Hangchecker determines which engines are hung and invokes error handler to > >>recover from it. Error handler schedules recovery for each of those engines > >>that are hung. The recovery procedure is as follows, > >> - identifies the request that caused the hang and it is dropped > >> - force engine to idle: this is done by issuing a reset request > >> - reset and re-init engine > >> - restart submissions to the engine > >> > >>If engine reset fails then we fall back to heavy weight full gpu reset > >>which resets all engines and reinitiazes complete state of HW and SW. > >> > >>v2: Rebase. > >> > >>Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > >>Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx> > >>Signed-off-by: Tomas Elf <tomas.elf@xxxxxxxxx> > >>Signed-off-by: Arun Siluvery <arun.siluvery@xxxxxxxxxxxxxxx> > >>Signed-off-by: Michel Thierry <michel.thierry@xxxxxxxxx> > >>--- > >> drivers/gpu/drm/i915/i915_drv.c | 56 +++++++++++++++++++++++++++++++++++-- > >> drivers/gpu/drm/i915/i915_drv.h | 3 ++ > >> drivers/gpu/drm/i915/i915_gem.c | 2 +- > >> drivers/gpu/drm/i915/intel_lrc.c | 12 ++++++++ > >> drivers/gpu/drm/i915/intel_lrc.h | 1 + > >> drivers/gpu/drm/i915/intel_uncore.c | 41 ++++++++++++++++++++++++--- > >> 6 files changed, 108 insertions(+), 7 deletions(-) > >> > >>diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c > >>index e5688edd62cd..a034793bc246 100644 > >>--- a/drivers/gpu/drm/i915/i915_drv.c > >>+++ b/drivers/gpu/drm/i915/i915_drv.c > >>@@ -1830,18 +1830,70 @@ void i915_reset(struct drm_i915_private *dev_priv) > >> * > >> * Reset a specific GPU engine. Useful if a hang is detected. > >> * Returns zero on successful reset or otherwise an error code. > >>+ * > >>+ * Procedure is fairly simple: > >>+ * - identifies the request that caused the hang and it is dropped > >>+ * - force engine to idle: this is done by issuing a reset request > >>+ * - reset engine > >>+ * - restart submissions to the engine > >> */ > >> int i915_reset_engine(struct intel_engine_cs *engine) > > > >What's the serialisation between potential callers of > >i915_reset_engine()? > > > > I haven't seen simultaneous calls happening yet, would a > reset_engine-specific mutex be enough? That is what it more or less boils down to. But first, do we ensure we don't declare a second hang on this engine before the first reset is complete? Then we only a barrier between a single engine reset and a full reset. > >> { > >> int ret; > >> struct drm_i915_private *dev_priv = engine->i915; > >> > >>- /* FIXME: replace me with engine reset sequence */ > >>- ret = -ENODEV; > >>+ /* > >>+ * We need to first idle the engine by issuing a reset request, > >>+ * then perform soft reset and re-initialize hw state, for all of > >>+ * this GT power need to be awake so ensure it does throughout the > >>+ * process > >>+ */ > >>+ intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL); > >>+ > >>+ /* > >>+ * the request that caused the hang is stuck on elsp, identify the > >>+ * active request and drop it, adjust head to skip the offending > >>+ * request to resume executing remaining requests in the queue. > >>+ */ > >>+ i915_gem_reset_engine(engine); > > > >Must freeze the engine and irqs first, before calling > >i915_gem_reset_engine() (i.e. something like disable_engines_irq, > >cancelling tasklet) > > > Will do. > > >Eeek note that the current i915_gem_reset_engine() is lacking a > >spinlock. > > > > The new mutex (for i915_reset_engine) should cover this. No. We need to protect these lists from concurrent manipulation in the fences. The spinlock protection there needs to be extended here, being covered by struct_mutex is no longer sufficient protection for the current code. -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx