On Thu, Feb 23, 2017 at 11:44:17AM -0800, Michel Thierry wrote:
> *** General ***
>
> Watchdog timeout (or "media engine reset") is a feature that allows
> userland applications to enable hang detection on individual batch buffers.
> The detection mechanism itself is mostly bound to the hardware and the only
> thing that the driver needs to do to support this form of hang detection
> is to implement the interrupt handling support as well as watchdog command
> emission before and after the emitted batch buffer start instruction in the
> ring buffer.
>
> The principle of the hang detection mechanism is as follows:
>
> 1. Once the decision has been made to enable watchdog timeout for a
> particular batch buffer and the driver is in the process of emitting the
> batch buffer start instruction into the ring buffer it also emits a
> watchdog timer start instruction before and a watchdog timer cancellation
> instruction after the batch buffer start instruction in the ring buffer.
>
> 2. Once the GPU execution reaches the watchdog timer start instruction
> the hardware watchdog counter is started by the hardware. The counter
> keeps counting until either reaching a previously configured threshold
> value or the timer cancellation instruction is executed.
>
> 2a. If the counter reaches the threshold value the hardware fires a
> watchdog interrupt that is picked up by the watchdog interrupt handler.
> This means that a hang has been detected and the driver needs to deal with
> it the same way it would deal with an engine hang detected by the periodic
> hang checker. The only difference between the two is that we already blamed
> the active request (to ensure an engine reset).
>
> 2b. If the batch buffer completes and the execution reaches the watchdog
> cancellation instruction before the watchdog counter reaches its
> threshold value the watchdog is cancelled and nothing more comes of it.
> No hang is detected.
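
To make steps 1 to 2b concrete, here is a rough userspace model of the
counter behaviour described above. It is only a sketch under stated
assumptions: struct wd_model, wd_start(), wd_tick(), wd_cancel() and
run_batch() are hypothetical names invented for illustration, "ticks"
stand in for whatever unit the hardware counts in, and none of this is
i915 code or a description of the real register interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the hardware watchdog counter. */
struct wd_model {
	uint32_t threshold;	/* previously configured timeout, in ticks */
	uint32_t count;		/* ticks elapsed since the start instruction */
	bool running;
	bool irq_fired;
};

/* Step 2: the timer start instruction starts the counter. */
static void wd_start(struct wd_model *wd, uint32_t threshold)
{
	assert(threshold > 0);	/* assumption: a zero threshold is invalid */
	wd->threshold = threshold;
	wd->count = 0;
	wd->running = true;
	wd->irq_fired = false;
}

/* Step 2b: the cancellation instruction stops the counter. */
static void wd_cancel(struct wd_model *wd)
{
	wd->running = false;
}

/* One tick of GPU time; step 2a fires the interrupt at the threshold. */
static void wd_tick(struct wd_model *wd)
{
	if (!wd->running)
		return;
	if (++wd->count >= wd->threshold) {
		wd->irq_fired = true;
		wd->running = false;
	}
}

/*
 * Model a batch that needs batch_ticks between the start and cancel
 * instructions. Returns true if the watchdog interrupt fired first,
 * i.e. a hang would have been reported.
 */
static bool run_batch(uint32_t threshold, uint32_t batch_ticks)
{
	struct wd_model wd;
	uint32_t i;

	wd_start(&wd, threshold);
	for (i = 0; i < batch_ticks && !wd.irq_fired; i++)
		wd_tick(&wd);
	wd_cancel(&wd);	/* harmless if the interrupt already fired */
	return wd.irq_fired;
}
```

With a threshold of 100 ticks, run_batch(100, 10) returns false (the
cancel is reached first, case 2b) and run_batch(100, 1000) returns true
(the threshold is reached first, case 2a).
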
>
> Note about future interaction with preemption: Preemption could happen
> in a command sequence prior to watchdog counter getting disabled,
> resulting in watchdog being triggered following preemption. The driver will
> need to explicitly disable the watchdog counter as part of the
> preemption sequence.
>
> *** This patch introduces: ***
>
> 1. IRQ handler code for watchdog timeout allowing direct hang recovery
> based on hardware-driven hang detection, which then integrates directly
> with the hang recovery path. This is independent of having per-engine reset
> or just full gpu reset.
>
> 2. Watchdog specific register information.
>
> Currently the render engine and all available media engines support
> watchdog timeout (VECS is only supported in GEN9). The specifications allude
> to the BCS engine being supported but that is currently not supported by
> this commit.
>
> Note that the value to stop the counter is different between render and
> non-render engines.
>
> Signed-off-by: Tomas Elf <tomas.elf@xxxxxxxxx>
> Signed-off-by: Ian Lister <ian.lister@xxxxxxxxx>
> Signed-off-by: Arun Siluvery <arun.siluvery@xxxxxxxxxxxxxxx>
> Signed-off-by: Michel Thierry <michel.thierry@xxxxxxxxx>
> ---
>  drivers/gpu/drm/i915/i915_drv.h        |  4 ++++
>  drivers/gpu/drm/i915/i915_irq.c        | 31 ++++++++++++++++++++++++++++++-
>  drivers/gpu/drm/i915/i915_reg.h        |  6 ++++++
>  drivers/gpu/drm/i915/intel_hangcheck.c | 13 +++++++++----
>  drivers/gpu/drm/i915/intel_lrc.c       | 16 ++++++++++++++++
>  5 files changed, 65 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index eed9ead1b592..0e4f4cc3c6de 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1568,6 +1568,9 @@ struct i915_gpu_error {
>  	 * recovery. All waiters on the reset_queue will be woken when
>  	 * that happens.
>  	 *
> +	 * When hw detects a hang before us, we can use I915_RESET_WATCHDOG to
> +	 * report the hang detection cause accurately.
> +	 *
>  	 * This counter is used by the wait_seqno code to notice that reset
>  	 * event happened and it needs to restart the entire ioctl (since most
>  	 * likely the seqno it waited for won't ever signal anytime soon).
> @@ -1580,6 +1583,7 @@ struct i915_gpu_error {
>
>  	unsigned long flags;
>  #define I915_RESET_IN_PROGRESS	0
> +#define I915_RESET_WATCHDOG	2 /* looking at the future */
>  #define I915_WEDGED		(BITS_PER_LONG - 1)
>
>  	/**
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index bc70e2c451b2..4ef73363bbe9 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1352,6 +1352,28 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
>  		set_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
>  		tasklet_hi_schedule(&engine->irq_tasklet);
>  	}
> +
> +	if (iir & (GT_GEN8_WATCHDOG_INTERRUPT << test_shift)) {
> +		struct drm_i915_private *dev_priv = engine->i915;
> +		u32 watchdog_disable;
> +
> +		if (engine->id == RCS)
> +			watchdog_disable = GEN8_RCS_WATCHDOG_DISABLE;
> +		else
> +			watchdog_disable = GEN8_XCS_WATCHDOG_DISABLE;
> +
> +		/* Stop the counter to prevent further timeout interrupts */
> +		I915_WRITE_FW(RING_CNTR(engine->mmio_base), watchdog_disable);

There's no guarantee you hold forcewake here; you need to use I915_WRITE.
Better yet would be to avoid having to wait for forcewake within the
hardirq handler.

> +
> +		/* Make sure the active request will be marked as guilty */
> +		engine->hangcheck.stalled = true;
> +		engine->hangcheck.seqno = intel_engine_get_seqno(engine);

Just set a flag saying the engine->hangcheck.watchdog = true. Don't
confuse us. engine->hangcheck.seqno does not give the guilty seqno!

Also there is no guarantee here that seqno is the guilty party. That's
a nasty bug.
Servicing the interrupt runs in parallel with the GPU, which may
complete the request before we read the HWS.

Please tell me we can use a PID along with the watchdog timer...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre