On Mon, 2019-03-25 at 10:00 +0000, Tvrtko Ursulin wrote: > On 22/03/2019 23:41, Carlos Santa wrote: > > From: Michel Thierry <michel.thierry@xxxxxxxxx> > > > > *** General *** > > > > Watchdog timeout (or "media engine reset") is a feature that allows > > userland applications to enable hang detection on individual batch > > buffers. > > The detection mechanism itself is mostly bound to the hardware and > > the only > > thing that the driver needs to do to support this form of hang > > detection > > is to implement the interrupt handling support as well as watchdog > > command > > emission before and after the emitted batch buffer start > > instruction in the > > ring buffer. > > > > The principle of the hang detection mechanism is as follows: > > > > 1. Once the decision has been made to enable watchdog timeout for a > > particular batch buffer and the driver is in the process of > > emitting the > > batch buffer start instruction into the ring buffer it also emits a > > watchdog timer start instruction before and a watchdog timer > > cancellation > > instruction after the batch buffer start instruction in the ring > > buffer. > > > > 2. Once the GPU execution reaches the watchdog timer start > > instruction > > the hardware watchdog counter is started by the hardware. The > > counter > > keeps counting until either reaching a previously configured > > threshold > > value or the timer cancellation instruction is executed. > > > > 2a. If the counter reaches the threshold value the hardware fires a > > watchdog interrupt that is picked up by the watchdog interrupt > > handler. > > This means that a hang has been detected and the driver needs to > > deal with > > it the same way it would deal with a engine hang detected by the > > periodic > > hang checker. The only difference between the two is that we > > already blamed > > the active request (to ensure an engine reset). > > > > 2b. If the batch buffer completes and the execution reaches the > > watchdog > > cancellation instruction before the watchdog counter reaches its > > threshold value the watchdog is cancelled and nothing more comes of > > it. > > No hang is detected. > > > > Note about future interaction with preemption: Preemption could > > happen > > in a command sequence prior to watchdog counter getting disabled, > > resulting in watchdog being triggered following preemption (e.g. > > when > > watchdog had been enabled in the low priority batch). The driver > > will > > need to explicitly disable the watchdog counter as part of the > > preemption sequence. > > > > *** This patch introduces: *** > > > > 1. IRQ handler code for watchdog timeout allowing direct hang > > recovery > > based on hardware-driven hang detection, which then integrates > > directly > > with the hang recovery path. This is independent of having per- > > engine reset > > or just full gpu reset. > > > > 2. Watchdog specific register information. > > > > Currently the render engine and all available media engines support > > watchdog timeout (VECS is only supported in GEN9). The > > specifications elude > > to the BCS engine being supported but that is currently not > > supported by > > this commit. > > > > Note that the value to stop the counter is different between render > > and > > non-render engines in GEN8; GEN9 onwards it's the same. > > > > v2: Move irq handler to tasklet, arm watchdog for a 2nd time to > > check > > against false-positives. > > > > v3: Don't use high priority tasklet, use engine_last_submit while > > checking for false-positives. From GEN9 onwards, the stop counter > > bit is > > the same for all engines. > > > > v4: Remove unnecessary brackets, use current_seqno to mark the > > request > > as guilty in the hangcheck/capture code. > > > > v5: Rebased after RESET_ENGINEs flag. > > > > v6: Don't capture error state in case of watchdog timeout. The > > capture > > process is time consuming and this will align to what happens when > > we > > use GuC to handle the watchdog timeout. (Chris) > > > > v7: Rebase. > > > > v8: Rebase, use HZ to reschedule. > > > > v9: Rebase, get forcewake domains in function (no longer in > > execlists > > struct). > > > > v10: Rebase. > > > > v11: Rebase, > > remove extra braces (Tvrtko), > > implement watchdog_to_clock_counts helper (Tvrtko), > > Move tasklet_kill(watchdog_tasklet) inside intel_engines > > (Tvrtko), > > Use a global heartbeat seqno instead of engine seqno (Chris) > > Make all engines checks all class based checks (Tvrtko) > > > > v12: Rebase, > > Reset immediately upon entering the IRQ (Chris) > > Make reset_engine_to_str a helper (Tvrtko) > > Rename watchdog_irq_handler as watchdog_tasklet (Tvrtko) > > Let the compiler itself do the inline (Tvrtko) > > > > v13: Rebase > > v14: Rebase, skip checking for the guilty seqno in the tasklet > > (Tvrtko) > > IIRC I only asked about it so I guess you tried it and it works well? > > Can you also post the IGTs so we can see the test coverage? > > Regards, > > Tvrtko Yeah, the unit test works but it's still the very simple one (updated for the uAPI mods) I started with last August, I still need to add more to it though... https://lists.freedesktop.org/archives/igt-dev/2018-September/005834.html Right now, I am on the hrtimer workaround to see how that works, will get back to the test coverage after that... Regards, Carlos > > > > > Cc: Antonio Argenziano <antonio.argenziano@xxxxxxxxx> > > Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> > > Signed-off-by: Michel Thierry <michel.thierry@xxxxxxxxx> > > Signed-off-by: Carlos Santa <carlos.santa@xxxxxxxxx> > > --- > > drivers/gpu/drm/i915/i915_gpu_error.h | 4 ++ > > drivers/gpu/drm/i915/i915_irq.c | 14 ++++-- > > drivers/gpu/drm/i915/i915_reg.h | 6 +++ > > drivers/gpu/drm/i915/i915_reset.c | 20 +++++++++ > > drivers/gpu/drm/i915/i915_reset.h | 6 +++ > > drivers/gpu/drm/i915/intel_engine_cs.c | 1 + > > drivers/gpu/drm/i915/intel_engine_types.h | 5 +++ > > drivers/gpu/drm/i915/intel_hangcheck.c | 11 +---- > > drivers/gpu/drm/i915/intel_lrc.c | 52 > > +++++++++++++++++++++++ > > 9 files changed, 107 insertions(+), 12 deletions(-) > > > > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h > > b/drivers/gpu/drm/i915/i915_gpu_error.h > > index 99d6b7b270c2..6cf6a8679b26 100644 > > --- a/drivers/gpu/drm/i915/i915_gpu_error.h > > +++ b/drivers/gpu/drm/i915/i915_gpu_error.h > > @@ -203,6 +203,9 @@ struct i915_gpu_error { > > * any global resources that may be clobber by the reset (such > > as > > * FENCE registers). > > * > > + * #I915_RESET_WATCHDOG - When hw detects a hang before us, we > > can use > > + * I915_RESET_WATCHDOG to report the hang detection cause > > accurately. > > + * > > * #I915_RESET_ENGINE[num_engines] - Since the driver doesn't > > need to > > * acquire the struct_mutex to reset an engine, we need an > > explicit > > * flag to prevent two concurrent reset attempts in the same > > engine. > > @@ -218,6 +221,7 @@ struct i915_gpu_error { > > #define I915_RESET_BACKOFF 0 > > #define I915_RESET_MODESET 1 > > #define I915_RESET_ENGINE 2 > > +#define I915_RESET_WATCHDOG 3 > > #define I915_WEDGED (BITS_PER_LONG - 1) > > > > /** Number of times the device has been reset (global) */ > > diff --git a/drivers/gpu/drm/i915/i915_irq.c > > b/drivers/gpu/drm/i915/i915_irq.c > > index 82d487189a34..e64994be25c3 100644 > > --- a/drivers/gpu/drm/i915/i915_irq.c > > +++ b/drivers/gpu/drm/i915/i915_irq.c > > @@ -1466,6 +1466,9 @@ gen8_cs_irq_handler(struct intel_engine_cs > > *engine, u32 iir) > > > > if (tasklet) > > tasklet_hi_schedule(&engine->execlists.tasklet); > > + > > + if (iir & GT_GEN8_WATCHDOG_INTERRUPT) > > + tasklet_schedule(&engine->execlists.watchdog_tasklet); > > } > > > > static void gen8_gt_irq_ack(struct drm_i915_private *i915, > > @@ -3892,20 +3895,25 @@ static void gen8_gt_irq_postinstall(struct > > drm_i915_private *dev_priv) > > u32 gt_interrupts[] = { > > (GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT | > > GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT | > > + GT_GEN8_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT | > > GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT | > > GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT), > > - > > (GT_RENDER_USER_INTERRUPT << GEN8_VCS0_IRQ_SHIFT | > > GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS0_IRQ_SHIFT | > > + GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS0_IRQ_SHIFT | > > GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT | > > - GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT), > > - > > + GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT | > > + GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT), > > 0, > > > > (GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT | > > GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VECS_IRQ_SHIFT) > > }; > > > > + /* VECS watchdog is only available in skl+ */ > > + if (INTEL_GEN(dev_priv) >= 9) > > + gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT; > > + > > dev_priv->pm_ier = 0x0; > > dev_priv->pm_imr = ~dev_priv->pm_ier; > > GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]); > > diff --git a/drivers/gpu/drm/i915/i915_reg.h > > b/drivers/gpu/drm/i915/i915_reg.h > > index 9b69cec21f7b..ac8d984e16ae 100644 > > --- a/drivers/gpu/drm/i915/i915_reg.h > > +++ b/drivers/gpu/drm/i915/i915_reg.h > > @@ -2363,6 +2363,11 @@ enum i915_power_well_id { > > #define RING_START(base) _MMIO((base) + 0x38) > > #define RING_CTL(base) _MMIO((base) + 0x3c) > > #define RING_CTL_SIZE(size) ((size) - PAGE_SIZE) /* in > > bytes -> pages */ > > +#define RING_CNTR(base) _MMIO((base) + 0x178) > > +#define GEN8_WATCHDOG_ENABLE 0 > > +#define GEN8_WATCHDOG_DISABLE 1 > > +#define GEN8_XCS_WATCHDOG_DISABLE 0xFFFFFFFF /* GEN8 & non-render > > only */ > > +#define RING_THRESH(base) _MMIO((base) + 0x17C) > > #define RING_SYNC_0(base) _MMIO((base) + 0x40) > > #define RING_SYNC_1(base) _MMIO((base) + 0x44) > > #define RING_SYNC_2(base) _MMIO((base) + 0x48) > > @@ -2925,6 +2930,7 @@ enum i915_power_well_id { > > #define GT_BSD_USER_INTERRUPT (1 << 12) > > #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1 (1 << 11) /* > > hsw+; rsvd on snb, ivb, vlv */ > > #define GT_CONTEXT_SWITCH_INTERRUPT (1 << 8) > > +#define GT_GEN8_WATCHDOG_INTERRUPT (1 << 6) /* gen8+ */ > > #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT (1 << 5) /* > > !snb */ > > #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT (1 << 4) > > #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT (1 << 3) > > diff --git a/drivers/gpu/drm/i915/i915_reset.c > > b/drivers/gpu/drm/i915/i915_reset.c > > index 861fe083e383..739fa5ad1a8d 100644 > > --- a/drivers/gpu/drm/i915/i915_reset.c > > +++ b/drivers/gpu/drm/i915/i915_reset.c > > @@ -1208,6 +1208,26 @@ void i915_clear_error_registers(struct > > drm_i915_private *dev_priv) > > } > > } > > > > +void engine_reset_error_to_str(struct drm_i915_private *i915, > > + char *msg, > > + size_t sz, > > + unsigned int hung, > > + unsigned int stuck, > > + unsigned int watchdog) > > +{ > > + int len; > > + unsigned int tmp; > > + struct intel_engine_cs *engine; > > + > > + len = scnprintf(msg, sz, > > + "%s on ", watchdog ? "watchdog timeout" : > > + stuck == hung ? "no_progress" : > > "hang"); > > + for_each_engine_masked(engine, i915, hung, tmp) > > + len += scnprintf(msg + len, sz - len, > > + "%s, ", engine->name); > > + msg[len-2] = '\0'; > > +} > > + > > /** > > * i915_handle_error - handle a gpu error > > * @i915: i915 device private > > diff --git a/drivers/gpu/drm/i915/i915_reset.h > > b/drivers/gpu/drm/i915/i915_reset.h > > index 16f2389f656f..8582d1242248 100644 > > --- a/drivers/gpu/drm/i915/i915_reset.h > > +++ b/drivers/gpu/drm/i915/i915_reset.h > > @@ -20,6 +20,12 @@ void i915_handle_error(struct drm_i915_private > > *i915, > > u32 engine_mask, > > unsigned long flags, > > const char *fmt, ...); > > +void engine_reset_error_to_str(struct drm_i915_private *i915, > > + char *str, > > + size_t sz, > > + unsigned int hung, > > + unsigned int stuck, > > + unsigned int watchdog); > > #define I915_ERROR_CAPTURE BIT(0) > > > > void i915_clear_error_registers(struct drm_i915_private *i915); > > diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c > > b/drivers/gpu/drm/i915/intel_engine_cs.c > > index 652c1b3ba190..88cf0fc07623 100644 > > --- a/drivers/gpu/drm/i915/intel_engine_cs.c > > +++ b/drivers/gpu/drm/i915/intel_engine_cs.c > > @@ -1149,6 +1149,7 @@ void intel_engines_park(struct > > drm_i915_private *i915) > > /* Flush the residual irq tasklets first. */ > > intel_engine_disarm_breadcrumbs(engine); > > tasklet_kill(&engine->execlists.tasklet); > > + tasklet_kill(&engine->execlists.watchdog_tasklet); > > > > /* > > * We are committed now to parking the engines, make > > sure there > > diff --git a/drivers/gpu/drm/i915/intel_engine_types.h > > b/drivers/gpu/drm/i915/intel_engine_types.h > > index b0aa1f0d4e47..c4f66b774e7c 100644 > > --- a/drivers/gpu/drm/i915/intel_engine_types.h > > +++ b/drivers/gpu/drm/i915/intel_engine_types.h > > @@ -124,6 +124,11 @@ struct intel_engine_execlists { > > */ > > struct tasklet_struct tasklet; > > > > + /* > > + * @watchdog_tasklet: stop counter and reschedule > > hangcheck_work asap > > + */ > > + struct tasklet_struct watchdog_tasklet; > > + > > /** > > * @default_priolist: priority list for I915_PRIORITY_NORMAL > > */ > > diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c > > b/drivers/gpu/drm/i915/intel_hangcheck.c > > index 57ed49dc19c4..4bf26863678c 100644 > > --- a/drivers/gpu/drm/i915/intel_hangcheck.c > > +++ b/drivers/gpu/drm/i915/intel_hangcheck.c > > @@ -220,22 +220,15 @@ static void hangcheck_declare_hang(struct > > drm_i915_private *i915, > > unsigned int hung, > > unsigned int stuck) > > { > > - struct intel_engine_cs *engine; > > char msg[80]; > > - unsigned int tmp; > > - int len; > > + size_t len = sizeof(msg); > > > > /* If some rings hung but others were still busy, only > > * blame the hanging rings in the synopsis. > > */ > > if (stuck != hung) > > hung &= ~stuck; > > - len = scnprintf(msg, sizeof(msg), > > - "%s on ", stuck == hung ? "no progress" : > > "hang"); > > - for_each_engine_masked(engine, i915, hung, tmp) > > - len += scnprintf(msg + len, sizeof(msg) - len, > > - "%s, ", engine->name); > > - msg[len-2] = '\0'; > > + engine_reset_error_to_str(i915, msg, len, hung, stuck, 0); > > > > return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s", > > msg); > > } > > diff --git a/drivers/gpu/drm/i915/intel_lrc.c > > b/drivers/gpu/drm/i915/intel_lrc.c > > index e54e0064b2d6..85785a94f6ae 100644 > > --- a/drivers/gpu/drm/i915/intel_lrc.c > > +++ b/drivers/gpu/drm/i915/intel_lrc.c > > @@ -2195,6 +2195,40 @@ static int gen8_emit_flush_render(struct > > i915_request *request, > > return 0; > > } > > > > +static void gen8_watchdog_tasklet(unsigned long data) > > +{ > > + struct intel_engine_cs *engine = (struct intel_engine_cs > > *)data; > > + struct drm_i915_private *dev_priv = engine->i915; > > + enum forcewake_domains fw_domains; > > + char msg[80]; > > + size_t len = sizeof(msg); > > + unsigned long *lock = &engine->i915->gpu_error.flags; > > + unsigned int bit = I915_RESET_ENGINE + engine->id; > > + > > + switch (engine->class) { > > + default: > > + MISSING_CASE(engine->id); > > + /* fall through */ > > + case RENDER_CLASS: > > + fw_domains = FORCEWAKE_RENDER; > > + break; > > + case VIDEO_DECODE_CLASS: > > + case VIDEO_ENHANCEMENT_CLASS: > > + fw_domains = FORCEWAKE_MEDIA; > > + break; > > + } > > + > > + intel_uncore_forcewake_get(dev_priv, fw_domains); > > + > > + if (!test_and_set_bit(bit, lock)) { > > + unsigned int hung = engine->mask; > > + engine_reset_error_to_str(dev_priv, msg, len, hung, 0, > > 1); > > + i915_reset_engine(engine, msg); > > + clear_bit(bit, lock); > > + wake_up_bit(lock, bit); > > + } > > +} > > + > > /* > > * Reserve space for 2 NOOPs at the end of each request to be > > * used as a workaround for not being allowed to do lite > > @@ -2377,6 +2411,21 @@ logical_ring_default_irqs(struct > > intel_engine_cs *engine) > > > > engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift; > > engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift; > > + > > + switch (engine->class) { > > + default: > > + /* BCS engine does not support hw watchdog */ > > + break; > > + case RENDER_CLASS: > > + case VIDEO_DECODE_CLASS: > > + engine->irq_keep_mask |= GT_GEN8_WATCHDOG_INTERRUPT << > > shift; > > + break; > > + case VIDEO_ENHANCEMENT_CLASS: > > + if (INTEL_GEN(engine->i915) >= 9) > > + engine->irq_keep_mask |= > > + GT_GEN8_WATCHDOG_INTERRUPT << shift; > > + break; > > + } > > } > > > > static int > > @@ -2394,6 +2443,9 @@ logical_ring_setup(struct intel_engine_cs > > *engine) > > tasklet_init(&engine->execlists.tasklet, > > execlists_submission_tasklet, (unsigned > > long)engine); > > > > + tasklet_init(&engine->execlists.watchdog_tasklet, > > + gen8_watchdog_tasklet, (unsigned long)engine); > > + > > logical_ring_default_vfuncs(engine); > > logical_ring_default_irqs(engine); > > > > _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx