Re: [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+

Carlos Santa <carlos.santa@xxxxxxxxx> · Fri, 01 Mar 2019 18:08:52 -0800

On Fri, 2019-03-01 at 09:36 +0000, Chris Wilson wrote:
> Quoting Carlos Santa (2019-02-21 02:58:16)
> > +#define GEN8_WATCHDOG_1000US(dev_priv) watchdog_to_clock_counts(dev_priv, 1000)
> > +static void gen8_watchdog_irq_handler(unsigned long data)
> > +{
> > +       struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
> > +       struct drm_i915_private *dev_priv = engine->i915;
> > +       unsigned int hung = 0;
> > +       u32 current_seqno=0;
> > +       char msg[80];
> > +       unsigned int tmp;
> > +       int len;
> > +
> > +       /* Stop the counter to prevent further timeout interrupts */
> > +       I915_WRITE_FW(RING_CNTR(engine->mmio_base), get_watchdog_disable(engine));
> > +
> > +       /* Read the heartbeat seqno once again to check if we are stuck? */
> > +       current_seqno = intel_engine_get_hangcheck_seqno(engine);
> 
> I have said this before, but this doesn't exist either, it's just a
> temporary glitch in the matrix.

That was my only way to check for the "guilty" seqno right before
resetting during smoke testing. I will reach out again before sending a
new rev to cross-check the new approach you mentioned today.

> 
> > +    if (current_seqno == engine->current_seqno) {
> > +               hung |= engine->mask;
> > +
> > +               len = scnprintf(msg, sizeof(msg), "%s on ", "watchdog timeout");
> > +               for_each_engine_masked(engine, dev_priv, hung, tmp)
> > +                       len += scnprintf(msg + len, sizeof(msg) - len,
> > +                                        "%s, ", engine->name);
> > +               msg[len-2] = '\0';
> > +
> > +               i915_handle_error(dev_priv, hung, 0, "%s", msg);
> > +
> > +               /* Reset timer in case GPU hangs without another request being added */
> > +               i915_queue_hangcheck(dev_priv);
> 
> You still haven't explained why we are not just resetting the engine
> immediately. Have you looked at the preempt-timeout patches that need
> to do the same thing from timer-irq context?
> 
> Resending the same old stuff over and over again is just exasperating.
> -Chris

Oops, I made the wrong assumption: I honestly thought that removing the
workqueue from v3 would allow for an immediate reset. Thanks for the
feedback and the pointer to the preempt-timeout series; I will rework this.

Carlos

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx