Re: [PATCH 7/7] drm/i915: Stop parking the signaler around reset

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Wed, 16 May 2018 10:58:58 +0100

Quoting Tvrtko Ursulin (2018-05-16 10:49:34)
> 
> On 16/05/2018 10:25, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-05-16 10:15:34)
> >>
> >> On 16/05/2018 07:47, Chris Wilson wrote:
> >>> We cannot call kthread_park() from softirq context, so let's avoid it
> >>> entirely during the reset. We wanted to suspend the signaler so that it
> >>> would not mark a request as complete at the same time as we marked it as
> >>> being in error. Instead of parking the signaling, stop the engine from
> >>> advancing so that the GPU doesn't emit the breadcrumb for our chosen
> >>> "guilty" request.
> >>>
> >>> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> >>> Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>
> >>> Cc: Michał Winiarski <michal.winiarski@xxxxxxxxx>
> >>> CC: Michel Thierry <michel.thierry@xxxxxxxxx>
> >>> Cc: Jeff McGee <jeff.mcgee@xxxxxxxxx>
> >>> Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> >>> ---
> > 
> >>> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> >>> index af5a178366ed..bb88a92fcc1e 100644
> >>> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> >>> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> >>> @@ -531,8 +531,26 @@ static int init_ring_common(struct intel_engine_cs *engine)
> >>>        return ret;
> >>>    }
> >>>    
> >>> +static void set_stop_engine(struct intel_engine_cs *engine)
> >>> +{
> >>> +     struct drm_i915_private *dev_priv = engine->i915;
> >>> +     const u32 base = engine->mmio_base;
> >>> +     const i915_reg_t mode = RING_MI_MODE(base);
> >>> +
> >>> +     I915_WRITE_FW(mode, _MASKED_BIT_ENABLE(STOP_RING));
> >>> +     if (__intel_wait_for_register_fw(dev_priv,
> >>> +                                      mode, MODE_IDLE, MODE_IDLE,
> >>> +                                      1000, 0,
> >>> +                                      NULL))
> >>> +             DRM_DEBUG_DRIVER("%s: timed out on STOP_RING\n",
> >>> +                              engine->name);
> >>> +}
> >>
> >> Looks to be exactly the same implementation as in intel_lrc.c apart from
> >> debug vs trace. Move to intel_engine_cs.c?
> > 
> > Not unless you also suggest a name I like ;)
> > 
> > Mika once had plans for engine->stop() so I didn't think too hard about
> > this bit, expecting it to be superseded.
> 
> Vfunc, or intel_engine_stop, anything but two of the same.
> 
> Timed out message should also possibly even be upgraded to notice level.

It doesn't make any difference unless an error occurs later. The
STOP_RING will time out, we know it does. MODE_IDLE will only be set on
the next arbitration point (aiui) :(

Such cases are usually only resolved by wedging as the reset itself also
fail.

> >>>    static struct i915_request *reset_prepare(struct intel_engine_cs *engine)
> >>>    {
> >>> +     if (INTEL_GEN(engine->i915) >= 3)
> >>> +             set_stop_engine(engine);
> >>> +
> >>>        if (engine->irq_seqno_barrier)
> >>>                engine->irq_seqno_barrier(engine);
> >>>    
> >>>
> >>
> >> Most important question - is stopping the engine expect to mostly work
> >> with a palette of different hang types?
> > 
> > The good news is that if the engine is hung, it doesn't matter! So what
> > it reliably stops is Command Streamer execution along the ring, and we
> > only need to rely on it stopping before the MI_STORE_DWORD of the
> > breadcrumb to be sure that we won't see the breadcrumb advance as we
> > process reset.
> > 
> > Now what's stopping the breadcrumb still being in flight (from the GPU)
> > as we sample it (from the CPU)?
> > Not much.
> 
> If it reliably stops the command streamer then that sounds good enough 
> to me.
> 
> The in-flight write should be unlikely, since we detected a hang that 
> means any activity happened long time ago.

Rule of thumb is to always use "hang" ;)

- we may have falsely declared a hang and the gpu was just slow

- another engine is hung, this one is chugging along oblivious

However, the uncached read is just what I would use to allow the write
from the GPU to be visible to the CPU, so maybe just add another?
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx