Re: [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Fri, 18 May 2018 13:36:52 +0100

On 18/05/2018 13:28, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2018-05-18 12:50:41)

On 18/05/2018 12:10, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2018-05-18 12:05:17)

On 18/05/2018 11:09, Chris Wilson wrote:
Inside the live_hangcheck (reset) selftests, we occasionally see
failures like

<7>[  239.094840] i915_gem_set_wedged rcs0
<7>[  239.094843] i915_gem_set_wedged         current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
<7>[  239.094846] i915_gem_set_wedged         Reset count: 6239 (global 1)
<7>[  239.094848] i915_gem_set_wedged         Requests:
<7>[  239.095052] i915_gem_set_wedged                 first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
<7>[  239.095056] i915_gem_set_wedged                 last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
<7>[  239.095059] i915_gem_set_wedged                 active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
<7>[  239.095062] i915_gem_set_wedged                 [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
<7>[  239.100050] i915_gem_set_wedged                 ring->start:  0x00283000
<7>[  239.100053] i915_gem_set_wedged                 ring->head:   0x000001f8
<7>[  239.100055] i915_gem_set_wedged                 ring->tail:   0x000002a8
<7>[  239.100057] i915_gem_set_wedged                 ring->emit:   0x000002a8
<7>[  239.100059] i915_gem_set_wedged                 ring->space:  0x00000f10
<7>[  239.100085] i915_gem_set_wedged         RING_START: 0x00283000
<7>[  239.100088] i915_gem_set_wedged         RING_HEAD:  0x00000260
<7>[  239.100091] i915_gem_set_wedged         RING_TAIL:  0x000002a8
<7>[  239.100094] i915_gem_set_wedged         RING_CTL:   0x00000001
<7>[  239.100097] i915_gem_set_wedged         RING_MODE:  0x00000300 [idle]
<7>[  239.100100] i915_gem_set_wedged         RING_IMR: fffffefe
<7>[  239.100104] i915_gem_set_wedged         ACTHD:  0x00000000_0000609c
<7>[  239.100108] i915_gem_set_wedged         BBADDR: 0x00000000_0000609d
<7>[  239.100111] i915_gem_set_wedged         DMA_FADDR: 0x00000000_00283260
<7>[  239.100114] i915_gem_set_wedged         IPEIR: 0x00000000
<7>[  239.100117] i915_gem_set_wedged         IPEHR: 0x02800000
<7>[  239.100120] i915_gem_set_wedged         Execlist status: 0x00044052 00000002
<7>[  239.100124] i915_gem_set_wedged         Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
<7>[  239.100128] i915_gem_set_wedged                 ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
<7>[  239.100132] i915_gem_set_wedged                 ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
<7>[  239.100135] i915_gem_set_wedged                 HW active? 0x5
<7>[  239.100250] i915_gem_set_wedged                 E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
<7>[  239.100338] i915_gem_set_wedged                 E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
<7>[  239.100340] i915_gem_set_wedged                 Queue priority: 139
<7>[  239.100343] i915_gem_set_wedged                 Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
<7>[  239.100346] i915_gem_set_wedged                 Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
<7>[  239.100349] i915_gem_set_wedged                 Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
<7>[  239.100352] i915_gem_set_wedged                 Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
<7>[  239.100356] i915_gem_set_wedged                 Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
<7>[  239.100362] i915_gem_set_wedged         drv_selftest [5894] waiting for 19a99

where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
The RING_MODE indicates that is idle and has the STOP_RING bit set, so
try clearing it.

v2: Only clear the bit on restarting the ring, as we want to be sure the
STOP_RING bit is kept if reset fails on wedging.
v3: Spot when the ring state doesn't make sense when re-initialising the
engine and dump it to the logs so that we don't have to wait for an
error later and try to guess what happened earlier.
v4: Prepare to print all the unexpected state, not just the first.

Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx>
---
    drivers/gpu/drm/i915/intel_lrc.c | 22 ++++++++++++++++++++++
    1 file changed, 22 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 3744f5750624..ba8411ba4abf 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1781,6 +1781,9 @@ static void enable_execlists(struct intel_engine_cs *engine)
                I915_WRITE(RING_MODE_GEN7(engine),
                           _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
    
+     I915_WRITE(RING_MI_MODE(engine->mmio_base),
+                _MASKED_BIT_DISABLE(STOP_RING));

Worries me a bit to clear it unconditionally since documentation says
nothing (that I can find) about this scenario.

+
        I915_WRITE(RING_HWS_PGA(engine->mmio_base),
                   engine->status_page.ggtt_offset);
        POSTING_READ(RING_HWS_PGA(engine->mmio_base));
@@ -1789,6 +1792,19 @@ static void enable_execlists(struct intel_engine_cs *engine)
        engine->execlists.csb_head = -1;
    }
    
+static bool unexpected_starting_state(struct intel_engine_cs *engine)
+{
+     struct drm_i915_private *dev_priv = engine->i915;
+     bool unexpected = false;
+
+     if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING) {
+             DRM_DEBUG_DRIVER("STOP_RING still set in RING_MI_MODE\n");

Also worries me to bury this as a debug message unless we can find in
documentation that this is known to happen occasionally.

If we had the check and logging immediately after reset, with a more
visible log, we could build a larger data set of reported failures and
then go back to hw people and ask them about it.

The problem with the reset side is trying to avoid mixing in the case
where we are faking it and expect it to still be set. On the ringbuffer
path, we handle STOP_RING as being part of the init sequence. The log
already says we did a reset and STOP_RING is set (since we have tracing
through the reset), I just need to find the right spot to handle it.

The "distance" in code paths between reset and init_hw is probably the
smallest of my concerns. If we could make this message not debug only,
and clear STOP_RING only conditionally I think I would be happy with
giving this a go and see if it collects any data.

Why? Who else but us are going to read it, and why bother if it is not
actionable. It is logged (was logged already, now highlighted) and for
the user none of it is relevant, the only thing that matters to them is
that they don't even notice the system suffered a GPU hang.

My reasoning is this - something has happened which is completely 
unexpected and may have other negative effect which are currently 
unknown to us. By logging it at a higher level we collect this now very 
specific signature on a wider collection of platforms and can then act 
on the collected data set. Also very specific log message makes bug 
triage easier and a lower chance it gets bunched up with other hang reports.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx