Re: [PATCH v2] drm/i915/lrc: Scrub the GPU state of the guilty hanging request

Michel Thierry <michel.thierry@xxxxxxxxx> · Fri, 27 Apr 2018 15:30:06 -0700

On 4/27/2018 1:35 PM, Chris Wilson wrote:
Quoting Michel Thierry (2018-04-27 21:27:46)
On 4/27/2018 1:24 PM, Chris Wilson wrote:
Previously, we just reset the ring register in the context image such
that we could skip over the broken batch and emit the closing
breadcrumb. However, on resume the context image and GPU state would be
reloaded, which may have been left in an inconsistent state by the
reset. The presumption was that at worst it would just cause another
reset and skip again until it recovered, however it seems just as likely
to cause an unrecoverable hang. Instead of risking loading an incomplete
context image, restore it back to the default state.

v2: Fix up off-by-one from including the ppHWSP in with the register
state.
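
For illustration only, the idea in the commit message can be modelled in
userspace roughly as below. This is a sketch, not the patch: the names
PAGE_SZ, STATE_PAGE and scrub_context_image() are hypothetical, and it
assumes a layout in which the first page of the context image is the
ppHWSP and the register state follows it. Under that assumption, the
copy has to start one page in and cover context_size minus that page,
which is where an off-by-one like the one fixed in v2 could creep in.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ    4096u /* hypothetical stand-in for the kernel's PAGE_SIZE */
#define STATE_PAGE 1u    /* assumption: page 0 is the ppHWSP, state starts at page 1 */

/*
 * Overwrite the register state of a context image with the default
 * state, leaving the first page (the per-process HWSP) untouched.
 * Both buffers are assumed to be context_size bytes long.
 */
static void scrub_context_image(unsigned char *image,
                                const unsigned char *defaults,
                                size_t context_size)
{
	/* There must be something beyond the ppHWSP page to restore. */
	assert(context_size > STATE_PAGE * PAGE_SZ);

	memcpy(image + STATE_PAGE * PAGE_SZ,
	       defaults + STATE_PAGE * PAGE_SZ,
	       context_size - STATE_PAGE * PAGE_SZ);
}
```

Copying context_size bytes from the same starting offset would run one
page past the end of both buffers, hence the subtraction.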

Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>
Cc: Michał Winiarski <michal.winiarski@xxxxxxxxx>
Cc: Michel Thierry <michel.thierry@xxxxxxxxx>
Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>

Reviewed-by: Michel Thierry <michel.thierry@xxxxxxxxx>

Does it need a 'Fixes:' tag, or does it have a bugzilla reference?

I suspect it's rare enough that the unrecoverable hang might not be
recognisable in bugzilla. I was just looking at

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4108/fi-bsw-n3050/dmesg0.log

trying to think of ways in which the reset might appear to work but the
recovery still fail, with

<7>[  521.765114] missed_breadcrumb vecs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x5a/0x80 [i915]
<7>[  521.765176] missed_breadcrumb 	current seqno e4e, last e4f, hangcheck e4e [2048 ms], inflight 1
<7>[  521.765191] missed_breadcrumb 	Reset count: 0 (global 0)
<7>[  521.765206] missed_breadcrumb 	Requests:
<7>[  521.765223] missed_breadcrumb 		first  e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[  521.765239] missed_breadcrumb 		last   e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[  521.765256] missed_breadcrumb 		active e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[  521.765274] missed_breadcrumb 		[head 3900, postfix 3930, tail 3948, batch 0x00000000_00042000]
<7>[  521.765289] missed_breadcrumb 		ring->start:  0x008ef000
<7>[  521.765301] missed_breadcrumb 		ring->head:   0x000038f8
<7>[  521.765313] missed_breadcrumb 		ring->tail:   0x00003948
<7>[  521.765325] missed_breadcrumb 		ring->emit:   0x00003950
<7>[  521.765337] missed_breadcrumb 		ring->space:  0x00002618
<7>[  521.765372] missed_breadcrumb 	RING_START: 0x008ef000
<7>[  521.765389] missed_breadcrumb 	RING_HEAD:  0x000038f8
<7>[  521.765404] missed_breadcrumb 	RING_TAIL:  0x00003948
<7>[  521.765422] missed_breadcrumb 	RING_CTL:   0x00003001
<7>[  521.765438] missed_breadcrumb 	RING_MODE:  0x00000000
<7>[  521.765453] missed_breadcrumb 	RING_IMR: fffffefe
<7>[  521.765473] missed_breadcrumb 	ACTHD:  0x00000000_022039b8
<7>[  521.765492] missed_breadcrumb 	BBADDR: 0x00000000_00042004
<7>[  521.765511] missed_breadcrumb 	DMA_FADDR: 0x00000000_008f28f8
<7>[  521.765537] missed_breadcrumb 	IPEIR: 0x00000000
<7>[  521.765552] missed_breadcrumb 	IPEHR: 0x11000011
<7>[  521.765570] missed_breadcrumb 	Execlist status: 0x00044032 00000002
<7>[  521.765586] missed_breadcrumb 	Execlist CSB read 1 [1 cached], write 2 [2 from hws], interrupt posted? no, tasklet queued? no (enabled)
<7>[  521.765604] missed_breadcrumb 	Execlist CSB[2]: 0x00000001 [0x00000001 in hwsp], context: 0 [0 in hwsp]
<7>[  521.765619] missed_breadcrumb 		ELSP[0] count=1, rq: e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
<7>[  521.765632] missed_breadcrumb 		ELSP[1] idle
<7>[  521.765645] missed_breadcrumb 		HW active? 0x1
<7>[  521.765660] missed_breadcrumb 		E e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
<7>[  521.765670] missed_breadcrumb 		Queue priority: -2147483648
<7>[  521.765684] missed_breadcrumb 	gem_sync [3112] waiting for e4f
<7>[  521.765697] missed_breadcrumb IRQ? 0x1 (breadcrumbs? yes) (execlists? no)
<7>[  521.765707] missed_breadcrumb HWSP:
<7>[  521.765723] missed_breadcrumb 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765733] missed_breadcrumb *
<7>[  521.765747] missed_breadcrumb 00000040 00000001 00000000 00000018 00000002 00000001 00000000 00000018 00000002
<7>[  521.765760] missed_breadcrumb 00000060 00000001 00000000 00000018 00000002 00000000 00000000 00000000 00000002
<7>[  521.765774] missed_breadcrumb 00000080 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765784] missed_breadcrumb *
<7>[  521.765809] missed_breadcrumb 000000c0 00000e4e 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765823] missed_breadcrumb 000000e0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765833] missed_breadcrumb *
<7>[  521.765845] missed_breadcrumb Idle? no

Of particular note: the IPEHR is MI_LRI, the ring is idle (it hasn't
moved on from the earlier reset) and the fetch address is unconnected
to the rings, so naturally I assume it died loading the context image
on resume.
Plus it is a bsw...
Agreed, this looks like an issue during the ctx restore.
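
For anyone wanting to repeat the reading of the dump above, here is a
small C sketch of the decoding. It rests on a few assumptions that are
not stated in the thread: that MI commands have zero in bits 31:29 and
carry their opcode in bits 28:23, that MI_LOAD_REGISTER_IMM is opcode
0x22 with a low-byte length field of 2*n - 1 for n registers, that
RING_CTL encodes the ring size as (size - one page) in bits 20:12, and
that the "fetch address" mentioned above refers to ACTHD. The helper
names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MI_OPCODE_SHIFT 23
#define MI_OPCODE_LRI   0x22u       /* assumed opcode of MI_LOAD_REGISTER_IMM */
#define RING_SIZE_MASK  0x001ff000u /* assumed RING_CTL size field, bits 20:12 */

/* True if a command header (e.g. the IPEHR value) looks like MI_LRI. */
static bool is_mi_lri(uint32_t header)
{
	return (header >> 29) == 0 &&
	       ((header >> MI_OPCODE_SHIFT) & 0x3f) == MI_OPCODE_LRI;
}

/* Number of register/value pairs, assuming the length field is 2*n - 1. */
static unsigned int lri_num_regs(uint32_t header)
{
	return ((header & 0xff) + 1) / 2;
}

/* Ring size in bytes, assuming RING_CTL stores (size - 4096). */
static uint32_t ring_size_from_ctl(uint32_t ctl)
{
	return (ctl & RING_SIZE_MASK) + 0x1000u;
}

/* True if addr lies within [start, start + size). */
static bool in_range(uint64_t addr, uint64_t start, uint64_t size)
{
	return addr >= start && addr - start < size;
}
```

With the values from the log, is_mi_lri(0x11000011) holds, and
in_range(0x022039b8, 0x008ef000, ring_size_from_ctl(0x00003001)) does
not, i.e. the lower half of ACTHD falls outside the ring. Under the same
assumptions, DMA_FADDR (0x008f28f8) is RING_START plus RING_HEAD.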

-Chris

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx