Quoting Michel Thierry (2018-04-27 21:27:46)
> On 4/27/2018 1:24 PM, Chris Wilson wrote:
> > Previously, we just reset the ring register in the context image such
> > that we could skip over the broken batch and emit the closing
> > breadcrumb. However, on resume the context image and GPU state would be
> > reloaded, which may have been left in an inconsistent state by the
> > reset. The presumption was that at worst it would just cause another
> > reset and skip again until it recovered, however it seems just as likely
> > to cause an unrecoverable hang. Instead of risking loading an incomplete
> > context image, restore it back to the default state.
> >
> > v2: Fix up off-by-one from including the ppHWSP in with the register
> > state.
> >
> > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>
> > Cc: Michał Winiarski <michal.winiarski@xxxxxxxxx>
> > Cc: Michel Thierry <michel.thierry@xxxxxxxxx>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> > Reviewed-by: Michel Thierry <michel.thierry@xxxxxxxxx>
>
> Does it need a 'Fixes:' tag or has a bugzilla reference?

I suspect it's rare enough that the unrecoverable hang might not be
recognisable in bugzilla.

I was just looking at
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4108/fi-bsw-n3050/dmesg0.log
trying to think of ways in which the reset might appear to work but the
recovery then fail, with

<7>[ 521.765114] missed_breadcrumb vecs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x5a/0x80 [i915]
<7>[ 521.765176] missed_breadcrumb current seqno e4e, last e4f, hangcheck e4e [2048 ms], inflight 1
<7>[ 521.765191] missed_breadcrumb Reset count: 0 (global 0)
<7>[ 521.765206] missed_breadcrumb Requests:
<7>[ 521.765223] missed_breadcrumb first e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[ 521.765239] missed_breadcrumb last e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[ 521.765256] missed_breadcrumb active e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[ 521.765274] missed_breadcrumb [head 3900, postfix 3930, tail 3948, batch 0x00000000_00042000]
<7>[ 521.765289] missed_breadcrumb ring->start: 0x008ef000
<7>[ 521.765301] missed_breadcrumb ring->head: 0x000038f8
<7>[ 521.765313] missed_breadcrumb ring->tail: 0x00003948
<7>[ 521.765325] missed_breadcrumb ring->emit: 0x00003950
<7>[ 521.765337] missed_breadcrumb ring->space: 0x00002618
<7>[ 521.765372] missed_breadcrumb RING_START: 0x008ef000
<7>[ 521.765389] missed_breadcrumb RING_HEAD: 0x000038f8
<7>[ 521.765404] missed_breadcrumb RING_TAIL: 0x00003948
<7>[ 521.765422] missed_breadcrumb RING_CTL: 0x00003001
<7>[ 521.765438] missed_breadcrumb RING_MODE: 0x00000000
<7>[ 521.765453] missed_breadcrumb RING_IMR: fffffefe
<7>[ 521.765473] missed_breadcrumb ACTHD: 0x00000000_022039b8
<7>[ 521.765492] missed_breadcrumb BBADDR: 0x00000000_00042004
<7>[ 521.765511] missed_breadcrumb DMA_FADDR: 0x00000000_008f28f8
<7>[ 521.765537] missed_breadcrumb IPEIR: 0x00000000
<7>[ 521.765552] missed_breadcrumb IPEHR: 0x11000011
<7>[ 521.765570] missed_breadcrumb Execlist status: 0x00044032 00000002
<7>[ 521.765586] missed_breadcrumb Execlist CSB read 1 [1 cached], write 2 [2 from hws], interrupt posted? no, tasklet queued? no (enabled)
<7>[ 521.765604] missed_breadcrumb Execlist CSB[2]: 0x00000001 [0x00000001 in hwsp], context: 0 [0 in hwsp]
<7>[ 521.765619] missed_breadcrumb ELSP[0] count=1, rq: e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
<7>[ 521.765632] missed_breadcrumb ELSP[1] idle
<7>[ 521.765645] missed_breadcrumb HW active? 0x1
<7>[ 521.765660] missed_breadcrumb E e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
<7>[ 521.765670] missed_breadcrumb Queue priority: -2147483648
<7>[ 521.765684] missed_breadcrumb gem_sync [3112] waiting for e4f
<7>[ 521.765697] missed_breadcrumb IRQ? 0x1 (breadcrumbs? yes) (execlists? no)
<7>[ 521.765707] missed_breadcrumb HWSP:
<7>[ 521.765723] missed_breadcrumb 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[ 521.765733] missed_breadcrumb *
<7>[ 521.765747] missed_breadcrumb 00000040 00000001 00000000 00000018 00000002 00000001 00000000 00000018 00000002
<7>[ 521.765760] missed_breadcrumb 00000060 00000001 00000000 00000018 00000002 00000000 00000000 00000000 00000002
<7>[ 521.765774] missed_breadcrumb 00000080 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[ 521.765784] missed_breadcrumb *
<7>[ 521.765809] missed_breadcrumb 000000c0 00000e4e 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[ 521.765823] missed_breadcrumb 000000e0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[ 521.765833] missed_breadcrumb *
<7>[ 521.765845] missed_breadcrumb Idle? no

Of particular note are the IPEHR being MI_LRI, the ring being idle (it
hasn't moved on from the earlier reset) and the fetch address being
unconnected to the rings, so naturally I assume it died loading the
context image on resume.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
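
For anyone wanting to check the MI_LRI reading of IPEHR by hand, here is a
minimal sketch of the decode. It assumes the MI command header layout implied
by i915's MI_INSTR(opcode, flags) macro (client in bits 31:29, opcode in bits
28:23) and that MI_LOAD_REGISTER_IMM(x) is opcode 0x22 with 2*x - 1 in the low
byte. The macro and function names below are made up for illustration and are
not taken from the driver.

```c
#include <stdint.h>

/* Assumed header layout, mirroring MI_INSTR(opcode, flags):
 * bits 31:29 = command client (0 for MI), bits 28:23 = opcode. */
#define SKETCH_MI_CLIENT(hdr)  ((unsigned int)(((hdr) >> 29) & 0x7))
#define SKETCH_MI_OPCODE(hdr)  ((unsigned int)(((hdr) >> 23) & 0x3f))
#define SKETCH_MI_LRI_OPCODE   0x22u

/* Hypothetical helper: returns the number of (register, value) pairs an
 * MI_LRI header announces, or -1 if the dword is not an MI_LRI header.
 * Relies on MI_LOAD_REGISTER_IMM(x) storing 2*x - 1 in the low byte. */
static int sketch_mi_lri_count(uint32_t hdr)
{
	if (SKETCH_MI_CLIENT(hdr) != 0)
		return -1;
	if (SKETCH_MI_OPCODE(hdr) != SKETCH_MI_LRI_OPCODE)
		return -1;
	return (int)(((hdr & 0xff) + 1) / 2);
}
```

Under those assumptions, feeding in the IPEHR value 0x11000011 from the dump
gives opcode 0x22 and a count of 9 register writes, which is consistent with
the engine having last parsed an MI_LRI header.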