Quoting Mika Kuoppala (2019-11-04 14:27:21) > Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes: > > > Sometimes Icelake forgets to reset the CSB pointers on a GPU reset, > > leading to carry on updating the old tail of the buffer. > > > > <0>[ 618.138490] i915_sel-5636 3d..1 673425465us : trace_ports: vecs0: submit { 14de2:504, 0:0 } > > <0>[ 618.138490] i915_sel-5636 3.... 673425493us : intel_engine_reset: vecs0 flags=100 > > <0>[ 618.138490] i915_sel-5636 3.... 673425493us : execlists_reset_prepare: vecs0: depth<-0 > > <0>[ 618.138490] i915_sel-5636 3.... 673425493us : intel_engine_stop_cs: vecs0 > > <0>[ 618.138490] i915_sel-5636 3.... 673425523us : __intel_gt_reset: engine_mask=40 > > <0>[ 618.138490] i915_sel-5636 3.... 673425568us : execlists_reset: vecs0 > > <0>[ 618.138490] i915_sel-5636 3d..1 673425568us : process_csb: vecs0 cs-irq head=1, tail=2 > > <0>[ 618.138490] i915_sel-5636 3d..1 673425568us : process_csb: vecs0 csb[2]: status=0x00000001:0x40000000 > > <0>[ 618.138490] i915_sel-5636 3d..1 673425569us : trace_ports: vecs0: promote { 14de2:504*, 0:0 } > > <0>[ 618.138490] i915_sel-5636 3d..1 673425570us : __i915_request_reset: vecs0 rq=14de2:504, guilty? yes > > <0>[ 618.138490] i915_sel-5636 3d..1 673425571us : __execlists_reset: vecs0 replay {head:2de0, tail:2e48} > > <0>[ 618.138490] i915_sel-5636 3d..1 673425572us : __i915_request_unsubmit: vecs0 fence 14de2:504, current 503 > > <0>[ 618.138490] i915_sel-5636 3.... 673435544us : intel_engine_cancel_stop_cs: vecs0 > > <0>[ 618.138490] i915_sel-5636 3.... 673435544us : process_csb: vecs0 cs-irq head=11, tail=11 > > <0>[ 618.138490] i915_sel-5636 3d..1 673435545us : __i915_request_submit: vecs0 fence 14de2:504, current 503 > > <0>[ 618.138490] i915_sel-5636 3d..1 673435546us : __execlists_submission_tasklet: vecs0: queue_priority_hint:-2147483648, submit:yes > > <0>[ 618.138490] i915_sel-5636 3d..1 673435548us : trace_ports: vecs0: submit { 14de2:504*, 0:0 } > > <0>[ 618.138490] i915_sel-5636 3.... 673435549us : execlists_reset_finish: vecs0: depth->0 > > <0>[ 618.138490] ksoftirq-21 2..s. 673435592us : process_csb: vecs0 cs-irq head=11, tail=3 > > <0>[ 618.138490] ksoftirq-21 2..s. 673435593us : process_csb: vecs0 csb[0]: status=0x00000001:0x40000000 > > <0>[ 618.138490] ksoftirq-21 2..s. 673435594us : trace_ports: vecs0: promote { 14de2:504*, 0:0 } > > <0>[ 618.138490] ksoftirq-21 2..s. 673435596us : process_csb: vecs0 csb[1]: status=0x00000018:0x40000040 > > <0>[ 618.138490] ksoftirq-21 2..s. 673435597us : trace_ports: vecs0: completed { 14de2:504*, 0:0 } > > <0>[ 618.138490] ksoftirq-21 2..s. 673435612us : process_csb: process_csb:2188 GEM_BUG_ON(!i915_request_completed(*execlists->active) && !reset_in_progress(execlists)) > > > > After the reset, we do another clflush before checking the CSB to be > > sure we see whatever was left in the CSB prior to the reset. So it is > > unlikely to be an incoherent view of the CSB, and more likely that > > Icelake didn't reset its pointers. > > > > References: 582a6f90aa0d ("drm/i915/execlists: Add a paranoid flush of the CSB pointers upon reset") > > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > > Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx> > > --- > > drivers/gpu/drm/i915/gt/intel_lrc.c | 8 ++++++++ > > 1 file changed, 8 insertions(+) > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c > > index 8d79a965f341..b869a9c7b3c6 100644 > > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c > > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c > > @@ -2996,6 +2996,14 @@ static void reset_csb_pointers(struct intel_engine_cs *engine) > > WRITE_ONCE(*execlists->csb_write, reset_value); > > wmb(); /* Make sure this is visible to HW (paranoia?) */ > > > > + /* > > + * Sometimes Icelakes forgets to reset its pointers on a GPU reset. > > + * Bludgeon them with a mmio update to be sure. > > + */ > > + ENGINE_WRITE(engine, RING_CONTEXT_STATUS_PTR, > > + reset_value << 8 | reset_value); > > + ENGINE_POSTING_READ(engine, RING_CONTEXT_STATUS_PTR); > > + > > Big hammer indeed. Next one is that we assert the reset value > and check that first one after reset comes out with 0x0? Yeah, I was thinking about that. Didn't seem trivial since it's not always a single event. Plus then rare reset code is then spreading into the hotpath :( -Chris _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx