Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes:

> On Thu, May 18, 2017 at 05:28:41PM +0300, Mika Kuoppala wrote:
>> ELK seems to very picky about the preconditions to reset.
>> Evidence on Eaglelake (8086:2e12 (rev 03)) shows that it does
>> not like if reset occurs when there is active ring.
>>
>> Ville found out that there is workaround with name
>> 'WaMediaResetMainRingCleanup' which suggests that we need to
>> cleanup rings before resetting. It is unclear what cleanup
>> exactly means but evidence shows that stopping the ring
>> does have an effect on reset reliability. This patch makes
>> reset succesful on hangs induced by chained batches (the igt ones).
>> Note that if the hang is inside a shader, it is possible
>> that our attempts to stop the ring achieves anything.
>>
>> v2: zero ctl,head,tail also. bug ref. use driver debugs (Chris)
>>
>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100942
>> Testcase: igt/gem_busy/*-hang
>> Testcase: igt/gem_ringfill/hang-*
>
> Maybe add # elk to these to indicate the problem isn't quite that
> widespread!
>
>> Suggested-by: Ville Syrjälä <ville.syrjala@xxxxxxxxxxxxxxx>
>> Cc: Ville Syrjälä <ville.syrjala@xxxxxxxxxxxxxxx>
>> Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
>> Cc: Tomi Sarvela <tomi.p.sarvela@xxxxxxxxx>
>> Signed-off-by: Mika Kuoppala <mika.kuoppala@xxxxxxxxx>
>> ---
>>  drivers/gpu/drm/i915/intel_uncore.c | 35 +++++++++++++++++++++++++++++++++++
>>  1 file changed, 35 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
>> index 7eaaf2225e1a..43da84be0321 100644
>> --- a/drivers/gpu/drm/i915/intel_uncore.c
>> +++ b/drivers/gpu/drm/i915/intel_uncore.c
>> @@ -1427,6 +1427,35 @@ int i915_reg_read_ioctl(struct drm_device *dev,
>>  	return ret;
>>  }
>>
>> +static void gen3_stop_rings(struct drm_i915_private *dev_priv)
>> +{
>> +	struct intel_engine_cs *engine;
>> +	enum intel_engine_id id;
>> +
>> +	for_each_engine(engine, dev_priv, id) {
>> +		const u32 base = engine->mmio_base;
>> +		const i915_reg_t mode = RING_MI_MODE(base);
>> +
>> +		I915_WRITE_FW(mode, _MASKED_BIT_ENABLE(STOP_RING));
>> +		if (intel_wait_for_register_fw(dev_priv,
>> +					       mode,
>> +					       MODE_IDLE,
>> +					       MODE_IDLE,
>> +					       500))
>> +			DRM_DEBUG_DRIVER("%s: timed out on STOP_RING\n",
>> +					 engine->name);
>> +
>> +		I915_WRITE_FW(RING_CTL(base), 0);
>> +		I915_WRITE_FW(RING_HEAD(base), 0);
>> +		I915_WRITE_FW(RING_TAIL(base), 0);
>> +
>> +		/* Check acts as a post */
>> +		if (I915_READ_FW(RING_HEAD(base)) != 0)
>> +			DRM_DEBUG_DRIVER("%s: ring head not parked\n",
>> +					 engine->name);
>> +	}
>> +}
>> +
>>  static bool i915_reset_complete(struct pci_dev *pdev)
>>  {
>>  	u8 gdrst;
>> @@ -1472,6 +1501,12 @@ static int g4x_do_reset(struct drm_i915_private *dev_priv, unsigned engine_mask)
>>  	I915_WRITE(VDECCLK_GATE_D, I915_READ(VDECCLK_GATE_D) | VCP_UNIT_CLOCK_GATE_DISABLE);
>>  	POSTING_READ(VDECCLK_GATE_D);
>>
>> +	/* We stop engines, otherwise we might get failed reset and a
>> +	 * dead gpu (on elk).
>> +	 */
>> +	/* WaMediaResetMainRingCleanup:ctg,elk (supposedly) */
>
> Join this into a single comment block, s/supposedly/presumably/
>
> Just a small concern we have some duplication of stop_ring() here, but I
> don't have a better suggestion (along the lines of export intel_stop_ring,
> gen3_engine_stop_ring, so far looks more confusing than helpful).

I had a patch which piggybacked on engine->reset_hw(engine, NULL) to do the
dirty work of stopping the ring. But stop_ring() in intel_ringbuffer.c
was giving up halfway if it did not succeed in idling the ring, leaving
head/tail intact. And that was on the prepare reset path. The boon was
that it stopped the rings before killing the tasklet.

But I decided to take a more surgical approach directly on top of reset.
If we find another gen which is susceptible, we might want to either
piggyback on reset_hw or add an engine->stop_ring() and use it in the
prepare for reset path.

> As you have tested with DRM_ERROR to be sure that fear about this simply
> timing out for our spinning batches, it looks good to me.

Spinning batches in general seem to go idle nicely, but gem_ringfill will
spew out that ring_stop timeout debug.

> Reviewed-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>

Thanks for the review. Patch pushed.

-Mika
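
For anyone reading this thread later, here is a rough userspace sketch of
the sequence the patch performs per engine: a masked write of STOP_RING,
a bounded poll for MODE_IDLE, then zeroing ctl/head/tail whether or not
the ring went idle. It is not driver code. struct fake_ring, the poll
counter and all function names are made up for illustration, and the bit
positions of STOP_RING and MODE_IDLE are assumptions that should be
checked against i915_reg.h.

```c
#include <stdint.h>

/* Assumed bit positions, for illustration only (see i915_reg.h). */
#define STOP_RING (1u << 8)
#define MODE_IDLE (1u << 9)

/* Masked register write: the upper 16 bits select which lower bits change. */
#define MASKED_BIT_ENABLE(bit) (((bit) << 16) | (bit))

/* Hypothetical stand-in for one engine's ring registers. */
struct fake_ring {
	uint32_t mi_mode;
	uint32_t ctl, head, tail;
	int polls_until_idle;	/* reads needed before idle shows up; < 0 = never */
};

/* Apply a masked write: only bits named in the upper half are updated. */
static void write_mi_mode(struct fake_ring *r, uint32_t val)
{
	uint32_t mask = val >> 16;

	r->mi_mode = (r->mi_mode & ~mask) | (val & mask);
}

/* Fake hardware: once STOP_RING is set, MODE_IDLE appears after a few reads. */
static uint32_t read_mi_mode(struct fake_ring *r)
{
	if ((r->mi_mode & STOP_RING) && r->polls_until_idle >= 0) {
		if (r->polls_until_idle == 0)
			r->mi_mode |= MODE_IDLE;
		else
			r->polls_until_idle--;
	}
	return r->mi_mode;
}

/* Bounded poll; returns 0 once MODE_IDLE is seen, -1 on timeout. */
static int wait_for_idle(struct fake_ring *r, int max_polls)
{
	for (int i = 0; i < max_polls; i++)
		if ((read_mi_mode(r) & MODE_IDLE) == MODE_IDLE)
			return 0;
	return -1;
}

/*
 * Stop one ring. A timeout is reported to the caller, but the ring
 * registers are zeroed in either case, so head/tail are never left intact.
 */
static int stop_ring(struct fake_ring *r, int max_polls)
{
	int ret;

	write_mi_mode(r, MASKED_BIT_ENABLE(STOP_RING));
	ret = wait_for_idle(r, max_polls);

	r->ctl = 0;
	r->head = 0;
	r->tail = 0;

	return ret;
}
```

With polls_until_idle = -1 the fake ring never reports idle, which
roughly models the gem_ringfill case: stop_ring() returns -1 but the
registers still end up zeroed.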