Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes: > On Tue, Sep 06, 2016 at 02:11:47PM +0300, Mika Kuoppala wrote: >> >> Resending my r-b... > > It's not enough, even retrying the reset a few times, we still > eventually get a timeout. > > This is just the usual > 10: 0xffffffff > 20: goto 10 > hanging batch >> >> Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes: >> >> > If at first we don't succeed, try again. >> > >> > Running the reset and recovery routines in a loop ends in a "reset >> > request timeout" with a mtbf of an hour on Braswell. This is eerily >> > similar to the unrecoverable reset condition that first prompted us to >> > use the reset-request mechanism in commit 7fd2d26921d1 ("drm/i915: Reset >> > request handling for gen8+"). Repeating the reset request makes the >> > failure much harder to reproduce (but there is no reason to believe that >> > it is more than mere paper over a timing or other issue). >> > >> > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> >> > Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxx> >> > Cc: Arun Siluvery <arun.siluvery@xxxxxxxxxxxxxxx> >> > Cc: Daniel Vetter <daniel.vetter@xxxxxxxx> >> > Cc: stable@xxxxxxxxxxxxxxx >> > --- >> > drivers/gpu/drm/i915/intel_uncore.c | 24 +++++++++++++----------- >> > 1 file changed, 13 insertions(+), 11 deletions(-) >> > >> > diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c >> > index e9f68cd56e32..1be8ced03ba5 100644 >> > --- a/drivers/gpu/drm/i915/intel_uncore.c >> > +++ b/drivers/gpu/drm/i915/intel_uncore.c >> > @@ -1688,20 +1688,22 @@ int intel_wait_for_register(struct drm_i915_private *dev_priv, >> > static int gen8_request_engine_reset(struct intel_engine_cs *engine) >> > { >> > struct drm_i915_private *dev_priv = engine->i915; >> > - int ret; >> > + int loop = 3; >> > >> >> retries? >> >> > - I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base), >> > - _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET)); >> > + do { >> > + I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base), >> > + _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET)); >> > >> > - ret = intel_wait_for_register_fw(dev_priv, >> > - RING_RESET_CTL(engine->mmio_base), >> > - RESET_CTL_READY_TO_RESET, >> > - RESET_CTL_READY_TO_RESET, >> > - 700); >> > - if (ret) >> > - DRM_ERROR("%s: reset request timeout\n", engine->name); >> > + if (!intel_wait_for_register_fw(dev_priv, >> > + RING_RESET_CTL(engine->mmio_base), >> > + RESET_CTL_READY_TO_RESET, >> > + RESET_CTL_READY_TO_RESET, >> > + 700)) >> >> >> Did you check between the write didn't stick vs the readyness didn't >> signal? > > The read will flush the write. So reading back the bit we just set shows > us that the hw is not yet ready. > > Shouldn't we also be waiting for the reest bit to clear on cancel? And > verifying that the reset itself did? > Request -> timeout -> cancel and wait 2 lowest bit to be zero and then only after that retry? And would not hurt to check they are zero before continuing after a succesful reset also. -Mika > diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c > index 6b84bd63310c..36d9a485b788 100644 > --- a/drivers/gpu/drm/i915/intel_uncore.c > +++ b/drivers/gpu/drm/i915/intel_uncore.c > @@ -1701,6 +1701,9 @@ static void gen8_unrequest_engine_reset(struct intel_engine_cs *engine) > > I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base), > _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET)); > + intel_wait_for_register_fw(dev_priv, > + RING_RESET_CTL(engine->mmio_base), > + RESET_CTL_READY_TO_RESET, 0, 700); > } > > static int gen8_reset_engines(struct drm_i915_private *dev_priv, > @@ -1708,18 +1711,18 @@ static int gen8_reset_engines(struct drm_i915_private *dev_priv, > { > struct intel_engine_cs *engine; > unsigned int tmp; > + int ret = -EIO; > > for_each_engine_masked(engine, dev_priv, engine_mask, tmp) > if (gen8_request_engine_reset(engine)) > goto not_ready; > > - return gen6_reset_engines(dev_priv, engine_mask); > + ret = gen6_reset_engines(dev_priv, engine_mask); > > -not_ready: > for_each_engine_masked(engine, dev_priv, engine_mask, tmp) > gen8_unrequest_engine_reset(engine); > > - return -EIO; > + return ret; > } > > typedef int (*reset_func)(struct drm_i915_private *, unsigned engine_mask); > > -- > Chris Wilson, Intel Open Source Technology Centre -- To unsubscribe from this list: send the line "unsubscribe stable" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html