Re: [PATCH 2/2] drm/i915: Force reset on unready engine

Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx> · Mon, 13 Aug 2018 12:58:10 +0300

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes:

> Quoting Mika Kuoppala (2018-08-13 09:18:07)
>> Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes:
>> 
>> > Quoting Mika Kuoppala (2018-08-10 15:00:36)
>> >> If engine reports that it is not ready for reset, we
>> >> give up. Evidence shows that forcing a per engine reset
>> >> on an engine which is not reporting to be ready for reset,
>> >> can bring it back into a working order. There is risk that
>> >> we corrupt the context image currently executing on that
>> >> engine. But that is a risk worth taking as if we unblock
>> >> the engine, we prevent a whole device wedging in a case
>> >> of full gpu reset.
>> >> 
>> >> Reset individual engine even if it reports that it is not
>> >> prepared for reset, but only if we aim for full gpu reset
>> >> and not on first reset attempt.
>> >> 
>> >> v2: force reset only on later attempts, readability (Chris)
>> >> 
>> >> Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
>> >> Signed-off-by: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>
>> >> ---
>> >>  drivers/gpu/drm/i915/intel_uncore.c | 49 +++++++++++++++++++++++------
>> >>  1 file changed, 39 insertions(+), 10 deletions(-)
>> >> 
>> >> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
>> >> index 027d14574bfa..d24026839b17 100644
>> >> --- a/drivers/gpu/drm/i915/intel_uncore.c
>> >> +++ b/drivers/gpu/drm/i915/intel_uncore.c
>> >> @@ -2099,9 +2099,6 @@ static int gen8_reset_engine_start(struct intel_engine_cs *engine)
>> >>                                            RESET_CTL_READY_TO_RESET,
>> >>                                            700, 0,
>> >>                                            NULL);
>> >> -       if (ret)
>> >> -               DRM_ERROR("%s: reset request timeout\n", engine->name);
>> >> -
>> >>         return ret;
>> >>  }
>> >>  
>> >> @@ -2113,6 +2110,42 @@ static void gen8_reset_engine_cancel(struct intel_engine_cs *engine)
>> >>                       _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
>> >>  }
>> >>  
>> >> +static int reset_engines(struct drm_i915_private *i915,
>> >> +                        unsigned int engine_mask,
>> >> +                        unsigned int retry)
>> >> +{
>> >> +       if (INTEL_GEN(i915) >= 11)
>> >> +               return gen11_reset_engines(i915, engine_mask);
>> >> +       else
>> >> +               return gen6_reset_engines(i915, engine_mask, retry);
>> >> +}
>> >> +
>> >> +static int gen8_prepare_engine_for_reset(struct intel_engine_cs *engine,
>> >> +                                        unsigned int retry)
>> >> +{
>> >> +       const bool force_reset_on_non_ready = retry >= 1;
>> >> +       int ret;
>> >> +
>> >> +       ret = gen8_reset_engine_start(engine);
>> >> +
>> >> +       if (ret && force_reset_on_non_ready) {
>> >> +               /*
>> >> +                * Try to unblock a single non-ready engine by risking
>> >> +                * context corruption.
>> >> +                */
>> >> +               ret = reset_engines(engine->i915,
>> >> +                                   intel_engine_flag(engine),
>> >> +                                   retry);
>> >> +               if (!ret)
>> >> +                       ret = gen8_reset_engine_start(engine);
>> >> +
>> >> +               DRM_ERROR("%s: reset request timeout, forcing reset (%d)\n",
>> >> +                         engine->name, ret);
>> >
>> > This looks dubious now ;)
>> >
>> > After the force you then do a reset in the caller. Twice the reset for
>> > twice the unpreparedness.
>> 
>> It is intentional. First we make the engine unstuck and then do
>> a full cycle with ready to reset involved. I don't know if
>> it really matters tho. It could be that the engine is already
>> in pristine condition after first.
>
> Looks extremely weird way of going about it. If you want to do a double
> reset after the first try fails, try

The crux is: failing to reset or failing to preparing to reset.
It is the ready for reset dance not working that we are trying to
by doing extra per engine reset. So trying in the higher lever,
again with the same preconditions would not help.

-Mika

>
> diff --git a/drivers/gpu/drm/i915/i915_reset.c b/drivers/gpu/drm/i915/i915_reset.c
> index 4e5826045cbf..6fe137d7d455 100644
> --- a/drivers/gpu/drm/i915/i915_reset.c
> +++ b/drivers/gpu/drm/i915/i915_reset.c
> @@ -450,6 +450,8 @@ int intel_gpu_reset(struct drm_i915_private *i915, unsigned int engine_mask)
>                         GEM_TRACE("engine_mask=%x\n", engine_mask);
>                         preempt_disable();
>                         ret = reset(i915, engine_mask);
> +                       if (retry > 0 && !ret) /* Double check reset worked */
> +                               ret = reset(i915, engine_mask);
>                         preempt_enable();
>                 }
>                 if (ret != -ETIMEDOUT || engine_mask != ALL_ENGINES)
>
> as a separate patch.
>
> P.S. you really need to review the i915_reset.c patch ;)
> -Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx