Re: [PATCH 5/8] drm/i915: Double check hangcheck.seqno after reset

Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx> · Mon, 03 Oct 2016 16:14:39 +0300

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes:

> Check that there was not a late recovery between us declaring the GPU
> hung and processing the reset. If the GPU did recover by itself, let the
> request remain on the active list and see if it hangs again!
>

Did you see this in action? It makes sense to recheck after the
reset. I don't remember how TDR deals with multiple resets on the
same engine, but we should start tracking the seqno that caused the
reset and make sure we don't get stuck replaying the same request.
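
To make the tracking idea concrete, here is a rough userspace sketch.
Everything in it (struct reset_tracker, reset_tracker_record(),
reset_tracker_stuck()) is hypothetical and made up for illustration;
none of it exists in i915 today. It only remembers which seqno was
blamed for the last reset and how many times in a row that happened:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-engine bookkeeping; not an existing i915 structure. */
struct reset_tracker {
	uint32_t last_seqno;   /* seqno blamed for the most recent reset */
	unsigned int repeats;  /* consecutive resets blamed on last_seqno */
	bool valid;            /* false until the first reset is recorded */
};

/*
 * Record a reset blamed on @seqno. Returns how many consecutive resets
 * have now been blamed on that same seqno (1 for the first one).
 */
static unsigned int reset_tracker_record(struct reset_tracker *t,
					 uint32_t seqno)
{
	assert(t);
	if (t->valid && t->last_seqno == seqno) {
		t->repeats++;
	} else {
		t->last_seqno = seqno;
		t->repeats = 1;
		t->valid = true;
	}
	return t->repeats;
}

/* True once the same seqno has been blamed at least @limit times. */
static bool reset_tracker_stuck(const struct reset_tracker *t,
				unsigned int limit)
{
	assert(t);
	return t->valid && t->repeats >= limit;
}
```

The reset path would call reset_tracker_record() with the guilty seqno
and consult reset_tracker_stuck() before replaying; the limit is a
policy choice I have not thought hard about.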

Do we check the banning on resubmission, and/or do we trust that
the breadcrumb update always succeeds?

I envision that if we get multiple resets on the same seqno, we
just write the breadcrumbs through the CPU and move on. But let's
hope we don't need to and the GPU breadcrumbs are always enough.
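
As a sketch of what that CPU fallback could look like, assuming we
already know how many resets were blamed on the seqno: the names
below (struct fake_engine, complete_from_cpu(), MAX_RESETS_PER_SEQNO)
are invented for this mail and the threshold is arbitrary. The only
part borrowed from the driver is the wrap-safe signed comparison for
32-bit seqnos:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_RESETS_PER_SEQNO 2 /* assumed threshold, purely illustrative */

/* Hypothetical model: breadcrumb is the last seqno seen as completed. */
struct fake_engine {
	uint32_t breadcrumb;
};

/* Wrap-safe "a is at or after b" comparison for 32-bit seqnos. */
static bool seqno_passed(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) >= 0;
}

/*
 * If @seqno has been blamed for too many resets and the GPU still has
 * not written its breadcrumb, write it from the CPU so we can move on.
 * Returns true only when the CPU performed the write.
 */
static bool complete_from_cpu(struct fake_engine *e, uint32_t seqno,
			      unsigned int resets_on_seqno)
{
	assert(e);
	if (resets_on_seqno < MAX_RESETS_PER_SEQNO)
		return false; /* give the GPU another chance to replay */
	if (seqno_passed(e->breadcrumb, seqno))
		return false; /* GPU breadcrumb landed after all */
	e->breadcrumb = seqno;
	return true;
}
```

The second check is the same spirit as this patch: if the breadcrumb
already moved past the guilty seqno, there is nothing to force.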

Regardless, it's an improvement and should weed out false positives
on some hangs.

Reviewed-by: Mika Kuoppala <mika.kuoppala@xxxxxxxxx>
-Mika

> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxx>
> ---
>  drivers/gpu/drm/i915/i915_gem.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 0cae8acdf906..a89a88922448 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2589,6 +2589,9 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
>  		return;
>  
>  	ring_hung = engine->hangcheck.score >= HANGCHECK_SCORE_RING_HUNG;
> +	if (engine->hangcheck.seqno != intel_engine_get_seqno(engine))
> +		ring_hung = false;
> +
>  	i915_set_reset_status(request->ctx, ring_hung);
>  	if (!ring_hung)
>  		return;
> -- 
> 2.9.3
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx