Re: [PATCH] drm/i915: Save hangcheck score across resets

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Thu, 6 Oct 2016 08:10:19 +0100

On Thu, Oct 06, 2016 at 10:00:09AM +0300, Mika Kuoppala wrote:
> Hangcheck score has been zeroed on engine init, which happens
> after reset recovery. This has worked well as we always reset
> all engines on hang, and also discarded all work submitted
> to engines.
> 
> With commit 821ed7df6e2a ("drm/i915: Update reset path to fix
> incomplete requests") the driver gained the capability to discard only
> the request or requests that were directly involved with the hang,
> while those deemed innocent were replayed intact.
> 
> Our hangcheck works by periodically sampling the engine state and
> then doing checks in multiple stages to see if the engine is making
> progress. The engine capabilities differ. With the render engine, we
> have more ways to measure progress and thus more checks and
> stages. With other engines, we only sample the seqno and head movement.
> 
> Now consider that the blitter engine is waiting on render and the render
> engine has a batch which is stuck. Due to its simpler checks, the blitter
> engine's hangcheck score accumulates faster and reaches the reset threshold
> sooner. We also blame the blitter for the hang as it had the highest score
> when recovery started.

This is the bug. It shouldn't accumulate any score in this case as the
engine is not active.
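
As a minimal sketch of that rule (not the i915 code; the struct, the
function, the 'active' flag and the score constants are all hypothetical
names and values picked for illustration), an engine that is not running
anything itself is skipped, and only an active engine stuck on the same
seqno gains score:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values, chosen only for illustration. */
#define HANG_SCORE_STEP   5
#define HANG_SCORE_LIMIT  20

/* Hypothetical per-engine hangcheck state. */
struct engine_hangcheck {
	uint32_t last_seqno;	/* seqno seen at the previous sample */
	int score;		/* accumulated suspicion */
};

/*
 * Sample one engine. 'active' is assumed to mean that the engine itself is
 * executing a request, as opposed to being idle or waiting on another
 * engine. Returns true when the score has reached the reset threshold.
 */
static bool hangcheck_sample(struct engine_hangcheck *hc,
			     uint32_t seqno, bool active)
{
	assert(hc);

	if (!active) {
		/* Not running anything: never accumulate, never blame. */
		hc->last_seqno = seqno;
		return false;
	}

	if (seqno != hc->last_seqno)
		hc->score = 0;	/* progress; a simplification of any decay */
	else
		hc->score += HANG_SCORE_STEP;	/* active and stuck */

	hc->last_seqno = seqno;
	return hc->score >= HANG_SCORE_LIMIT;
}
```

With these numbers, a blitter that only waits on render stays at score 0
however often it is sampled, while a stuck render engine is flagged on its
fourth stalled sample.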

This patch is not the right approach for the issue as described here,
because as soon as the blitter engine is active again, there is a very
real danger of it being declared guilty and reset.
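
To put hypothetical numbers on that danger (the step, the limit and the
helper below are illustrative assumptions, not values or code from the
driver), count how many stalled samples an active engine needs before it
reaches the limit, depending on the score it starts with:

```c
#include <assert.h>

#define HANG_SCORE_STEP   5	/* hypothetical */
#define HANG_SCORE_LIMIT  20	/* hypothetical */

/*
 * Number of consecutive stalled samples needed for an active engine to
 * reach the limit, given the score it starts with.
 */
static int samples_until_reset(int starting_score)
{
	int samples = 0;

	assert(starting_score >= 0);

	while (starting_score < HANG_SCORE_LIMIT) {
		starting_score += HANG_SCORE_STEP;
		samples++;
	}
	return samples;
}
```

Starting from zero the engine needs four stalled samples to be flagged; with
a score of 15 carried over from before the reset, a single stalled sample
is enough.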

The patch has merit, but not for this issue...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre