Hangcheck score has been zeroed on engine init, which happens after
reset recovery. This worked well while we always reset all engines on
a hang and discarded all work submitted to them.

With commit 821ed7df6e2a ("drm/i915: Update reset path to fix
incomplete requests") the driver gained the capability to discard only
the request or requests that were directly involved in the hang; those
deemed innocent are replayed intact.

Our hangcheck works by periodically sampling the engine state and then
doing checks in multiple stages to see if the engine is making
progress. The engine capabilities differ. With the render engine, we
have more ways to measure progress and thus more checks and stages.
With other engines, we only sample the seqno and head movement.

Now consider that the blitter engine is waiting on render and the
render engine has a batch which is stuck. Due to its simpler checks,
the blitter engine's hangcheck score accumulates faster and reaches
the reset threshold sooner. We also blame the blitter for the hang, as
it had the highest score when recovery started.

Blaming the wrong engine, we don't find the actual guilty request and,
most critically, won't make any progress after the reset. That leads
to a second hang with the same pattern, ad infinitum.

Previously the false blaming of an engine was not critical, as the
score was only used as a trigger for a full reset and as a debug aid
in error states. But now the score is essential for finding the
culprit request.

To fix this, keep the hangcheck scores across resets. We already have
a decay mechanism in place if progress is being made. This ensures
that even if we blame the wrong engine once, we don't do it twice or
consistently: the real culprit request will be cleared, real progress
will be made, and this untangles the rest of the engines and leads to
a successful recovery.
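The scenario above can be reproduced with a small standalone toy
model. This is only a sketch and not driver code: the threshold, the
per-sample increments and the function name are made up for
illustration, and it assumes that an engine is blamed when its own
score has reached the threshold at reset time. The decay on progress
is not modelled.

```c
#include <stdbool.h>
#include <string.h>

enum { RENDER, BLITTER, NUM_ENGINES };

/* Hypothetical values, chosen only to make the race visible. */
#define HUNG_THRESHOLD	30
#define RENDER_STEP	8	/* more check stages: score grows slowly */
#define BLITTER_STEP	10	/* seqno and head only: score grows fast */

/*
 * Returns how many resets it takes until the stuck render batch is
 * discarded, or -1 if that never happens within max_resets.
 */
static int resets_until_recovery(bool keep_scores, int max_resets)
{
	static const int step[NUM_ENGINES] = {
		[RENDER] = RENDER_STEP,
		[BLITTER] = BLITTER_STEP,
	};
	int score[NUM_ENGINES] = { 0 };
	int resets = 0;

	while (resets < max_resets) {
		bool hung = false;
		int i;

		/* One hangcheck sample: neither engine made progress. */
		for (i = 0; i < NUM_ENGINES; i++) {
			score[i] += step[i];
			if (score[i] >= HUNG_THRESHOLD)
				hung = true;
		}
		if (!hung)
			continue;

		resets++;

		/* Assumed rule: an engine at the threshold is blamed. */
		if (score[RENDER] >= HUNG_THRESHOLD)
			return resets;	/* guilty batch discarded */

		/*
		 * Only the blitter was blamed. The render batch is replayed
		 * intact and gets stuck again, so the blitter stalls again.
		 */
		if (!keep_scores)
			memset(score, 0, sizeof(score));	/* old behaviour */
	}

	return -1;
}
```

With these made-up numbers, resets_until_recovery(false, 100) returns
-1, as zeroing the scores lets the blitter win the race every time,
while resets_until_recovery(true, 100) returns 2: the blitter is
blamed once, then the render score carried across the reset reaches
the threshold and the stuck batch is found.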

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98104
Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Signed-off-by: Mika Kuoppala <mika.kuoppala@xxxxxxxxx>
---
 drivers/gpu/drm/i915/intel_engine_cs.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index d00ec805f93d..4bb869eb11bc 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -209,7 +209,6 @@ void intel_engine_init_seqno(struct intel_engine_cs *engine, u32 seqno)
 
 void intel_engine_init_hangcheck(struct intel_engine_cs *engine)
 {
-	memset(&engine->hangcheck, 0, sizeof(engine->hangcheck));
 	clear_bit(engine->id, &engine->i915->gpu_error.missed_irq_rings);
 	if (intel_engine_has_waiter(engine))
 		i915_queue_hangcheck(engine->i915);
-- 
2.7.4

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx