Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes:

> If the engine isn't being retired (worker starvation?) then it is
> possible for us to repeatedly observe that between consecutive
> hangchecks the seqno on the ring to be the same and there remain
> unretired requests. Ignore these completely and only regard the engine
> as busy for the purpose of hang detection (not stall detection) if there
> are outstanding breadcrumbs.
>
> In recent history we have looked at using both the request and seqno as
> indication of activity on the engine, but that was reduced to just
> inspecting seqno in commit cffa781e5907 ("drm/i915: Simplify check for
> idleness in hangcheck"). However, in commit dcff85c8443e ("drm/i915:
> Enable i915_gem_wait_for_idle() without holding struct_mutex"), I made
> the decision to use the new common lockless function, under the
> assumption that request retirement was more frequent than hangcheck and
> so we would not have a stuck busy check. The flaw there was in
> forgetting that we accumulate the hang score, and so successive checks
> seeing a stuck request, albeit with the GPU advancing elsewhere and so
> not necessarily the same stuck request, would eventually trigger the hang.
>
> Fixes: dcff85c8443e ("drm/i915: Enable i915_gem_wait_for_idle()...")
> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxx>
> ---
>  drivers/gpu/drm/i915/i915_irq.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index ebb83d5a448b..7610eca4f687 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -3079,6 +3079,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>  		bool busy = intel_engine_has_waiter(engine);
>  		u64 acthd;
>  		u32 seqno;
> +		u32 submit;
>
>  		semaphore_clear_deadlocks(dev_priv);
>
> @@ -3094,9 +3095,10 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>
>  		acthd = intel_engine_get_active_head(engine);
>  		seqno = intel_engine_get_seqno(engine);
> +		submit = READ_ONCE(engine->last_submitted_seqno);
>
>  		if (engine->hangcheck.seqno == seqno) {
> -			if (!intel_engine_is_active(engine)) {
> +			if (i915_seqno_passed(seqno, submit)) {

Setting of busy could be moved into the inner scope. The check could
also be seqno == submit, with a warning if we see the seqno on the engine
past submit.

But the patch fixes what it says it does,

Reviewed-by: Mika Kuoppala <mika.kuoppala@xxxxxxxxx>

>  				engine->hangcheck.action = HANGCHECK_IDLE;
>  				if (busy) {
>  					/* Safeguard against driver failure */
> --
> 2.9.3
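
For anyone following the thread, here is a rough userspace sketch of the check being discussed. It is not the driver code: struct toy_engine, toy_hangcheck() and the scoring policy are hypothetical names and simplifications made up for illustration, and seqno_passed() only assumes that i915_seqno_passed() is a wrap-safe "at or after" comparison on 32-bit seqnos. The point is that an unchanged seqno only adds to the hang score when something submitted is still outstanding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed semantics: true if a is at or after b, tolerating u32 wraparound. */
static bool seqno_passed(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) >= 0;
}

/* Hypothetical stand-in for the per-engine state the hangcheck looks at. */
struct toy_engine {
	uint32_t hw_seqno;        /* last seqno the hardware completed */
	uint32_t last_submitted;  /* last seqno handed to the hardware */
	uint32_t hangcheck_seqno; /* seqno seen by the previous check */
	int score;                /* accumulated hang score (toy policy) */
};

enum toy_action { TOY_IDLE, TOY_ACTIVE, TOY_STALLED };

/* One hangcheck pass over a single engine. */
static enum toy_action toy_hangcheck(struct toy_engine *e)
{
	enum toy_action action;
	uint32_t seqno, submit;

	assert(e != NULL);
	seqno = e->hw_seqno;
	submit = e->last_submitted;

	if (e->hangcheck_seqno == seqno) {
		if (seqno_passed(seqno, submit)) {
			/* Everything submitted has completed: idle, no penalty. */
			action = TOY_IDLE;
		} else {
			/* Work is outstanding and the seqno did not move. */
			action = TOY_STALLED;
			e->score++;
		}
	} else {
		/* The seqno advanced; this sketch simply clears the score. */
		action = TOY_ACTIVE;
		e->score = 0;
	}

	e->hangcheck_seqno = seqno;
	return action;
}
```

With hw_seqno == last_submitted the pass reports idle however many unretired requests the software side still holds, which is the case the commit message wants ignored. The stricter seqno == submit variant suggested in the review would replace the seqno_passed() call and flag the "past submit" case separately.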