Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes: > On Tue, Nov 15, 2016 at 04:36:32PM +0200, Mika Kuoppala wrote: >> +static bool >> +hangcheck_engine_stall(struct intel_engine_cs *engine, >> + struct intel_engine_hangcheck *hc) >> +{ >> + const unsigned long last_action = engine->hangcheck.action_timestamp; >> >> - memset(&engine->hangcheck.instdone, 0, >> - sizeof(engine->hangcheck.instdone)); >> - break; >> + if (hc->action == HANGCHECK_ACTIVE_SEQNO || >> + hc->action == HANGCHECK_IDLE) >> + return false; >> + >> + if (time_before(jiffies, last_action + HANGCHECK_HUNG_JIFFIES)) >> + return false; >> + >> + if (time_before(jiffies, last_action + HANGCHECK_STUCK_JIFFIES)) >> + if (hc->action != HANGCHECK_HUNG) >> + return false; > > This deserves a lot more explanation. Why two values? What are their > significance? > #define DRM_I915_STUCK_PERIOD_SEC 24 /* No observed seqno progress */ #define DRM_I915_HUNG_PERIOD_SEC 4 /* No observed seqno nor head progress */ We allow bigger timeout if head is moving inside a request. THus two values. I tried to mimic the behaviour from old code. > Do we want to be using jiffies? (Values are in seconds, so jiffie > resoluation makes sense, but add that as a comment somewhere.) The defines as seconds and the subsequent jiffie counterparts in i915_drv.h not enough? > >> + return true; >> +} >> + >> +static struct intel_engine_cs *find_lra_engine(struct drm_i915_private *i915, >> + const unsigned int mask) > > What is lra? > Least recently active. I will opencode the name. >> +{ >> + struct intel_engine_cs *engine = NULL, *c; >> + enum intel_engine_id id; >> + >> + for_each_engine_masked(c, i915, mask, id) { >> + if (engine == NULL) { >> + engine = c; >> + continue; >> + } >> + >> + if (time_before(c->hangcheck.action_timestamp, >> + engine->hangcheck.action_timestamp)) >> + engine = c; >> + else if (c->hangcheck.action_timestamp == >> + engine->hangcheck.action_timestamp && >> + c->hangcheck.seqno < engine->hangcheck.seqno) >> + engine = c; > > Why is the last engine significant? We blame the engine which had work and was inactive for the longest amount of time. > Why is just engine guilty? Just one? As the stamps are valid cross reset, the next advacement after this hang will make progress show on this engine, and the next engine will get caught. The hangcheck action is a filter to weed out those engines that can't possibly be culprit for hang. So we aim to pinpoint only one engine with accuracy and reset that to lean towards per engine resets. -Mika > > > Too many whys in this one. > -Chris > > -- > Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx