Re: [PATCH 2/6] drm/i915: Decouple hang detection from hangcheck period

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes:

> On Tue, Nov 15, 2016 at 04:36:32PM +0200, Mika Kuoppala wrote:
>> +static bool
>> +hangcheck_engine_stall(struct intel_engine_cs *engine,
>> +		       struct intel_engine_hangcheck *hc)
>> +{
>> +	const unsigned long last_action = engine->hangcheck.action_timestamp;
>>  
>> -		memset(&engine->hangcheck.instdone, 0,
>> -		       sizeof(engine->hangcheck.instdone));
>> -		break;
>> +	if (hc->action == HANGCHECK_ACTIVE_SEQNO ||
>> +	    hc->action == HANGCHECK_IDLE)
>> +		return false;
>> +
>> +	if (time_before(jiffies, last_action + HANGCHECK_HUNG_JIFFIES))
>> +		return false;
>> +
>> +	if (time_before(jiffies, last_action + HANGCHECK_STUCK_JIFFIES))
>> +		if (hc->action != HANGCHECK_HUNG)
>> +			return false;
>
> This deserves a lot more explanation. Why two values? What are their
> significance?
>

#define DRM_I915_STUCK_PERIOD_SEC 24 /* No observed seqno progress */
#define DRM_I915_HUNG_PERIOD_SEC 4 /* No observed seqno nor head progress */

We allow bigger timeout if head is moving inside a request. THus two
values. I tried to mimic the behaviour from old code.

> Do we want to be using jiffies? (Values are in seconds, so jiffie
> resoluation makes sense, but add that as a comment somewhere.)

The defines as seconds and the subsequent jiffie counterparts in
i915_drv.h not enough?

>
>> +	return true;
>> +}
>> +
>> +static struct intel_engine_cs *find_lra_engine(struct drm_i915_private *i915,
>> +					       const unsigned int mask)
>
> What is lra?
>

Least recently active. I will opencode the name.

>> +{
>> +	struct intel_engine_cs *engine = NULL, *c;
>> +	enum intel_engine_id id;
>> +
>> +	for_each_engine_masked(c, i915, mask, id) {
>> +		if (engine == NULL) {
>> +			engine = c;
>> +			continue;
>> +		}
>> +
>> +		if (time_before(c->hangcheck.action_timestamp,
>> +				engine->hangcheck.action_timestamp))
>> +			engine = c;
>> +		else if (c->hangcheck.action_timestamp ==
>> +			 engine->hangcheck.action_timestamp &&
>> +			 c->hangcheck.seqno < engine->hangcheck.seqno)
>> +			engine = c;
>
> Why is the last engine significant?

We blame the engine which had work and was inactive for the
longest amount of time. 

> Why is just engine guilty?

Just one? As the stamps are valid cross reset, the next
advacement after this hang will make progress show
on this engine, and the next engine will get caught.

The hangcheck action is a filter to weed out those engines
that can't possibly be culprit for hang.

So we aim to pinpoint only one engine with accuracy and
reset that to lean towards per engine resets.

-Mika

>
>
> Too many whys in this one.
> -Chris
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]
  Powered by Linux