Re: [CI 03/20] drm/i915: Delay queuing hangcheck to wait-request

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Thu, 19 May 2016 13:52:20 +0100

On Thu, May 19, 2016 at 01:34:55PM +0100, Tvrtko Ursulin wrote:
> 
> On 19/05/16 12:32, Chris Wilson wrote:
> >We can forgo queuing the hangcheck from the start of every request to
> >until we wait upon a request. This reduces the overhead of every
> >request, but may increase the latency of detecting a hang. Howeever, if
> >nothing every waits upon a hang, did it ever hang? It also improves the
> 
> Maybe it is better to detect and reset sooner rather than later in
> case the hang is burning power and may now only get looked at at
> some random time in the future? It is probably quite weak argument
> but it came to mind.

Yes. The counter argument is that it runs for seconds anyway until a
hang is found and for quick detection and recovery we have a watchdog
that we aren't using yet (but one day will switch to). So what I have in
mind here is to radically simplify the hangcheck and base it entirely on
whether the waiter (or the scheduler) is making progress. At some point,
we always need to wait on the GPU for a resource. (And then it just
becomes a plain DoS detector as a fallback to fine grained recovery.)

> >robustness of the wait-request by ensuring that the hangchecker is
> >indeed running before we sleep indefinitely (and thereby ensuring that
> >we never actually sleep forever waiting for a dead GPU).
> >
> >Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> >---
> >  drivers/gpu/drm/i915/i915_gem.c |  9 +++++----
> >  drivers/gpu/drm/i915/i915_irq.c | 10 ++++------
> >  2 files changed, 9 insertions(+), 10 deletions(-)
> >
> >diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >index 24cab8802c2e..06013b9fbc6a 100644
> >--- a/drivers/gpu/drm/i915/i915_gem.c
> >+++ b/drivers/gpu/drm/i915/i915_gem.c
> >@@ -1310,6 +1310,9 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
> >  			break;
> >  		}
> >
> >+		/* Ensure that even if the GPU hangs, we get woken up. */
> >+		i915_queue_hangcheck(dev_priv);
> >+
> >  		timer.function = NULL;
> >  		if (timeout || missed_irq(dev_priv, engine)) {
> >  			unsigned long expire;
> >@@ -2654,8 +2657,6 @@ void __i915_add_request(struct drm_i915_gem_request *request,
> >  	/* Not allowed to fail! */
> >  	WARN(ret, "emit|add_request failed: %d!\n", ret);
> >
> >-	i915_queue_hangcheck(engine->i915);
> >-
> >  	queue_delayed_work(dev_priv->wq,
> >  			   &dev_priv->mm.retire_work,
> >  			   round_jiffies_up_relative(HZ));
> >@@ -2999,8 +3000,8 @@ i915_gem_retire_requests(struct drm_i915_private *dev_priv)
> >
> >  	if (idle)
> >  		mod_delayed_work(dev_priv->wq,
> >-				   &dev_priv->mm.idle_work,
> >-				   msecs_to_jiffies(100));
> >+				 &dev_priv->mm.idle_work,
> >+				 msecs_to_jiffies(100));
> >
> >  	return idle;
> >  }
> >diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> >index f0d941455bed..4818fcb5e960 100644
> >--- a/drivers/gpu/drm/i915/i915_irq.c
> >+++ b/drivers/gpu/drm/i915/i915_irq.c
> >@@ -3148,10 +3148,10 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
> >  	intel_uncore_arm_unclaimed_mmio_detection(dev_priv);
> >
> >  	for_each_engine_id(engine, dev_priv, id) {
> >+		bool busy = waitqueue_active(&engine->irq_queue);
> >  		u64 acthd;
> >  		u32 seqno;
> >  		unsigned user_interrupts;
> >-		bool busy = true;
> >
> >  		semaphore_clear_deadlocks(dev_priv);
> >
> >@@ -3174,12 +3174,11 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
> >  		if (engine->hangcheck.seqno == seqno) {
> >  			if (ring_idle(engine, seqno)) {
> >  				engine->hangcheck.action = HANGCHECK_IDLE;
> >-				if (waitqueue_active(&engine->irq_queue)) {
> >+				if (busy) {
> >  					/* Safeguard against driver failure */
> >  					user_interrupts = kick_waiters(engine);
> >  					engine->hangcheck.score += BUSY;
> >-				} else
> >-					busy = false;
> >+				}
> >  			} else {
> >  				/* We always increment the hangcheck score
> >  				 * if the ring is busy and still processing
> >@@ -3253,9 +3252,8 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
> >  		goto out;
> >  	}
> >
> >+	/* Reset timer in case GPU hangs without another request being added */
> >  	if (busy_count)
> >-		/* Reset timer case chip hangs without another request
> >-		 * being added */
> >  		i915_queue_hangcheck(dev_priv);
> >
> >  out:
> >
> 
> With a disclaimer that I don't really know hangcheck - it looks that
> previously it was getting re-queued all the time until the engine
> when idle, regardless of if there were waiters. With this change it
> looks like that is gone. If there are no waiters it won't get
> re-queued. How does incrementing the hangcheck score work now?

It gets reset between waits (i.e. the hangcheck is idle as would now be
the case for GPU idle) and then incremented for as long as someone is
waiting until it fires. In practice, it doesn't alter the discovery rate
at all.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx