On 18/12/15 12:28, Chris Wilson wrote:
On Fri, Dec 18, 2015 at 11:59:41AM +0000, Tvrtko Ursulin wrote:
From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
We can rely on the context complete interrupt to wake up the waiters,
except in the case where requests are merged into a single ELSP
submission. In that case we inject MI_USER_INTERRUPTs into the
ring buffer to ensure prompt wake-ups.
On the GLBenchmark Egypt off-screen test, for example, this
optimization decreases the number of generated interrupts per
second by a factor of two, and context switches by a factor of
five to six.
I half like it. Are the interrupts a limiting factor in this case though?
This should be ~100 waits/second with ~1000 batches/second, right? What
is the delay between request completion and client wakeup - difficult to
measure after you remove the user interrupt though! But I estimate it
should be on the order of just a few GPU cycles.
Neither of the two benchmarks I ran (trex onscreen and egypt offscreen)
shows any framerate improvement.
The only thing I did manage to measure is that CPU energy usage goes
down with the optimisation, roughly 8-10%, measured with the RAPL
script someone posted here.
Benchmarking is generally very hard, so it is a pity we don't have a
farm similar to CI which does it all in a repeatable and solid manner.
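For reference, the RAPL measurement mentioned above can be reproduced with something along these lines. This is a userspace sketch, not part of the patch: the helper names are made up, and it assumes the standard powercap sysfs counter at /sys/class/powercap/intel-rapl:0/energy_uj.

```c
#include <stdio.h>

/* Illustrative helper: average package power between two readings of the
 * RAPL energy_uj counter (microjoules), taken dt_s seconds apart.
 * Real code must also handle the counter wrapping at max_energy_range_uj. */
static double rapl_watts(long long e0_uj, long long e1_uj, double dt_s)
{
	return (double)(e1_uj - e0_uj) / 1e6 / dt_s;
}

/* Read the cumulative package energy counter; returns -1 on error
 * (file missing, no permission, or unparsable contents). */
static long long rapl_read_uj(void)
{
	FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
	long long uj = -1;

	if (f) {
		if (fscanf(f, "%lld", &uj) != 1)
			uj = -1;
		fclose(f);
	}
	return uj;
}
```

Sampling the counter before and after a benchmark run and dividing the delta by the elapsed time gives the average package power, which is how a delta like the 8-10% above would be obtained.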
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 27f06198a51e..d9be878dbde7 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -359,6 +359,13 @@ static void execlists_elsp_write(struct drm_i915_gem_request *rq0,
spin_unlock(&dev_priv->uncore.lock);
}
+static void execlists_emit_user_interrupt(struct drm_i915_gem_request *req)
+{
+ struct intel_ringbuffer *ringbuf = req->ringbuf;
+
+ iowrite32(MI_USER_INTERRUPT, ringbuf->virtual_start + req->tail - 8);
+}
+
static int execlists_update_context(struct drm_i915_gem_request *rq)
{
struct intel_engine_cs *ring = rq->ring;
@@ -433,6 +440,12 @@ static void execlists_context_unqueue(struct intel_engine_cs *ring)
cursor->elsp_submitted = req0->elsp_submitted;
list_move_tail(&req0->execlist_link,
&ring->execlist_retired_req_list);
+ /*
+ * When merging requests make sure there is still
+ * something after each batch buffer to wake up waiters.
+ */
+ if (cursor != req0)
+ execlists_emit_user_interrupt(req0);
You may have already missed this instruction by the time you patch it,
and keep doing so as long as the context is resubmitted. I think to be
safe, you need to patch cursor as well. You could then MI_NOOP out the
MI_USER_INTERRUPT on the terminal request.
I don't at the moment see how it could miss it? We don't do preemption,
but granted I don't understand this code fully.
But patching it out definitely looks safer. And I even don't have to
unbreak GuC in that case. So I'll try that approach.
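In outline, the patch-it-out variant would rewrite the reserved dword in place. A minimal sketch of that approach (always emit the interrupt, NOOP it out on the terminal request), with a flat uint32_t array and the function name standing in for the real ringbuffer accessors:

```c
#include <stdint.h>
#include <stddef.h>

/* From the i915 command encoding: MI_INSTR(opcode, flags) puts the
 * opcode in bits 31:23, so MI_USER_INTERRUPT is opcode 0x02. */
#define MI_NOOP           0x00000000u
#define MI_USER_INTERRUPT (0x02u << 23)

/* Sketch: every request reserves an MI_USER_INTERRUPT slot two dwords
 * (8 bytes) before its tail. Non-terminal requests of a merged ELSP
 * submission keep it, so their waiters get a prompt wake-up; on the
 * terminal request it is patched back to MI_NOOP, since the context
 * complete interrupt already covers it. The uint32_t array stands in
 * for ringbuf->virtual_start. */
static void patch_user_interrupt(uint32_t *ring, size_t tail_bytes,
				 int terminal)
{
	ring[tail_bytes / 4 - 2] = terminal ? MI_NOOP : MI_USER_INTERRUPT;
}
```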
An interesting igt experiment I think would be:
thread A, keep queuing batches with just a single MI_STORE_DWORD_IMM *addr
thread B, waits on batch from A, reads *addr (asynchronously), measures
latency (actual value - expected(batch))
Run for 10s, report min/max/median latency.
Repeat for more threads/contexts and more waiters. Ah, that may be the
demonstration for the thundering herd I've been looking for!
Hm I'll think about it.
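The min/max/median report at the end of that experiment could be computed along these lines once the per-batch latencies are collected; the GPU-side plumbing (queueing the MI_STORE_DWORD_IMM batches and timestamping the *addr reads) is omitted, and the struct and function names are made up:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct lat_stats {
	uint64_t min, max, median;
};

static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return x < y ? -1 : x > y;
}

/* Reduce n latency samples (e.g. nanoseconds between the expected and
 * actual *addr value becoming visible to the waiter) to the
 * min/max/median report suggested above. Sorts a private copy so the
 * caller's sample buffer is left intact. */
static struct lat_stats lat_report(const uint64_t *samples, size_t n)
{
	struct lat_stats s = { 0 };
	uint64_t *sorted = malloc(n * sizeof(*sorted));

	if (!sorted || !n)
		return s;

	memcpy(sorted, samples, n * sizeof(*sorted));
	qsort(sorted, n, sizeof(*sorted), cmp_u64);
	s.min = sorted[0];
	s.max = sorted[n - 1];
	s.median = sorted[n / 2];
	free(sorted);
	return s;
}
```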
Wrt your second reply, that is an interesting question.
All I can tell is that, empirically, the interrupts do appear to arrive
split, otherwise there would be no reduction in interrupt numbers. But
why they are split I don't know.
I'll try adding some counters to get a feel for how often that happens
in various scenarios.
Regards,
Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx