On gen9, we see an effect where, when we perform an element switch just
as the first context completes execution, that switch takes twice as
long, as if it first reloads the completed context. That is, we observe
the cost of context1 -> idle -> context1 -> context2 as being twice the
cost of the same operation on gen8. The impact of this is incredibly
rare outside of microbenchmarks focused on assessing the throughput of
context switches.

Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
Cc: Michał Winiarski <michal.winiarski@xxxxxxxxx>
---
I think this is a microbenchmark too far, as there is no real-world
impact: both the unlikelihood of submission at that precise point in
time, and the context switch being a significant fraction of the batch
runtime, make the effect minuscule in practice. It is also not
foolproof even for gem_ctx_switch:

kbl
  ctx1 -> idle -> ctx2: ~25us
  ctx1 -> idle -> ctx1 -> ctx2 (unpatched): ~53us
  ctx1 -> idle -> ctx1 -> ctx2 (patched): 30-40us

bxt
  ctx1 -> idle -> ctx2: ~40us
  ctx1 -> idle -> ctx1 -> ctx2 (unpatched): ~80us
  ctx1 -> idle -> ctx1 -> ctx2 (patched): 60-70us

So consider this more of a plea for ideas: why does bdw behave better?
Are we missing a flag, a fox or a chicken?
-Chris
---
 drivers/gpu/drm/i915/intel_lrc.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 36050f085071..682268d4249d 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -711,6 +711,24 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 
 			GEM_BUG_ON(last->hw_context == rq->hw_context);
 
+			/*
+			 * Avoid reloading the previous context if we
+			 * know it has just completed and we want
+			 * to switch over to a new context. The CS
+			 * interrupt is likely waiting for us to
+			 * release the local irq lock and so we will
+			 * proceed with the submission momentarily,
+			 * which is quicker than reloading the context
+			 * on the gpu.
+			 */
+			if (!submit &&
+			    intel_engine_signaled(engine,
+						  last->global_seqno)) {
+				GEM_BUG_ON(!list_is_first(&rq->sched.link,
+							  &p->requests));
+				return;
+			}
+
 			if (submit)
 				port_assign(port, last);
 			port++;
-- 
2.18.0