Re: [PATCH v3 11/14] HACK drm/i915/scheduler: emulate a scheduler for guc

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Thu, 1 Dec 2016 12:45:18 +0000

On 01/12/2016 11:18, Chris Wilson wrote:
On Thu, Dec 01, 2016 at 10:45:51AM +0000, Tvrtko Ursulin wrote:

On 14/11/2016 08:57, Chris Wilson wrote:
This emulates execlists on top of the GuC in order to defer submission of
requests to the hardware. This deferral allows time for high priority
requests to gazump their way to the head of the queue, however it nerfs
the GuC by converting it back into a simple execlist (where the CPU has
to wake up after every request to feed new commands into the GuC).

As it is starting to sink in we'll have to do add this hack sooner
or later, review comments below.

Also, would you be OK to rebase this or would prefer to delegate it?

It's been very easy to keep it current.

@@ -665,6 +665,70 @@ static void i915_guc_submit(struct drm_i915_gem_request *rq)
	spin_unlock(&client->wq_lock);
}

Confused me at first here until I noticed engine->submit_request
will be the execlist_submit_request later. Perhaps it would be good
to rename a lot of things now, like engine->request_queue,
intel_engine_submit_request and maybe more?

Looking at the lifecycle of the request and getting clear names for the
phases and their vfuncs would be sensible. It may also be worthwhile to
decide that some are not engine tasks and belong to a new entity
(dev_priv->gt.X ?)

Like submit_request? Makes sense yes.

+static bool i915_guc_dequeue(struct intel_engine_cs *engine)
+{
+	struct execlist_port *port = engine->execlist_port;
+	struct drm_i915_gem_request *last = port[0].request;
+	unsigned long flags;
+	struct rb_node *rb;
+	bool submit = false;
+
+	spin_lock_irqsave(&engine->timeline->lock, flags);
+	rb = engine->execlist_first;
+	while (rb) {
+		struct drm_i915_gem_request *cursor =
+			rb_entry(rb, typeof(*cursor), priotree.node);
+
+		if (last && cursor->ctx != last->ctx) {

Not sure if GVT comes into the picture here, but it does not sounds
like it would harm to use can_merge_ctx here?

I wasn't sure what path GVT would take either. So just went with the
simple version that looked as similar to the current guc submission as
possible. Also offloading the scheduling to the guc via semaphores will
likely make this whole chain look completely different.

Hmm I am not up to speed with that. So you are saying it doesn't make 
sense to unify this?

+			if (port != engine->execlist_port)
+				break;

It may be an overkill for the first version, but I was thinking that
we don't have to limit it to two at a time. And it would depend on
measuring of course. But perhaps it would make sense to do the
generalisation of the number of supported ports straight away.

Definitely. I was just looking at a minimal conversion, hence reusing
the existing tracking, and limits.

Definitely leave it for later, or definitely it makes sense to 
generalise right now? I was just thinking that when someone goes to test 
this and finds the throughput regresses, that it might be easier to just 
say please try i915.guc_submit_ports=8 or something.

We could theoretically share most of the execlist_dequeue and just
do a couple things differently depending on the mode.

Yes, there were just a couple of intrusive warts that I felt justified
in keeping the routines separate. But mostly it was the merge_ctx
decision that looked to be backend specific.

Cool.

Looks like one could be a new engine->submit_ports vfunc. And there
is also the lite restore WA and sw signalling to design in nicely,
but it may be worth sharing the code. It would be renamed to
sometihng like scheduler_dequeue or something.

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index cb8a75f6ca16..18dce4c66d56 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1341,8 +1341,10 @@ static void snb_gt_irq_handler(struct drm_i915_private *dev_priv,
static __always_inline void
gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
{
-	if (iir & (GT_RENDER_USER_INTERRUPT << test_shift))
+	if (iir & (GT_RENDER_USER_INTERRUPT << test_shift)) {
+		tasklet_schedule(&engine->irq_tasklet);

This would be better made conditional on GuC submission just to
calling tasklet_schedule twice (occasionally) in execlist mode.

It's not a huge cost if we do schedule the tasklet twice (in the same
interrupt context at least). Otherwise, yes we are incurring more mmio
reads to determine that the CSB didn't advance. On the other hand, I
really don't want to have if (i915.enable_guc_submission) here. Maybe
if (engine->user_irq_tasklet) tasklet_schedule(engine->user_irq_tasklet)
?

Hm, I don't know. One conditional plus a redundant engine struct member 
or just one conditional. Maybe dev_priv->gt.handle_user_interrupt and 
dev_priv->gt.handle_ctx_switch_interrupt? You won't like the indirection 
I am sure. Go with the user_irq_tasklet then.

Speaking of CSB, the spec keeps hinting that "CSB occupies a whole
cacheline, making the read cheap". I'm not aware of how we can do a UC
to cacheline read in a single transfer. Ideas?  movntdqa I thought was
specifically WC.

Unfortunately I don't know. Low-level x86 is not my strong area. I only 
learnt of movntdqa when Akash and you started using it for example.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx