Re: [PATCH 17/21] drm/i915: Convert trace-irq to the breadcrumb waiter

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Wed, 8 Jun 2016 13:44:27 +0100

On 08/06/16 13:34, Chris Wilson wrote:
On Wed, Jun 08, 2016 at 12:47:28PM +0100, Tvrtko Ursulin wrote:

On 08/06/16 12:24, Chris Wilson wrote:
On Wed, Jun 08, 2016 at 11:16:13AM +0100, Tvrtko Ursulin wrote:

On 08/06/16 10:48, Chris Wilson wrote:
On Tue, Jun 07, 2016 at 01:04:22PM +0100, Tvrtko Ursulin wrote:
+static int intel_breadcrumbs_signaler(void *arg)
+{
+	struct intel_engine_cs *engine = arg;
+	struct intel_breadcrumbs *b = &engine->breadcrumbs;
+	struct signal *signal;
+
+	/* Install ourselves with high priority to reduce signalling latency */
+	signaler_set_rtpriority();
+
+	do {
+		set_current_state(TASK_INTERRUPTIBLE);
+
+		/* We are either woken up by the interrupt bottom-half,
+		 * or by a client adding a new signaller. In both cases,
+		 * the GPU seqno may have advanced beyond our oldest signal.
+		 * If it has, propagate the signal, remove the waiter and
+		 * check again with the next oldest signal. Otherwise we
+		 * need to wait for a new interrupt from the GPU or for
+		 * a new client.
+		 */
+		signal = READ_ONCE(b->first_signal);
+		if (signal_complete(signal)) {
+			/* Wake up all other completed waiters and select the
+			 * next bottom-half for the next user interrupt.
+			 */
+			intel_engine_remove_wait(engine, &signal->wait);
+
+			i915_gem_request_unreference(signal->request);
+
+			/* Find the next oldest signal. Note that as we have
+			 * not been holding the lock, another client may
+			 * have installed an even older signal than the one
+			 * we just completed - so double check we are still
+			 * the oldest before picking the next one.
+			 */
+			spin_lock(&b->lock);
+			if (signal == b->first_signal)
+				b->first_signal = rb_next(&signal->node);
+			rb_erase(&signal->node, &b->signals);
+			spin_unlock(&b->lock);
+
+			kfree(signal);
+		} else {
+			if (kthread_should_stop())
+				break;
+
+			schedule();
+		}
+	} while (1);
+
+	return 0;
+}

So the thread is only because it is convenient to plug it in the
breadcrumbs infrastructure. Otherwise the processing above could be
done from a lighter weight context as well since nothing seems to
need the process context.

No, seqno processing requires process/sleepable context. The delays we
incur can be >100us and not suitable for irq/softirq context.

Nothing in this patch needs it - please say in the commit why it is
choosing the process context then.

Bottom half processing requires it. irq_seqno_barrier is not suitable
for irq/softirq context.

Why? Because of a single clflush? How long does that take?

Because both Ironlake and Baytrail require definite delays on the order of
100us. Haswell, Broadwell, Skylake all need an extra delay that we don't
yet have.

Okay, please mention in the commit so the choice is documented.

And why so long delays? It looks pretty lightweight to me.

One alternative could perhaps be to add a waiter->wake_up vfunc and
signalers could then potentially use a tasklet?

Hmm, I did find that in order to reduce execlists latency, I had to
drive the tasklet processing from the signaler.

What do you mean? The existing execlists tasklet? How would that work?

Due to how dma-fence signals, the softirq is never kicked
(spin_lock_irq doesn't handle local_bh_enable()) and so we would only
submit a new task via execlists on a reschedule. That latency added
about 30% (30s on bsw) to gem_exec_parallel.

I don't follow. User interrupts are separate from context complete
which drives the submission. How do fences interfere with the
latter?

The biggest user benchmark (ala sysmark) regression we have for
execlists is the latency in submitting the first request to hardware via
elsp (or at least the hw responding to and executing that batch,
the per-bb and per-ctx w/a are not free either). If we incur extra
latency in the driver in even adding the request to the queue for an
idle GPU, that is easily felt by userspace.

I still don't see how fences tie into that. But it is not so important 
since it was all along the lines of "do we really need a thread".

+int intel_engine_enable_signaling(struct drm_i915_gem_request *request)
+{
+	struct intel_engine_cs *engine = request->engine;
+	struct intel_breadcrumbs *b = &engine->breadcrumbs;
+	struct rb_node *parent, **p;
+	struct signal *signal;
+	bool first, wakeup;
+
+	if (unlikely(IS_ERR(b->signaler)))
+		return PTR_ERR(b->signaler);

I don't see that there is a fallback if kthread creation failed. It
should just fail in intel_engine_init_breadcrumbs if that happens.

Because it is not fatal to using the GPU, just one optional function.

But we never expect it to fail and it is not even dependent on
anything user controllable. Just a random error which would cause
user experience to degrade. If thread creation failed it means the
system is in such poor shape that I would just fail the driver init.

A minimally functional system is better than nothing at all.
GEM is not required for driver loading, interrupt driven dma-fences less
so.

If you are so hot for that, how about vfuncing enable signaling in
that case? Because I find the "have we created our kthread at driver
init time successfully" question for every fence a bit too much.

read + conditional that pulls in the cacheline we want? You can place
the test after the spinlock if you want to avoid the cost I suppose.
Or we just mark the GPU as wedged.

What I meant was to pass in different fence_ops at fence_init time
depending on whether or not the signaler thread was created. If the
driver is meant to stay functional in that case, and
fence->enable_signaling needs to keep returning errors, it sounds like a
much more elegant solution than repeating the check at every
fence->enable_signaling call.
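
To illustrate the idea, here is a minimal userspace sketch, not driver
code. Everything in it is hypothetical: the cut-down struct fence and
struct fence_ops, the *_sketch helpers, the int return type of
enable_signaling, and the fixed -ENOMEM standing in for whatever error
the kthread creation actually returned. The only point is that the
"do we have a signaler thread" decision is taken once, when the ops
table is chosen, and not on every enable_signaling call.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct fence;

/* Hypothetical, cut-down stand-in for the kernel's fence ops table. */
struct fence_ops {
	int (*enable_signaling)(struct fence *fence);
};

/* Hypothetical, cut-down stand-in for the fence itself. */
struct fence {
	const struct fence_ops *ops;
	bool signaling_enabled;
};

/* Normal path: the signaler thread exists, so arm the signal. */
static int enable_signaling_with_thread(struct fence *fence)
{
	fence->signaling_enabled = true;
	return 0;
}

/* Degraded path: no signaler thread, so keep reporting an error.
 * -ENOMEM is a placeholder for the error kthread creation returned.
 */
static int enable_signaling_unavailable(struct fence *fence)
{
	(void)fence;
	return -ENOMEM;
}

static const struct fence_ops ops_with_signaler = {
	.enable_signaling = enable_signaling_with_thread,
};

static const struct fence_ops ops_without_signaler = {
	.enable_signaling = enable_signaling_unavailable,
};

/* The thread check happens exactly once, when the ops are selected. */
static void fence_init_sketch(struct fence *fence, bool have_signaler)
{
	fence->ops = have_signaler ? &ops_with_signaler
				   : &ops_without_signaler;
	fence->signaling_enabled = false;
}

/* Callers dispatch through the ops and never test for the thread. */
static int fence_enable_signaling_sketch(struct fence *fence)
{
	assert(fence->ops && fence->ops->enable_signaling);
	return fence->ops->enable_signaling(fence);
}
```

With this shape, a fence initialised with have_signaler == false keeps
returning the error from every enable_signaling call, while the common
path carries no extra conditional.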

Regards,

Tvrtko
