Re: [PATCH] RFC drm/i915: Slaughter the thundering i915_wait_request herd

On 27/11/15 13:01, Chris Wilson wrote:
On Fri, Nov 27, 2015 at 12:11:32PM +0000, Tvrtko Ursulin wrote:

Hi,

On 31/10/15 10:34, Chris Wilson wrote:
One particularly stressful scenario consists of many independent tasks
all competing for GPU time and waiting upon the results (e.g. realtime
transcoding of many, many streams). One bottleneck in particular is that
each client waits on its own results, but every client is woken up after
every batchbuffer - hence the thunder of hooves as then every client must
do its heavyweight dance to read a coherent seqno to see if it is the
lucky one. Alternatively, we can have one worker responsible for waking
after an interrupt, checking the seqno and only waking up the clients
who are complete. The disadvantage is that in the uncontended scenario
(i.e. only one waiter) we incur an extra context switch in the wakeup
path - though that should be mitigated somewhat by the busywait we do
first before sleeping.

Could you explain the design in a bit more detail in the commit message?

One worker responsible for waking up after an interrupt and waking
clients who are complete?

I think more is required. At least explain the request tree, how it is built and used, and the flow a bit.
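
Something roughly along the lines below is how I read it, and spelling it
out in the commit message and/or a comment block would help a lot. (Rough
sketch only; the helper and member names like irq_node and waiter are my
guesses, not the actual patch.)

/* Each waiter adds its request to a per-engine rb-tree keyed by seqno
 * and goes to sleep.  A single worker, woken from the user interrupt,
 * walks the tree in seqno order and wakes only the clients whose
 * requests have completed, instead of every client waking on every
 * interrupt.
 */
static void intel_engine_irq_worker(struct work_struct *work)
{
	struct intel_engine_cs *engine =
		container_of(work, struct intel_engine_cs, irq_work);
	struct rb_node *node;

	spin_lock(&engine->irq_lock);
	node = rb_first(&engine->irq_requests);
	while (node) {
		struct drm_i915_gem_request *rq =
			rb_entry(node, struct drm_i915_gem_request, irq_node);

		/* The tree is ordered by seqno, so the first incomplete
		 * request means everything after it is incomplete too.
		 */
		if (!i915_gem_request_completed(rq, false))
			break;

		node = rb_next(node);
		rb_erase(&rq->irq_node, &engine->irq_requests);
		wake_up_process(rq->waiter);	/* only the lucky one */
	}
	spin_unlock(&engine->irq_lock);
}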

And add some more comments in the code, where key things are
happening, new struct members, etc?

I did.

Maybe in some patch I don't see? The posted one has zero comments on the new members added to intel_engine_cs and incomplete comments for the same in drm_i915_gem_request.

And the code throughout has really too few comments relative to the complexity.

There is actually only one, about schedule(), which is not really the main point! :) There is nothing about the tree handling, state variables like irq_first, and similar.

+		timer.function = NULL;
+		if (fake_irq || missed_irq(engine)) {
+			setup_timer_on_stack(&timer,
+					     (void (*)(unsigned long))fake_irq,

Kaboom when it fires? :)

It was s/fake_irq/wake_up_process/ at some point; I obviously forgot to
merge that back in (which was the whole point of all that casting).

+					     (unsigned long)current);
+			mod_timer(&timer, jiffies + 1);
+		}

I still see no benefit in complicating things with a timer. It is
just a needlessly contrived way of doing a schedule_timeout(). And I
don't see that we would need to extend the mechanism for more precision,
since it only comes into play in such borderline conditions that it
doesn't matter.

I like the historical reasoning, having suffered the pain long ago. But
there is less reason not to use schedule_timeout(), unlike the only
recently introduced io_schedule_timeout(). Fine. Have it your way!

Yay! :)
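
Just to spell out what I had in mind, roughly (a sketch only, not a real
hunk; it assumes the surrounding loop has already set the task state):

		/* Instead of the on-stack timer + schedule(): let the
		 * timeout itself act as the fake/missed interrupt kick.
		 */
		if (fake_irq || missed_irq(engine))
			schedule_timeout(1);	/* wake ourselves after a jiffy */
		else
			schedule();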

+
+		/* Unlike the individual clients, we do not want this
+		 * background thread to contribute to the system load,
+		 * i.e. we do not want to use io_schedule() here.
+		 */
+		schedule();

I am also thinking about whether the busy spin makes more sense in wait
request or here. Hm... With that optimisation which makes only the
waiter on the next request in the queue spin, it is OK to do it
there.

No. I don't think that makes sense. To get here we have demonstrated that
we are in a long-wait sequence, and the code already has the
inclination to prevent busy-spins when we are waiting. Now we do introduce
yet another context switch between the irq and client, but most of the
latency is introduced by setting up the irq in the first place.

What doesn't make sense? To keep the busy spin, even optimized, in wait request after this patch?

(So please respin that patch series as well.)

In review there was only a quick rename, but I also want to change it to
a 10us timeout as that is favourable to more synchronous igt loads.

I thought I found a reversal of logic in "drm/i915: Only spin whilst waiting on the current request" which needs fixing?

But if that brings an improvement, would a short spin be beneficial
here as well? Less so, because waiters are already sleeping, but it
could help a bit I suppose.

Indeed, much less of an improvement due to the context switches.
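
For reference, the kind of bounded spin we are talking about is roughly
the below (sketch only; ~10us as Chris mentions, helper names
approximate):

	/* Poll the seqno for up to ~10us before giving up and sleeping;
	 * only worthwhile for the request at the head of the queue.
	 */
	ktime_t timeout = ktime_add_us(ktime_get_raw(), 10);

	while (!i915_gem_request_completed(req, true)) {
		if (ktime_after(ktime_get_raw(), timeout))
			break;
		if (need_resched())
			break;
		cpu_relax();
	}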

+void intel_engine_add_wakeup(struct drm_i915_gem_request *request)
+{
+	struct intel_engine_cs *engine = i915_gem_request_get_ring(request);
+
+	spin_lock(&engine->irq_lock);
+	if (request->irq_count++ == 0) {
+		struct rb_node **p, *parent;
+		bool first;
+
+		if (RB_EMPTY_ROOT(&engine->irq_requests))
+			schedule_work(&engine->irq_work);

Worth thinking about a dedicated, maybe even CPU local work queue
for maximum responsiveness?

Considered a kthread per ring. Downsides are that it consumes a slot for
mostly idle threads, which are what the kworkers are for. (I am not keen
on the complaints we would get for 6 kirq-i915/* threads.)

Also, I like the worker since when there is no one waiting it does not get woken up from notify_ring().

So I was just thinking, and wasn't sure, whether using the system wq makes it compete with other jobs. If so, a dedicated worker wouldn't harm. It could be made high priority as well. Not sure it would make sense to make it bound (to a CPU) any more; the period from sleep to wakeup is probably too long for locality effects.
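
E.g. something as simple as the below is what I meant (sketch only; the
i915_wq field name and where it lives are made up):

	/* One dedicated high priority (and for now unbound) workqueue
	 * for the per-engine irq workers, instead of the system wq.
	 */
	dev_priv->i915_wq = alloc_workqueue("i915-irq",
					    WQ_HIGHPRI | WQ_UNBOUND, 0);

	/* then in intel_engine_add_wakeup(), instead of schedule_work(): */
	queue_work(dev_priv->i915_wq, &engine->irq_work);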

What I have considered is doing intel_engine_add_wakeup() before the
spin_request. If we have a spare CPU, that CPU will be enabling the IRQ
whilst we do the busy spin. The complication is that doing the wakeup,
schedule and rb insert isn't particularly cheap.

Yes, I thought about it as well; best leave those experiments for later.
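
(Noting it down for when we get there; the ordering would roughly become,
with helper names approximate:)

	/* Arm the irq/wakeup first so a spare CPU can be enabling the
	 * interrupt while we busy-wait.
	 */
	intel_engine_add_wakeup(req);
	if (__i915_spin_request(req))
		goto complete;	/* done during the busy-wait, skip sleeping */
	/* otherwise fall through, sleep and let the irq worker wake us */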

Regards,

Tvrtko

P.S. And I just realised this work is competing with the scheduler, which changes all this again.


_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx



