Re: [PATCH v7] drm/i915: Slaughter the thundering i915_wait_request herd

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Tue, 8 Dec 2015 14:03:39 +0000

On 08/12/15 10:44, Chris Wilson wrote:
On Mon, Dec 07, 2015 at 03:08:28PM +0000, Tvrtko Ursulin wrote:
Hi,

On 03/12/15 16:22, Chris Wilson wrote:
One particularly stressful scenario consists of many independent tasks
all competing for GPU time and waiting upon the results (e.g. realtime
transcoding of many, many streams). One bottleneck in particular is that
each client waits on its own results, but every client is woken up after
every batchbuffer - hence the thunder of hooves as then every client must
do its heavyweight dance to read a coherent seqno to see if it is the
lucky one.

Ideally, we only want one client to wake up after the interrupt and
check its request for completion. Since the requests must retire in
order, we can select the first client on the oldest request to be woken.
Once that client has completed his wait, we can then wake up the
next client and so on.

v2: Convert from a kworker per engine into a dedicated kthread for the
bottom-half.
v3: Rename request members and tweak comments.
v4: Use a per-engine spinlock in the breadcrumbs bottom-half.
v5: Fix race in locklessly checking waiter status and kicking the task on
adding a new waiter.
v6: Fix deciding when to force the timer to hide missing interrupts.
v7: Move the bottom-half from the kthread to the first client process.
I like the idea a lot so I'll make some comments even before you
post v9 or later. :)

Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@xxxxxxxxx>
Cc: "Gong, Zhipeng" <zhipeng.gong@xxxxxxxxx>
Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx>
Cc: Dave Gordon <david.s.gordon@xxxxxxxxx>
---
  drivers/gpu/drm/i915/Makefile            |   1 +
  drivers/gpu/drm/i915/i915_debugfs.c      |   2 +-
  drivers/gpu/drm/i915/i915_drv.h          |   3 +-
  drivers/gpu/drm/i915/i915_gem.c          | 108 ++++++-----------
  drivers/gpu/drm/i915/i915_gpu_error.c    |   2 +-
  drivers/gpu/drm/i915/i915_irq.c          |  17 +--
  drivers/gpu/drm/i915/intel_breadcrumbs.c | 197 +++++++++++++++++++++++++++++++
  drivers/gpu/drm/i915/intel_lrc.c         |   5 +-
  drivers/gpu/drm/i915/intel_ringbuffer.c  |   5 +-
  drivers/gpu/drm/i915/intel_ringbuffer.h  |  47 +++++++-
  10 files changed, 297 insertions(+), 90 deletions(-)
  create mode 100644 drivers/gpu/drm/i915/intel_breadcrumbs.c

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index 0851de07bd13..d3b9d3618719 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -35,6 +35,7 @@ i915-y += i915_cmd_parser.o \
  	  i915_gem_userptr.o \
  	  i915_gpu_error.o \
  	  i915_trace_points.o \
+	  intel_breadcrumbs.o \
  	  intel_lrc.o \
  	  intel_mocs.o \
  	  intel_ringbuffer.o \
diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 0583eda642d7..68fbe4d1f91d 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -2323,7 +2323,7 @@ static int count_irq_waiters(struct drm_i915_private *i915)
  	int i;

  	for_each_ring(ring, i915, i)
-		count += ring->irq_refcount;
+		count += intel_engine_has_waiter(ring);
Don't care how many any longer? Okay in debugfs, but also in error capture?
It was nice that we were able to mark the requests with waiters for
free. We could add that, but it is no longer a side-effect of this
patch. Here I was tempted to do an intel_engine_count_breadcrumbs(), but
that doesn't match how the information is used by RPS so adding it to
debugfs for RPS is misleading.

-	/* Optimistic spin for the next jiffie before touching IRQs */
-	ret = __i915_spin_request(req, state);
-	if (ret == 0)
-		goto out;
+	if (INTEL_INFO(req->i915)->gen >= 6)
+		gen6_rps_boost(req->i915, rps, req->emitted_jiffies);

-	if (!irq_test_in_progress && WARN_ON(!ring->irq_get(ring))) {
-		ret = -ENODEV;
-		goto out;
+	/* Optimistic spin for the next jiffie before touching IRQs */
Exactly the opposite! :)
Bah. It's close, it certainly succinctly captures the intent if not the
implementation! s/jiffie/~jiffie/!

@@ -996,12 +996,11 @@ static void ironlake_rps_change_irq_handler(struct drm_device *dev)

  static void notify_ring(struct intel_engine_cs *ring)
  {
-	if (!intel_ring_initialized(ring))
+	if (ring->i915 == NULL)
Glanced something about this in some other thread, maybe best to
keep it consolidated in one patch?
This can die now. It was justifiable when I was using ring->i915
directly here, but now it is just (mostly) pointless churn.

+static bool irq_get(struct intel_engine_cs *engine)
+{
+	if (test_bit(engine->id, &engine->i915->gpu_error.test_irq_rings))
+		return false;
I don't understand why knowledge of this debugfs interface needs to
be sprinkled around?
It's not though. This is the only use - the interface is to inject
faults into the waiter that ensure we cannot wake up after an
interrupt, and so force the hangcheck to intervene. As you thinking of
its counterpart missed_irq, which we use to track when we need to force
the fake interrupt?

I mean, I am not even 100% sure what the
debugfs interface is for. But it looks to be about software masking
user interrupts on a ring? Why not just mask the interrupt delivery
based on this mask at the source?
It is to simulate a broken wait, as opposed to masking interrupts.
I don't see the difference between a broken wait and masking interrupts?

It is masking interrupts before and after this patch. Broken wait could 
have been accomplished by doing the test from wait when 
request_completed is called.
Anyway, I don't get it.

+bool intel_breadcrumbs_add_waiter(struct drm_i915_gem_request *request,
+				  struct intel_breadcrumb *wait)
+{
+	struct intel_breadcrumbs *b = &request->ring->breadcrumbs;
+	struct rb_node **p, *parent;
+	bool first = false;
+
+	wait->task = current;
+	wait->seqno = request->seqno;
+
+	spin_lock(&b->lock);
+
+	/* First? Unmask the user interrupt */
+	if (b->first_waiter == NULL)
+		queue_delayed_work(system_highpri_wq, &b->irq, 0);
Who enables interrupts if the worker gets scheduled and completes
before settig b->first_waiter below?
In this version, it is serialised by the spinlock. In the next version,
the ordering is explicit (no more delayed worker).

+	/* Insert the request into the retirment ordered list
+	 * of waiters by walking the rbtree. If we are the oldest
+	 * seqno in the tree (the first to be retired), then
+	 * set ourselves as the bottom-half.
+	 */
+	first = true;
+	parent = NULL;
+	p = &b->requests.rb_node;
+	while (*p) {
+		parent = *p;
+		if (i915_seqno_passed(wait->seqno, to_wait(parent)->seqno)) {
+			p = &parent->rb_right;
+			first = false;
+		} else
+			p = &parent->rb_left;
+	}
+	rb_link_node(&wait->node, parent, p);
+	rb_insert_color(&wait->node, &b->requests);
WARN_ON(!first && !b->first_waiter) maybe?
We will get a GPF on b->first_waiter soon enough :)

+	if (first)
+		b->first_waiter = wait->task;
BUG_ON(b->first_waiter == NULL);

would be clearer I guess.

+
+	spin_unlock(&b->lock);
+
+	return first;
+}
+
+void intel_breadcrumbs_remove_waiter(struct drm_i915_gem_request *request,
+				     struct intel_breadcrumb *wait)
+{
+	struct intel_breadcrumbs *b = &request->ring->breadcrumbs;
+
+	spin_lock(&b->lock);
+
+	if (b->first_waiter == wait->task) {
+		struct intel_breadcrumb *next;
+		struct task_struct *task;
+
+		/* We are the current bottom-half. Find the next candidate,
+		 * the first waiter in the queue on the remaining oldest
+		 * request. As multiple seqnos may complete in the time it
+		 * takes us to wake up and find the next waiter, we have to
+		 * wake up that waiter for it to perform its own coherent
+		 * completion check.
+		 */
Equally, why wouldn't we wake up all waiters for which the requests
have been completed?
Because we no longer track the requests to be completed, having moved to
a chain of waiting processes instead of a chain of requests. I could
insert a waitqueue into the intel_breadcrumb and that would indeed
necessitate locking in the irq handler (and irq locks everywhere :(.
You have a tree of seqnos each with a wait->task and current seqno on 
the engine can be queried. So I don't see where is the problem?
Would be a cheap check here and it would save a cascading growing
latency as one task wakes up the next one.
Well, it can't be here since we may remove_waiter after a signal
(incomplete wait). So this part has to walk the chain of processes. Ugh,
and have to move the waitqueue from one waiter to the next...
Ok on interrupted waiters it makes no sense, but on normal waiter 
removal it would just mean comparing engine->get_seqno() vs the first 
waiter seqno and waking up all until the uncompleted one is found?
Even more benefit for multiple waiters on the same request.
Yes. That's what I liked about the previous version with the independent
waitqueue. However, latency of a single waiter is a compelling argument.

My head is still full of cold, maybe embedding a waitqueue into the
breadcrumb is an improvement - I am not sure yet.
Leave it for some other day then.

+	/* Rather than have every client wait upon all user interrupts,
+	 * with the herd waking after every interrupt and each doing the
+	 * heavyweight seqno dance, we delegate the task to the first client.
+	 * After the interrupt, we wake up one client, who does the heavyweight
+	 * coherent seqno read and either goes back to sleep (if incomplete),
+	 * or wakes up the next client in the queue and so forth.
+	 *
+	 * With respect to walking the entire list of waiters in a single
s/With respect to/In contrast to/ or "Compared to"?
"Compared to", thanks.

+	 * dedicated bottom-half, we reduce the latency of the first waiter
+	 * by avoiding a context switch, but incur additional coherent seqno
+	 * reads when following the chain of request breadcrumbs.
+	 */
+	struct intel_breadcrumbs {
+		spinlock_t lock; /* protects the lists of requests */
+		struct rb_root requests; /* sorted by retirement */
+		struct task_struct *first_waiter; /* bh for user interrupts */
+		struct delayed_work irq; /* used to enable or fake inerrupts */
+		bool irq_enabled : 1;
+		bool rpm_wakelock : 1;
Are bitfields worth it? There aren't that many engine so maybe it
loses more in terms of instructions generated.
I'm always optimistic that gcc gets it's act together. Not that I don't
have patches converting from bitfields to flags when gcc hurts. Here,
the potential extra instruction is going to dwarfed by the irq spinlocks
and worse along these paths.

+static inline void intel_engine_wakeup(struct intel_engine_cs *engine)
+{
+	struct task_struct *task = READ_ONCE(engine->breadcrumbs.first_waiter);
+	if (task)
+		wake_up_process(task);
What guarantees task is a valid task at this point?
Not an awful lot! Indirectly, I can point to the that the task struct
cannot be freed whilst we are in an irq (thanks to rcu), but other than
that it is simply that there is a much longer path between clearing the
first_waiter and freeing the task_struct than doing a wake_up_process on
the running task.

I know it is so extremely unlikely, but it can theoretically sample
the first_waiter which then exists and disappears bafore the
wake_up_process call.
I can leave a comment: "I told you so -- Tvrtko" :)
Hm, it possibly needs a synchronize_irq before remove_waiter exits then?

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx