Re: [PATCH 08/15] drm/i915: Slaughter the thundering i915_wait_request herd

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Mon, 30 Nov 2015 13:32:16 +0000





On 30/11/15 12:30, Chris Wilson wrote:
On Mon, Nov 30, 2015 at 12:05:50PM +0000, Tvrtko Ursulin wrote:
+	struct intel_breadcrumbs {
+		spinlock_t lock; /* protects the per-engine request list */

s/list/tree/

list. We use the tree as a list!

Maybe but it confuses the reader a bit who goes and looks for some list 
then.

Related, when I was playing with this I was piggy backing on the 
existing per engine request list.

If they are seqno ordered there, and I think they are, then the waker 
thread just traverses it in order until hitting the not completed one, 
waking up every req->wait_queue as it goes.

In this case it doesn't matter that the waiters come in in 
indeterministic order so we don't need a (new) tree.

But the downside is the thread needs struct mutex. And that the 
retirement from that list is so lazy..

+		struct task_struct *task; /* bh for all user interrupts */
+		struct intel_breadcrumbs_engine {
+			struct rb_root requests; /* sorted by retirement */
+			struct drm_i915_gem_request *first;

/* First request in the tree to be completed next by the hw. */

?

/* first */ ?

Funny? :) First what? Why make the reader reverse engineer it?

+		} engine[I915_NUM_RINGS];
+	} breadcrumbs;
+
  	struct i915_virtual_gpu vgpu;

  	struct intel_guc guc;
@@ -2228,6 +2250,11 @@ struct drm_i915_gem_request {
  	/** global list entry for this request */
  	struct list_head list;

+	/** List of clients waiting for completion of this request */
+	wait_queue_head_t wait;
+	struct rb_node irq_node;
+	unsigned irq_count;

To me irq prefix does not fit here that well.

req_tree_node?

waiter_count, waiters?

waiters, breadcrumb_node, waiters_count ?

Deal.

  	for (;;) {
-		struct timer_list timer;
-
-		prepare_to_wait(&ring->irq_queue, &wait, state);
+		prepare_to_wait(&req->wait, &wait, state);

-		/* We need to check whether any gpu reset happened in between
-		 * the caller grabbing the seqno and now ... */
-		if (req->reset_counter != i915_reset_counter(&dev_priv->gpu_error)) {
-			/* As we do not requeue the request over a GPU reset,
-			 * if one does occur we know that the request is
-			 * effectively complete.
-			 */
-			ret = 0;
-			break;
-		}
-
-		if (i915_gem_request_completed(req, false)) {
+		if (i915_gem_request_completed(req, true) ||

Why it is OK to change the lazy coherency mode here?

We bang on the coherency after the IRQ. Here, we are woken by the IRQ
waiter so we know that the IRQ/seqno coherency is fine and we can just
check the CPU cache.

Makes sense, put into comment? There is space now since function is much 
shorter. :)

+		    req->reset_counter != i915_reset_counter(&req->i915->gpu_error)) {

It looks like it would be valuable to preserve the comment about no
re-queuing etc on GPU reset.

I can drop it from the previous patch :)

No it is useful!

+		if (timeout) {
+			timeout_remain = io_schedule_timeout(timeout_remain);
+			if (timeout_remain == 0) {
+				ret = -ETIME;
+				break;
+			}
+		} else
+			io_schedule();

Could set timeout_remain to that MAX value when timeout == NULL and
have a single call site to io_schedule_timeout and less indentation.
It doesn't matter hugely, maybe it would be a bit easier to read.

Done. io_schedule() calls io_schedule_timeout(), whereas
schedule_timeout() calls schedule()!

Who disagreed with that? :) Ok!

@@ -900,7 +901,7 @@ static void i915_record_ring_state(struct drm_device *dev,
  		ering->instdone = I915_READ(GEN2_INSTDONE);
  	}

-	ering->waiting = waitqueue_active(&ring->irq_queue);
+	ering->waiting = READ_ONCE(dev_priv->breadcrumbs.engine[ring->id].first);

What does the READ_ONCE do here?

This element is protected by the breadcrumbs spinlock, the READ_ONCE is
documentation that we don't care about the lock here.

Hm I find it confusing but OK.

  static void notify_ring(struct intel_engine_cs *ring)
  {
-	if (!intel_ring_initialized(ring))
+	if (ring->i915 == NULL)
  		return;

  	trace_i915_gem_request_notify(ring);
-
-	wake_up_all(&ring->irq_queue);
+	wake_up_process(ring->i915->breadcrumbs.task);

For a later extension:

How would you view, some time in the future, adding a bool return to
ring->put_irq() which would say to the thread that it failed to
disable interrupts?

In this case thread would exit and would need to be restarted when
the next waiter queues up. Possibly extending the
idle->put_irq->exit durations as well then.

Honestly, not keen on the idea that the lowlevel put_irq can fail (makes
cleanup much more fraught). I don't see what improvement you intend
here, maybe clarifying that would help explain what you mean.

Don't know, maybe reflecting the fact it wasn't the last put_irq call so 
let the caller handle it if they want. We can leave this discussion for 
the future.

@@ -2986,16 +2981,16 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
  			if (ring_idle(ring, seqno)) {
  				ring->hangcheck.action = HANGCHECK_IDLE;

-				if (waitqueue_active(&ring->irq_queue)) {
+				if (READ_ONCE(dev_priv->breadcrumbs.engine[ring->id].first)) {

READ_ONCE is again strange, but other than that I don't know the
hangcheck code to really evaulate it.

Perhaps this also means this line needs a comment describing what
condition/state is actually approximating with the check.

All we are doing is asking if anyone is waiting on the GPU, because the
GPU has finished all of its work. If so, the IRQ handling is suspect
(modulo the pre-existing race condition clearly earmarked by READ_ONCE).

Cool, /* Check if someone is waiting on the GPU */ then would make me happy.

+	while (!kthread_should_stop()) {
+		unsigned reset_counter = i915_reset_counter(&i915->gpu_error);
+		unsigned long timeout_jiffies;
+		bool idling = false;
+
+		/* On every tick, walk the seqno-ordered list of requests
+		 * and for each retired request wakeup its clients. If we
+		 * find an unfinished request, go back to sleep.
+		 */

s/tick/loop/ ?
s/tick/iteration/ I'm sticking with tick

Tick makes me tick of a scheduler tick so to me it is the worst of the 
three. Iteration sounds really good. Whether that will be a 
tick/jiffie/orsomething is definite a bit lower in the code.

And if we don't find and unfinished request still go back to sleep. :)

Ok.

+		set_current_state(TASK_INTERRUPTIBLE);
+
+		/* Note carefully that we do not hold a reference for the
+		 * requests on this list. Therefore we only inspect them
+		 * whilst holding the spinlock to ensure that they are not
+		 * freed in the meantime, and the client must remove the
+		 * request from the list if it is interrupted (before it
+		 * itself releases its reference).
+		 */
+		spin_lock(&b->lock);

This lock is more per engine that global in its nature unless you
think it was more efficient to do fewer lock atomics vs potential
overlap of waiter activity per engines?

Indeed. I decided not to care about making it per-engine.

Would need a new lock for req->irq_count, per request. So
req->lock/req->irq_count and be->lock for the tree operations.

Why? A request is tied to an individual engine, therefore we can use
that breadcrumb_engine.lock for the request.

Just so 2nd+ waiters would not touch the per engine lock.

Thread would only need to deal with the later. Feels like it would
work, just not sure if it is worth it.

That was my feeling as well. Not sure if we will have huge contention
that would be solved with per-engine spinlocks, whilst we still have
struct_mutex.

+		for (i = 0; i < I915_NUM_RINGS; i++) {
+			struct intel_engine_cs *engine = &i915->ring[i];
+			struct intel_breadcrumbs_engine *be = &b->engine[i];
+			struct drm_i915_gem_request *request = be->first;
+
+			if (request == NULL) {
+				if ((irq_get & (1 << i))) {
+					if (irq_enabled & (1 << i)) {
+						engine->irq_put(engine);
+						irq_enabled &= ~ (1 << i);
+					}
+					intel_runtime_pm_put(i915);
+					irq_get &= ~(1 << i);
+				}
+				continue;
+			}
+
+			if ((irq_get & (1 << i)) == 0) {
+				intel_runtime_pm_get(i915);
+				irq_enabled |= __irq_enable(engine) << i;
+				irq_get |= 1 << i;
+			}

irq_get and irq_enabled are creating a little bit of a headache for
me. And that would probably mean a comment is be in order to explain
what they are for and how they work.

Like this I don't see the need for irq_get to persist across loops?

Because getting the irq is quite slow, we add a jiffie shadow.

But it is not a shadow in the sense of the same bitmask from the 
previous iteration. The bits are different depending on special cases in 
__irq_enable.

So it needs some comments I think.

irq_get looks like "try to get these", irq_enabled is "these ones I
got". Apart for the weird usage of it to balance the runtime pm gets
and puts.

A bit confusing as I said.

I don't like the names, but haven't thought of anything better.

+bool intel_breadcrumbs_add_waiter(struct drm_i915_gem_request *request)
+{
+	struct intel_breadcrumbs *b = &request->i915->breadcrumbs;
+	bool first = false;
+
+	spin_lock(&b->lock);
+	if (request->irq_count++ == 0) {
+		struct intel_breadcrumbs_engine *be =
+			&b->engine[request->ring->id];
+		struct rb_node **p, *parent;
+
+		if (be->first == NULL)
+			wake_up_process(b->task);

With a global b->lock

or even per engine

it makes no difference but in the logical
sense this would usually go last, after the request has been
inserted into the tree and waiter initialized.

Except that the code is simpler with kicking the task first.

Since we both thought it might make sense with a per-engine lock, let's
go with it. Maybe one day it will pay off.

Ok.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx