Re: [PATCH 26/28] drm/i915: Fair low-latency scheduling

Thomas Hellström (Intel) <thomas_os@xxxxxxxxxxxx> · Tue, 16 Jun 2020 11:07:28 +0200

Hi, Chris,

Some comments and questions:

On 6/8/20 12:21 AM, Chris Wilson wrote:
The first "scheduler" was a topographical sorting of requests into
priority order. The execution order was deterministic, the earliest
submitted, highest priority request would be executed first. Priority
inherited ensured that inversions were kept at bay, and allowed us to
dynamically boost priorities (e.g. for interactive pageflips).

The minimalistic timeslicing scheme was an attempt to introduce fairness
between long running requests, by evicting the active request at the end
of a timeslice and moving it to the back of its priority queue (while
ensuring that dependencies were kept in order). For short running
requests from many clients of equal priority, the scheme is still very
much FIFO submission ordering, and as unfair as before.

To impose fairness, we need an external metric that ensures that clients
are interpersed, we don't execute one long chain from client A before
executing any of client B. This could be imposed by the clients by using
a fences based on an external clock, that is they only submit work for a
"frame" at frame-interval, instead of submitting as much work as they
are able to. The standard SwapBuffers approach is akin to double
bufferring, where as one frame is being executed, the next is being
submitted, such that there is always a maximum of two frames per client
in the pipeline. Even this scheme exhibits unfairness under load as a
single client will execute two frames back to back before the next, and
with enough clients, deadlines will be missed.

The idea introduced by BFS/MuQSS is that fairness is introduced by
metering with an external clock. Every request, when it becomes ready to
execute is assigned a virtual deadline, and execution order is then
determined by earliest deadline. Priority is used as a hint, rather than
strict ordering, where high priority requests have earlier deadlines,
but not necessarily earlier than outstanding work. Thus work is executed
in order of 'readiness', with timeslicing to demote long running work.

The Achille's heel of this scheduler is its strong preference for
low-latency and favouring of new queues. Whereas it was easy to dominate
the old scheduler by flooding it with many requests over a short period
of time, the new scheduler can be dominated by a 'synchronous' client
that waits for each of its requests to complete before submitting the
next. As such a client has no history, it is always considered
ready-to-run and receives an earlier deadline than the long running
requests.

Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
---
  drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  12 +-
  .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
  drivers/gpu/drm/i915/gt/intel_engine_pm.c     |   4 +-
  drivers/gpu/drm/i915/gt/intel_engine_types.h  |  24 --
  drivers/gpu/drm/i915/gt/intel_lrc.c           | 328 +++++++-----------
  drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   5 +-
  drivers/gpu/drm/i915/gt/selftest_lrc.c        |  43 ++-
  .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   6 +-
  drivers/gpu/drm/i915/i915_priolist_types.h    |   7 +-
  drivers/gpu/drm/i915/i915_request.h           |   4 +-
  drivers/gpu/drm/i915/i915_scheduler.c         | 322 ++++++++++++-----
  drivers/gpu/drm/i915/i915_scheduler.h         |  22 +-
  drivers/gpu/drm/i915/i915_scheduler_types.h   |  17 +
  .../drm/i915/selftests/i915_mock_selftests.h  |   1 +
  drivers/gpu/drm/i915/selftests/i915_request.c |   1 +
  .../gpu/drm/i915/selftests/i915_scheduler.c   |  49 +++
  16 files changed, 484 insertions(+), 362 deletions(-)
  create mode 100644 drivers/gpu/drm/i915/selftests/i915_scheduler.c

Do we have timings to back this change up? Would it make sense to have a 
configurable scheduler choice?

@@ -1096,22 +1099,30 @@ __unwind_incomplete_requests(struct intel_engine_cs *engine)
  {
  	struct i915_request *rq, *rn, *active = NULL;
  	struct list_head *uninitialized_var(pl);
-	int prio = I915_PRIORITY_INVALID;
+	u64 deadline = I915_DEADLINE_NEVER;
  
  	lockdep_assert_held(&engine->active.lock);
  
  	list_for_each_entry_safe_reverse(rq, rn,
  					 &engine->active.requests,
  					 sched.link) {
-		if (i915_request_completed(rq))
+		if (i915_request_completed(rq)) {
+			list_del_init(&rq->sched.link);
  			continue; /* XXX */
+		}

Is this an unrelated change? If so separate patch?

...


@@ -2162,14 +2140,13 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
  			__unwind_incomplete_requests(engine);
  
  			last = NULL;
-		} else if (need_timeslice(engine, last, ve) &&
-			   timeslice_expired(execlists, last)) {
+		} else if (timeslice_expired(engine, last)) {
  			ENGINE_TRACE(engine,
-				     "expired last=%llx:%lld, prio=%d, hint=%d, yield?=%s\n",
-				     last->fence.context,
-				     last->fence.seqno,
-				     last->sched.attr.priority,
-				     execlists->queue_priority_hint,
+				     "expired:%s last=%llx:%llu, deadline=%llu, now=%llu, yield?=%s\n",
+				     yesno(timer_expired(&execlists->timer)),
+				     last->fence.context, last->fence.seqno,
+				     rq_deadline(last),
+				     i915_sched_to_ticks(ktime_get()),
  				     yesno(timeslice_yield(execlists, last)));

There are multiple introductions of ktime_get() in the patch. Perhaps 
use monotonic clock source like ktime_get_raw()? Also immediately 
convert to ns.

...

@@ -2837,10 +2788,7 @@ static void __execlists_unhold(struct i915_request *rq)
  		GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
  
  		i915_request_clear_hold(rq);
-		list_move_tail(&rq->sched.link,
-			       i915_sched_lookup_priolist(rq->engine,
-							  rq_prio(rq)));
-		set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+		submit |= intel_engine_queue_request(rq->engine, rq);

As new to this codebase, I immediately wonder whether that bitwise or is 
intentional and whether you got the short-circuiting right. It looks 
correct to me.

...

diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 118ab6650d1f..23594e712292 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -561,7 +561,7 @@ static inline void i915_request_clear_hold(struct i915_request *rq)
  }
  
  static inline struct intel_timeline *
-i915_request_timeline(struct i915_request *rq)
+i915_request_timeline(const struct i915_request *rq)
  {
  	/* Valid only while the request is being constructed (or retired). */
  	return rcu_dereference_protected(rq->timeline,
@@ -576,7 +576,7 @@ i915_request_gem_context(struct i915_request *rq)
  }
  
  static inline struct intel_timeline *
-i915_request_active_timeline(struct i915_request *rq)
+i915_request_active_timeline(const struct i915_request *rq)

Are these unrelated? Separate patch?



  {
  	/*
  	 * When in use during submission, we are protected by a guarantee that
diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
index 4c189b81cc62..30bcb6f9d99f 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/i915_scheduler.c
@@ -20,6 +20,11 @@ static struct i915_global_scheduler {
  static DEFINE_SPINLOCK(ipi_lock);
  static LIST_HEAD(ipi_list);
  
+static inline u64 rq_deadline(const struct i915_request *rq)
+{
+	return READ_ONCE(rq->sched.deadline);
+}
+

Does this need a release barrier paired with an acquire barrier in 
__i915_request_set_deadline below?

+
+static bool __i915_request_set_deadline(struct i915_request *rq, u64 deadline)
+{
+	struct intel_engine_cs *engine = rq->engine;
+	struct i915_request *rn;
+	struct list_head *plist;
+	LIST_HEAD(dfs);
+
+	lockdep_assert_held(&engine->active.lock);
+	list_add(&rq->sched.dfs, &dfs);
+
+	list_for_each_entry(rq, &dfs, sched.dfs) {
+		struct i915_dependency *p;
+
+		GEM_BUG_ON(rq->engine != engine);
+
+		for_each_signaler(p, rq) {
+			struct i915_request *s =
+				container_of(p->signaler, typeof(*s), sched);
+
+			GEM_BUG_ON(s == rq);
+
+			if (rq_deadline(s) <= deadline)
+				continue;
+
+			if (i915_request_completed(s))
+				continue;
+
+			if (s->engine != rq->engine) {
+				spin_lock(&ipi_lock);
+				if (deadline < p->ipi_deadline) {
+					p->ipi_deadline = deadline;
+					list_move(&p->ipi_link, &ipi_list);
+					irq_work_queue(&ipi_work);
+				}
+				spin_unlock(&ipi_lock);
+				continue;
+			}
+
+			list_move_tail(&s->sched.dfs, &dfs);
+		}
+	}
+
+	plist = i915_sched_lookup_priolist(engine, deadline);
+
+	/* Fifo and depth-first replacement ensure our deps execute first */
+	list_for_each_entry_safe_reverse(rq, rn, &dfs, sched.dfs) {
+		GEM_BUG_ON(rq->engine != engine);
+		GEM_BUG_ON(deadline > rq_deadline(rq));
+
+		INIT_LIST_HEAD(&rq->sched.dfs);
+		WRITE_ONCE(rq->sched.deadline, deadline);

An smp barrier needed?

...

+static u64 prio_slice(int prio)
  {
-	const struct i915_request *inflight;
+	u64 slice;
+	int sf;
  
  	/*
-	 * We only need to kick the tasklet once for the high priority
-	 * new context we add into the queue.
+	 * With a 1ms scheduling quantum:
+	 *
+	 *   MAX USER:  ~32us deadline
+	 *   0:         ~16ms deadline
+	 *   MIN_USER: 1000ms deadline
  	 */
-	if (prio <= engine->execlists.queue_priority_hint)
-		return;
  
-	rcu_read_lock();
+	if (prio >= __I915_PRIORITY_KERNEL__)
+		return INT_MAX - prio;
  
-	/* Nothing currently active? We're overdue for a submission! */
-	inflight = execlists_active(&engine->execlists);
-	if (!inflight)
-		goto unlock;
+	slice = __I915_PRIORITY_KERNEL__ - prio;
+	if (prio >= 0)
+		sf = 20 - 6;
+	else
+		sf = 20 - 1;
+
+	return slice << sf;
+}
+

Is this the same deadline calculation as used in the BFS? Could you 
perhaps add a pointer to some documentation?


/Thomas


_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx