Re: [PATCH 26/28] drm/i915: Fair low-latency scheduling

Thomas Hellström (Intel) <thomas_os@xxxxxxxxxxxx> · Tue, 16 Jun 2020 14:11:25 +0200

Hi,

On 6/16/20 12:12 PM, Chris Wilson wrote:
Quoting Thomas Hellström (Intel) (2020-06-16 10:07:28)
Hi, Chris,

Some comments and questions:

On 6/8/20 12:21 AM, Chris Wilson wrote:
The first "scheduler" was a topographical sorting of requests into
priority order. The execution order was deterministic, the earliest
submitted, highest priority request would be executed first. Priority
inherited ensured that inversions were kept at bay, and allowed us to
dynamically boost priorities (e.g. for interactive pageflips).

The minimalistic timeslicing scheme was an attempt to introduce fairness
between long running requests, by evicting the active request at the end
of a timeslice and moving it to the back of its priority queue (while
ensuring that dependencies were kept in order). For short running
requests from many clients of equal priority, the scheme is still very
much FIFO submission ordering, and as unfair as before.

To impose fairness, we need an external metric that ensures that clients
are interpersed, we don't execute one long chain from client A before
executing any of client B. This could be imposed by the clients by using
a fences based on an external clock, that is they only submit work for a
"frame" at frame-interval, instead of submitting as much work as they
are able to. The standard SwapBuffers approach is akin to double
bufferring, where as one frame is being executed, the next is being
submitted, such that there is always a maximum of two frames per client
in the pipeline. Even this scheme exhibits unfairness under load as a
single client will execute two frames back to back before the next, and
with enough clients, deadlines will be missed.

The idea introduced by BFS/MuQSS is that fairness is introduced by
metering with an external clock. Every request, when it becomes ready to
execute is assigned a virtual deadline, and execution order is then
determined by earliest deadline. Priority is used as a hint, rather than
strict ordering, where high priority requests have earlier deadlines,
but not necessarily earlier than outstanding work. Thus work is executed
in order of 'readiness', with timeslicing to demote long running work.

The Achille's heel of this scheduler is its strong preference for
low-latency and favouring of new queues. Whereas it was easy to dominate
the old scheduler by flooding it with many requests over a short period
of time, the new scheduler can be dominated by a 'synchronous' client
that waits for each of its requests to complete before submitting the
next. As such a client has no history, it is always considered
ready-to-run and receives an earlier deadline than the long running
requests.

Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
---
   drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  12 +-
   .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
   drivers/gpu/drm/i915/gt/intel_engine_pm.c     |   4 +-
   drivers/gpu/drm/i915/gt/intel_engine_types.h  |  24 --
   drivers/gpu/drm/i915/gt/intel_lrc.c           | 328 +++++++-----------
   drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   5 +-
   drivers/gpu/drm/i915/gt/selftest_lrc.c        |  43 ++-
   .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   6 +-
   drivers/gpu/drm/i915/i915_priolist_types.h    |   7 +-
   drivers/gpu/drm/i915/i915_request.h           |   4 +-
   drivers/gpu/drm/i915/i915_scheduler.c         | 322 ++++++++++++-----
   drivers/gpu/drm/i915/i915_scheduler.h         |  22 +-
   drivers/gpu/drm/i915/i915_scheduler_types.h   |  17 +
   .../drm/i915/selftests/i915_mock_selftests.h  |   1 +
   drivers/gpu/drm/i915/selftests/i915_request.c |   1 +
   .../gpu/drm/i915/selftests/i915_scheduler.c   |  49 +++
   16 files changed, 484 insertions(+), 362 deletions(-)
   create mode 100644 drivers/gpu/drm/i915/selftests/i915_scheduler.c
Do we have timings to back this change up? Would it make sense to have a
configurable scheduler choice?
Yes, there's igt/benchmarks/gem_wsim to show the impact on scheduling
decisions for various workloads. (You can guess what the impact of
choosing a different execution order and forcing more context switches
will be... About -1% to throughput with multiple clients) And
igt/tests/gem_exec_schedule to test basic properties, with a bunch of new
fairness tests to try and decide if this is the right thing. Under
saturated conditions, there is no contest, a fair scheduler produces
consistent results, and the vdeadlines allow for realtime-response under
load.

Yeah, it's not really to convince me, but to provide a reference for the 
future.

Perhaps add the gem_wsim timings to the commit message?


There are multiple introductions of ktime_get() in the patch. Perhaps
use monotonic clock source like ktime_get_raw()? Also immediately
convert to ns.
ktime_get() is monotonic. The only difference is that tkr_mono has an
wall-offset that tkr_raw does not. [I'm sure there's a good reason.] The
choice is really whether ktime_get_(mono|raw)_fast_ns() is sufficient for
our needs.

Hmm. Yes you're right. I was thinking about the NTP adjustments. But 
given the requirement below they might be unimportant.


I do like the idea of having the deadline being some recognisable
timestamp, as it makes it easier to play with mixing in real, albeit
soft, deadlines.



@@ -2837,10 +2788,7 @@ static void __execlists_unhold(struct i915_request *rq)
               GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
   
               i915_request_clear_hold(rq);
-             list_move_tail(&rq->sched.link,
-                            i915_sched_lookup_priolist(rq->engine,
-                                                       rq_prio(rq)));
-             set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+             submit |= intel_engine_queue_request(rq->engine, rq);
As new to this codebase, I immediately wonder whether that bitwise or is
intentional and whether you got the short-circuiting right. It looks
correct to me.
bool submit, not many bits :)

Yes, the code is correct. My question was related to whether it was 
accepted practice, considering a future reader may think it might have 
been a mistake and change it anyway....


   {
       /*
        * When in use during submission, we are protected by a guarantee that

diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
index 4c189b81cc62..30bcb6f9d99f 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/i915_scheduler.c
@@ -20,6 +20,11 @@ static struct i915_global_scheduler {
   static DEFINE_SPINLOCK(ipi_lock);
   static LIST_HEAD(ipi_list);
   
+static inline u64 rq_deadline(const struct i915_request *rq)
+{
+     return READ_ONCE(rq->sched.deadline);
+}
+
Does this need a release barrier paired with an acquire barrier in
__i915_request_set_deadline below?
No, the state can be inconsistent. If it changes as we are processing
the previous value, there will be another reschedule. Within
set_deadline, rq->sched.deadline is under the engine->active.lock, it is
just that rq_deadline() is used to peek before we take the lock, as well
as shorthand within the critical section.
  

OK, understood.


+static u64 prio_slice(int prio)
   {
-     const struct i915_request *inflight;
+     u64 slice;
+     int sf;
   
       /*
-      * We only need to kick the tasklet once for the high priority
-      * new context we add into the queue.
+      * With a 1ms scheduling quantum:
+      *
+      *   MAX USER:  ~32us deadline
+      *   0:         ~16ms deadline
+      *   MIN_USER: 1000ms deadline
        */
-     if (prio <= engine->execlists.queue_priority_hint)
-             return;
   
-     rcu_read_lock();
+     if (prio >= __I915_PRIORITY_KERNEL__)
+             return INT_MAX - prio;
   
-     /* Nothing currently active? We're overdue for a submission! */
-     inflight = execlists_active(&engine->execlists);
-     if (!inflight)
-             goto unlock;
+     slice = __I915_PRIORITY_KERNEL__ - prio;
+     if (prio >= 0)
+             sf = 20 - 6;
+     else
+             sf = 20 - 1;
+
+     return slice << sf;
+}
+
Is this the same deadline calculation as used in the BFS? Could you
perhaps add a pointer to some documentation?
It is a heuristic. The scale factor in BFS is designed for a smaller
range and is not effective for passing our existing priority ordering
tests.

The challenge is to pick something that is fair that roughly matches
usage. It basically says that if client A submits 3 requests, then
client B, C will be able to run before the later requests of client A so
long as they are submitted within 16ms. Currently we get AAABC,
the vdeadlines turn that into ABCAA. So we would ideally like the quota
for each client to reflect their needs, so if client A needed all 3
requests within 16ms, it would have a vdeadline closer to 5ms (and so it
would compete for the GPU against other clients). Now with this less
strict priority system we can let normal userspace bump their
priorities, or we can use the average context runtime to try and adjust
priorities on the fly (i.e. do not used an unbias quota). But I suspect
removing any fairness will skew the scheduler once more.

OK, also for future reference, it would be good to have at least a 
subset of this documented somewhere!

Thanks,

Thomas



-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx