Quoting Thomas Hellström (Intel) (2020-06-16 13:11:25)
> Hi,
>
> On 6/16/20 12:12 PM, Chris Wilson wrote:
> > Quoting Thomas Hellström (Intel) (2020-06-16 10:07:28)
> >> Hi, Chris,
> >>
> >> Some comments and questions:
> >>
> >> On 6/8/20 12:21 AM, Chris Wilson wrote:
> >>> The first "scheduler" was a topographical sorting of requests into
> >>> priority order. The execution order was deterministic, the earliest
> >>> submitted, highest priority request would be executed first. Priority
> >>> inheritance ensured that inversions were kept at bay, and allowed us to
> >>> dynamically boost priorities (e.g. for interactive pageflips).
> >>>
> >>> The minimalistic timeslicing scheme was an attempt to introduce fairness
> >>> between long running requests, by evicting the active request at the end
> >>> of a timeslice and moving it to the back of its priority queue (while
> >>> ensuring that dependencies were kept in order). For short running
> >>> requests from many clients of equal priority, the scheme is still very
> >>> much FIFO submission ordering, and as unfair as before.
> >>>
> >>> To impose fairness, we need an external metric that ensures that clients
> >>> are interspersed, so that we don't execute one long chain from client A
> >>> before executing any of client B. This could be imposed by the clients
> >>> using fences based on an external clock, that is they only submit work
> >>> for a "frame" at frame-interval, instead of submitting as much work as
> >>> they are able to. The standard SwapBuffers approach is akin to double
> >>> buffering, where, as one frame is being executed, the next is being
> >>> submitted, such that there is always a maximum of two frames per client
> >>> in the pipeline. Even this scheme exhibits unfairness under load, as a
> >>> single client will execute two frames back to back before the next, and
> >>> with enough clients, deadlines will be missed.
> >>>
> >>> The idea introduced by BFS/MuQSS is that fairness is introduced by
> >>> metering with an external clock. Every request, when it becomes ready to
> >>> execute, is assigned a virtual deadline, and execution order is then
> >>> determined by earliest deadline. Priority is used as a hint, rather than
> >>> strict ordering, where high priority requests have earlier deadlines,
> >>> but not necessarily earlier than outstanding work. Thus work is executed
> >>> in order of 'readiness', with timeslicing to demote long running work.
> >>>
> >>> The Achilles' heel of this scheduler is its strong preference for
> >>> low latency and favouring of new queues. Whereas it was easy to dominate
> >>> the old scheduler by flooding it with many requests over a short period
> >>> of time, the new scheduler can be dominated by a 'synchronous' client
> >>> that waits for each of its requests to complete before submitting the
> >>> next. As such a client has no history, it is always considered
> >>> ready-to-run and receives an earlier deadline than the long running
> >>> requests.
> >>>
> >>> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> >>> ---
> >>>  drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  12 +-
> >>>  .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
> >>>  drivers/gpu/drm/i915/gt/intel_engine_pm.c     |   4 +-
> >>>  drivers/gpu/drm/i915/gt/intel_engine_types.h  |  24 --
> >>>  drivers/gpu/drm/i915/gt/intel_lrc.c           | 328 +++++++-----------
> >>>  drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   5 +-
> >>>  drivers/gpu/drm/i915/gt/selftest_lrc.c        |  43 ++-
> >>>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   6 +-
> >>>  drivers/gpu/drm/i915/i915_priolist_types.h    |   7 +-
> >>>  drivers/gpu/drm/i915/i915_request.h           |   4 +-
> >>>  drivers/gpu/drm/i915/i915_scheduler.c         | 322 ++++++++++++-----
> >>>  drivers/gpu/drm/i915/i915_scheduler.h         |  22 +-
> >>>  drivers/gpu/drm/i915/i915_scheduler_types.h   |  17 +
> >>>  .../drm/i915/selftests/i915_mock_selftests.h  |   1 +
> >>>  drivers/gpu/drm/i915/selftests/i915_request.c |   1 +
> >>>  .../gpu/drm/i915/selftests/i915_scheduler.c   |  49 +++
> >>>  16 files changed, 484 insertions(+), 362 deletions(-)
> >>>  create mode 100644 drivers/gpu/drm/i915/selftests/i915_scheduler.c
> >>
> >> Do we have timings to back this change up? Would it make sense to have a
> >> configurable scheduler choice?
>
> > Yes, there's igt/benchmarks/gem_wsim to show the impact on scheduling
> > decisions for various workloads. (You can guess what the impact of
> > choosing a different execution order and forcing more context switches
> > will be... About -1% to throughput with multiple clients.) And
> > igt/tests/gem_exec_schedule to test basic properties, with a bunch of new
> > fairness tests to try and decide if this is the right thing. Under
> > saturated conditions, there is no contest: a fair scheduler produces
> > consistent results, and the vdeadlines allow for realtime response under
> > load.
>
> Yeah, it's not really to convince me, but to provide a reference for the
> future.
>
> Perhaps add the gem_wsim timings to the commit message?

I don't like posting such benchmarks without saying how they can be
reproduced, or without absolute values against which to verify future
runs. Our rules are terrible: this trimmed-down set takes about a day to
run, and we've yet to convince people that this is a fundamental
requirement for CI. So it's really frustrating; the best we can try and
do is distil the essential requirements into pass/fail tests to be run on
debug kernels. At least we cover the pathological exploits, except for
where they are so bad that CI complains about them taking too long.

> >> There are multiple introductions of ktime_get() in the patch. Perhaps
> >> use a monotonic clock source like ktime_get_raw()? Also immediately
> >> convert to ns.
>
> > ktime_get() is monotonic. The only difference is that tkr_mono has a
> > wall-clock offset that tkr_raw does not. [I'm sure there's a good
> > reason.] The choice is really whether ktime_get_(mono|raw)_fast_ns() is
> > sufficient for our needs.
>
> Hmm. Yes, you're right. I was thinking about the NTP adjustments. But
> given the requirement below they might be unimportant.

I never know which is the right one to use :| Just follow the guideline
that the shortest function name is the intended one to use, unless you
understand why you should use one of the others.

> > I do like the idea of having the deadline being some recognisable
> > timestamp, as it makes it easier to play with mixing in real, albeit
> > soft, deadlines.
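For illustration, a minimal sketch of how a recognisable timestamp plus
the per-priority quota could become the virtual deadline that requests
are then ordered by. next_virtual_deadline() and rq_runs_before() are
made-up names for this discussion, not code from the patch; prio_slice()
and rq_deadline() are the helpers the patch adds.

#include <linux/ktime.h>	/* ktime_get_ns() */

/*
 * Hypothetical helper: stamp a request with its virtual deadline when it
 * becomes ready, using the monotonic clock plus the priority-derived
 * quota from prio_slice().
 */
static u64 next_virtual_deadline(int prio)
{
	/* readiness timestamp, in ns, from the monotonic clock */
	u64 now = ktime_get_ns();

	/* higher priority => smaller slice => earlier deadline */
	return now + prio_slice(prio);
}

/* Execution order is then simply earliest virtual deadline first. */
static bool rq_runs_before(const struct i915_request *a,
			   const struct i915_request *b)
{
	return rq_deadline(a) < rq_deadline(b);
}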
> >>> @@ -2837,10 +2788,7 @@ static void __execlists_unhold(struct i915_request *rq)
> >>>  	GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
> >>>
> >>>  	i915_request_clear_hold(rq);
> >>> -	list_move_tail(&rq->sched.link,
> >>> -		       i915_sched_lookup_priolist(rq->engine,
> >>> -						  rq_prio(rq)));
> >>> -	set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> >>> +	submit |= intel_engine_queue_request(rq->engine, rq);
> >>
> >> As someone new to this codebase, I immediately wonder whether that
> >> bitwise or is intentional and whether you got the short-circuiting
> >> right. It looks correct to me.
>
> > bool submit, not many bits :)
>
> Yes, the code is correct. My question was related to whether it was
> accepted practice, considering that a future reader may think it might
> have been a mistake and change it anyway....

submit |= ... vs if (...) submit = true

I feel I have used both variants in this patch.

> >>>  {
> >>>  	/*
> >>>  	 * When in use during submission, we are protected by a guarantee that
> >>> diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
> >>> index 4c189b81cc62..30bcb6f9d99f 100644
> >>> --- a/drivers/gpu/drm/i915/i915_scheduler.c
> >>> +++ b/drivers/gpu/drm/i915/i915_scheduler.c
> >>> @@ -20,6 +20,11 @@ static struct i915_global_scheduler {
> >>>  static DEFINE_SPINLOCK(ipi_lock);
> >>>  static LIST_HEAD(ipi_list);
> >>>
> >>> +static inline u64 rq_deadline(const struct i915_request *rq)
> >>> +{
> >>> +	return READ_ONCE(rq->sched.deadline);
> >>> +}
> >>> +
> >>
> >> Does this need a release barrier paired with an acquire barrier in
> >> __i915_request_set_deadline below?
>
> > No, the state can be inconsistent. If it changes as we are processing
> > the previous value, there will be another reschedule. Within
> > set_deadline, rq->sched.deadline is under the engine->active.lock; it is
> > just that rq_deadline() is used to peek before we take the lock, as well
> > as shorthand within the critical section.
>
> OK, understood.

> >>> +static u64 prio_slice(int prio)
> >>>  {
> >>> -	const struct i915_request *inflight;
> >>> +	u64 slice;
> >>> +	int sf;
> >>>
> >>>  	/*
> >>> -	 * We only need to kick the tasklet once for the high priority
> >>> -	 * new context we add into the queue.
> >>> +	 * With a 1ms scheduling quantum:
> >>> +	 *
> >>> +	 *   MAX USER:  ~32us deadline
> >>> +	 *   0:         ~16ms deadline
> >>> +	 *   MIN_USER: 1000ms deadline
> >>>  	 */
> >>> -	if (prio <= engine->execlists.queue_priority_hint)
> >>> -		return;
> >>>
> >>> -	rcu_read_lock();
> >>> +	if (prio >= __I915_PRIORITY_KERNEL__)
> >>> +		return INT_MAX - prio;
> >>>
> >>> -	/* Nothing currently active? We're overdue for a submission! */
> >>> -	inflight = execlists_active(&engine->execlists);
> >>> -	if (!inflight)
> >>> -		goto unlock;
> >>> +	slice = __I915_PRIORITY_KERNEL__ - prio;
> >>> +	if (prio >= 0)
> >>> +		sf = 20 - 6;
> >>> +	else
> >>> +		sf = 20 - 1;
> >>> +
> >>> +	return slice << sf;
> >>> +}
> >>> +
> >>
> >> Is this the same deadline calculation as used in the BFS? Could you
> >> perhaps add a pointer to some documentation?
>
> > It is a heuristic. The scale factor in BFS is designed for a smaller
> > range and is not effective for passing our existing priority ordering
> > tests.
> >
> > The challenge is to pick something that is fair and that roughly matches
> > usage. It basically says that if client A submits 3 requests, then
> > clients B and C will be able to run before the later requests of client
> > A, so long as they are submitted within 16ms. Currently we get AAABC;
> > the vdeadlines turn that into ABCAA. So we would ideally like the quota
> > for each client to reflect their needs, so if client A needed all 3
> > requests within 16ms, it would have a vdeadline closer to 5ms (and so it
> > would compete for the GPU against other clients). Now with this less
> > strict priority system we can let normal userspace bump their
> > priorities, or we can use the average context runtime to try and adjust
> > priorities on the fly (i.e. do not use an unbiased quota). But I suspect
> > removing any fairness will skew the scheduler once more.
>
> OK, also for future reference, it would be good to have at least a
> subset of this documented somewhere!

Yeah. I think prio_slice is the best spot to try and explain why
different priorities have different quotas, and the impact.
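Maybe something along these lines as the comment block above
prio_slice(). A rough sketch only; the __I915_PRIORITY_KERNEL__ value of
~1024 used in the arithmetic below is an assumption for illustration,
not taken from the patch, and the exact endpoints depend on the real
priority constants.

/*
 * prio_slice() maps a priority to the quota (in ns) that is added to the
 * request's readiness timestamp to form its virtual deadline: the higher
 * the priority, the smaller the quota and the earlier the deadline.
 *
 * Worked example, assuming __I915_PRIORITY_KERNEL__ ~= 1024
 * (hypothetical value, one step above the maximum user priority):
 *
 *   prio = 0 (default):
 *     slice = 1024 - 0    = 1024, sf = 20 - 6 = 14
 *     quota = 1024 << 14  = 16,777,216 ns     ~= 16 ms
 *
 *   prio = MIN_USER (~ -1023):
 *     slice = 1024 + 1023 = 2047, sf = 20 - 1 = 19
 *     quota = 2047 << 19  ~= 1,073,000,000 ns ~= 1000 ms
 *
 *   prio = MAX_USER (~ 1023):
 *     slice = 1024 - 1023 = 1,    sf = 14
 *     quota = 1 << 14     = 16,384 ns, i.e. tens of microseconds
 *
 * So a default-priority client that submits a burst of requests only
 * claims the GPU ~16ms ahead of other clients; anything another client
 * submits within that window is interleaved rather than starved
 * (AAABC becomes ABCAA).
 */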
-Chris