Quoting Francisco Jerez (2020-03-23 22:30:13) > Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes: > > > Quoting Francisco Jerez (2020-03-20 22:14:51) > >> Francisco Jerez <currojerez@xxxxxxxxxx> writes: > >> > >> > Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes: > >> > > >> >> We dropped calling process_csb prior to handling direct submission in > >> >> order to avoid the nesting of spinlocks and lift process_csb() and the > >> >> majority of the tasklet out of irq-off. However, we do want to avoid > >> >> ksoftirqd latency in the fast path, so try and pull the interrupt-bh > >> >> local to direct submission if we can acquire the tasklet's lock. > >> >> > >> >> v2: Tweak the balance to avoid over submitting lite-restores > >> >> > >> >> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > >> >> Cc: Francisco Jerez <currojerez@xxxxxxxxxx> > >> >> Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> > >> >> --- > >> >> drivers/gpu/drm/i915/gt/intel_lrc.c | 44 ++++++++++++++++++++------ > >> >> drivers/gpu/drm/i915/gt/selftest_lrc.c | 2 +- > >> >> 2 files changed, 36 insertions(+), 10 deletions(-) > >> >> > >> >> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c > >> >> index f09dd87324b9..dceb65a0088f 100644 > >> >> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c > >> >> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c > >> >> @@ -2884,17 +2884,17 @@ static void queue_request(struct intel_engine_cs *engine, > >> >> set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags); > >> >> } > >> >> > >> >> -static void __submit_queue_imm(struct intel_engine_cs *engine) > >> >> +static bool pending_csb(const struct intel_engine_execlists *el) > >> >> { > >> >> - struct intel_engine_execlists * const execlists = &engine->execlists; > >> >> + return READ_ONCE(*el->csb_write) != READ_ONCE(el->csb_head); > >> >> +} > >> >> > >> >> - if (reset_in_progress(execlists)) > >> >> - return; /* defer until we restart the engine following reset */ > >> >> +static bool skip_lite_restore(struct intel_engine_execlists *el, > >> >> + const struct i915_request *rq) > >> >> +{ > >> >> + struct i915_request *inflight = execlists_active(el); > >> >> > >> >> - if (execlists->tasklet.func == execlists_submission_tasklet) > >> >> - __execlists_submission_tasklet(engine); > >> >> - else > >> >> - tasklet_hi_schedule(&execlists->tasklet); > >> >> + return inflight && inflight->context == rq->context; > >> >> } > >> >> > >> >> static void submit_queue(struct intel_engine_cs *engine, > >> >> @@ -2905,8 +2905,34 @@ static void submit_queue(struct intel_engine_cs *engine, > >> >> if (rq_prio(rq) <= execlists->queue_priority_hint) > >> >> return; > >> >> > >> >> + if (reset_in_progress(execlists)) > >> >> + return; /* defer until we restart the engine following reset */ > >> >> + > >> >> + /* > >> >> + * Suppress immediate lite-restores, leave that to the tasklet. > >> >> + * > >> >> + * However, we leave the queue_priority_hint unset so that if we do > >> >> + * submit a second context, we push that into ELSP[1] immediately. > >> >> + */ > >> >> + if (skip_lite_restore(execlists, rq)) > >> >> + return; > >> >> + > >> > Why do you need to treat lite-restore specially here? > > > > Lite-restore have a noticeable impact on no-op loads. A part of that is > > that a lite-restore is about 1us, and the other part is that the driver > > has a lot more work to do. There's a balance point around here for not > > needlessly interrupting ourselves and ensuring that there is no bubble. > > > > Oh, I see. But isn't inhibiting the lite restore likely to be fairly > costly in some cases as well if it causes the GPU to go idle after the > current context completes for as long as it takes the CPU to wake up, > process the IRQ and dequeue the next request? Yes. It's an amusing diversion to try and think how can we do a single context submission (well 2) for a sequence of requests, for those clients that like to submit N batches at once. From an idle GPU, assuming each batch is non-trivial, we want to do something like submit batch 1, accumulate, then submit batches 1-N, i.e. skip the intervening lite-restores but complete the submission with all the work queued. > Would it make sense to > inhibit lite-restore in roughly the same conditions I set the overload > flag? (since that indicates we'll get an IRQ at least one request > *before* the GPU actually goes idle, so there shouldn't be any penalty > from inhibiting lite restore). Yes/no. Once we have multiple contexts scheduled, we won't be doing lite restores. The key problem is that the irq/tasklet-bh latency is unpredictable. Only submitting one context at a time costs about 1% in [multi-context] transcode throughput. And the cost of lite-restore on that scale is barely noticeable. So it annoys me that we can measure the impact of lite-restore on nop-throughput, but in reality we have a self-inflicted regression on top of the lite-restore that caught my eye. Since we don't resubmit more contexts until we receive an ack from the HW, the latency in processing that ack actually allowed us to accumulate more nops into a single submission. Process that ack earlier and we start submitting each nop individually. But we do want to process that ack earlier as we know we are handling a request that should be pushed to the GPU immediately. [The age old adage, batching submissions is good for throughput, bad for latency. And I have to pinch myself that throughput is easier to measure, but latency is the crucial metric.] > >> > Anyway, trying this out now in combination with my patches now. > >> > > >> > >> This didn't seem to help (together with your other suggestion to move > >> the overload accounting to __execlists_schedule_in/out). And it makes > >> the current -5% SynMark OglMultithread regression with my series go down > >> to -10%. My previous suggestion of moving the > >> intel_gt_pm_active_begin() call to process_csb() when the submission is > >> ACK'ed by the hardware does seem to help (and it roughly halves the > >> OglMultithread regression), possibly because that way we're able to > >> determine whether the first context was actually overlapping at the > >> point that the second was received by the hardware -- I haven't tested > >> it extensively yet though. > > > > Grumble, it just seems like we are setting and clearing the flag on > > completely unrelated events -- which I still think boils down to working > > around latency in the driver. Or at least I hope there's an explanation > > and bug to fix that improves responsiveness for all. > > -Chris > > There *might* be a performance issue somewhere introducing latency that > the instrumentation I added happens to mitigate, but isn't that a sign > that it's fulfilling its purpose of determining when the workload could > be sensitive to CPU latency? We have latency issues in the tasklet submission. The irq-off lock contention along the submission path [execlists_dequeue] is perhaps the biggest issue in the driver (at least from lockstats perspective). My expectation is that the delay in processing the CSB is affecting the 'overload' heuristic. > Maybe I didn't explain the idea properly: Given that command submission > is asynchronous with the CS processing the previous context, there is no > way for us to tell whether a request we just submitted was actually > overlapping with the previous one until we read the CSB and see whether > it led to an idle-to-active transition. Only then can we assume that > the CPU is sending commands to the GPU quickly enough to keep it busy > without interruption. That sounds like you just want to use the C0 counters. But if you want to use the active/idle state as your heuristic, then you want something like diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c index 3102159d2b3b..3e08ecd53ecb 100644 --- a/drivers/gpu/drm/i915/gt/intel_lrc.c +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c @@ -2443,6 +2443,8 @@ static void process_csb(struct intel_engine_cs *engine) GEM_BUG_ON(execlists->active - execlists->inflight > execlists_num_ports(execlists)); + + WRITE_ONCE(execlists->overload, !!*execlists->active); } } while (head != tail); That will be set to true when we have enough work to keep the GPU busy into the second context, and only be set to false when the GPU idles. However, just because we switched contexts does not necessarily imply that the previous context did substantial work. And there may be lots of fun with timeslicing preemptions that do not let a single context run to completion. So perhaps, diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c index 3102159d2b3b..05bb533d556c 100644 --- a/drivers/gpu/drm/i915/gt/intel_lrc.c +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c @@ -2340,6 +2340,7 @@ static void process_csb(struct intel_engine_cs *engine) rmb(); ENGINE_TRACE(engine, "cs-irq head=%d, tail=%d\n", head, tail); + bool overload = *execlists->active; do { bool promote; @@ -2443,8 +2444,11 @@ static void process_csb(struct intel_engine_cs *engine) GEM_BUG_ON(execlists->active - execlists->inflight > execlists_num_ports(execlists)); + + overload = *execlists->overload; /* active->idle? */ } } while (head != tail); + WRITE_ONCE(execlists->overload, overload && *execlists->overload); which sets the overload flag if we enter with an active context and leave with another active context, without going through an idle state. But that will set overload for timeslicing and for high priority preemption. (Now I have to ask whether that's what you had before :) > You might argue that this will introduce a delay in the signalling of > overload roughly equal to the latency it takes for the CPU to receive > the execlists IRQ with the hardware ACK. However that seems beneficial > since the clearing of overload suffers from the same latency, so the > fraction of time that overload is signalled will otherwise be biased as > a result of the latency difference, causing overload to be overreported > on the average. Delaying the signalling of overload to the CSB handler > means that any systematic latency in our interrupt processing is > self-correcting. That only worries me if we are trying to balance decisions made on submission with the async ack. If we can solely use the CSB events that worry is moot. > Anyway, I'm open to other suggestions if you have other ideas that at > least don't worsen the pre-existing regression from my series. Likewise, if the best is the current overload semantics then so be it. -Chris --------------------------------------------------------------------- Intel Corporation (UK) Limited Registered No. 1134945 (England) Registered Office: Pipers Way, Swindon SN3 1RJ VAT No: 860 2173 47 This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx