Quoting Tvrtko Ursulin (2018-06-27 11:40:32)
>
> On 25/06/2018 10:48, Chris Wilson wrote:
> > Back in commit 27af5eea54d1 ("drm/i915: Move execlists irq handler to a
> > bottom half"), we came to the conclusion that running our CSB processing
> > and ELSP submission from inside the irq handler was a bad idea. A really
> > bad idea as we could impose nearly 1s latency on other users of the
> > system, on average! Deferring our work to a tasklet allowed us to do the
> > processing with irqs enabled, reducing the impact to an average of about
> > 50us.
> >
> > We have since eradicated the use of forcewaked mmio from inside the CSB
> > processing and ELSP submission, bringing the impact down to around 5us
> > (on Kabylake); an order of magnitude better than our measurements 2
> > years ago on Broadwell and only about 2x worse on average than the
> > gem_syslatency on an unladen system.
> >
> > In this iteration of the tasklet-vs-direct submission debate, we seek a
> > compromise whereby we submit new requests immediately to the HW but
> > defer processing the CS interrupt onto a tasklet. We gain the advantage
> > of low-latency and ksoftirqd avoidance when waking up the HW, while
> > avoiding the system-wide starvation of our CS irq-storms.
> >
> > Comparing the impact on the maximum latency observed (that is the time
> > stolen from an RT process) over a 120s interval, repeated several times
> > (using gem_syslatency, similar to RT's cyclictest) while the system is
> > fully laden with i915 nops, we see that direct submission can actually
> > improve the worst case.
> >
> > Maximum latency in microseconds of a third party RT thread
> > (gem_syslatency -t 120 -f 2)
> > x Always using tasklets (a couple of >1000us outliers removed)
> > + Only using tasklets from CS irq, direct submission of requests
> > +------------------------------------------------------------------------+
> > |             +                                                          |
> > |             +                                                          |
> > |             +                                                          |
> > |    +        +                                                          |
> > |    + +      +                                                          |
> > |    + +      + +          x     x  x                                    |
> > | +++ +       + +          x  x  x  x     x  x                           |
> > | +++ +   ++  + +         *x  x  x  x     x  x                           |
> > | +++ +   ++  + *         *x  x  *  x     x  x                           |
> > |+  +++   +  ++ *        * +*xxx      *  x  x  xx                        |
> > |*  +++ + ++++*          *x+**xx+     *  x  x  xxxx x                    |
> > |  **x++++*++**+*x*x****x+ *     +x      xx xxxx   x  x                  |
> > |x* ******+***************++*+***xxxxxx* xx*x xxx          +           x+|
> > |                       |__________MA___________|                        |
> > |     |______M__A________|                                               |
> > +------------------------------------------------------------------------+
> >     N           Min           Max        Median           Avg        Stddev
> > x 118            91           186           124     125.28814     16.279137
> > + 120            92           187           109     112.00833     13.458617
> > Difference at 95.0% confidence
> >         -13.2798 +/- 3.79219
> >         -10.5994% +/- 3.02677%
> >         (Student's t, pooled s = 14.9237)
> >
> > However the mean latency is adversely affected:
> >
> > Mean latency in microseconds of a third party RT thread
> > (gem_syslatency -t 120 -f 1)
> > x Always using tasklets
> > + Only using tasklets from CS irq, direct submission of requests
> > +------------------------------------------------------------------------+
> > | xxxxxx                      +  ++                                      |
> > | xxxxxx                      +  ++                                      |
> > | xxxxxx                      +  +++   ++                                |
> > | xxxxxxx                     +++++    ++                                |
> > | xxxxxxx                     +++++    ++                                |
> > | xxxxxxx                     +++++    +++                               |
> > | xxxxxxx                     +        ++++++++++                        |
> > | xxxxxxxx                    ++       ++++++++++                        |
> > | xxxxxxxx                    ++       ++++++++++                        |
> > | xxxxxxxxxx                         +++++++++++++++                     |
> > | xxxxxxxxxxx        x               +++++++++++++++                     |
> > |x xxxxxxxxxxxxx     x        +  + ++++++++++++++++++                   +|
> > |    |__A__|                                                             |
> > |                                     |____A___|                         |
> > +------------------------------------------------------------------------+
> >     N           Min           Max        Median           Avg        Stddev
> > x 120         3.506         3.727         3.631     3.6321417    0.02773109
> > + 120         3.834         4.149         4.039     4.0375167   0.041221676
> > Difference at 95.0% confidence
> >         0.405375 +/- 0.00888913
> >         11.1608% +/- 0.244735%
> >         (Student's t, pooled s = 0.03513)
> >
> > However, since the mean latency corresponds to the amount of irqsoff
> > processing we have to do for a CS interrupt, we only need to speed that
> > up to benefit not just system latency but our own throughput.
> >
> > v2: Remember to defer submissions when under reset.
> > v4: Only use direct submission for new requests
> > v5: Be aware that with mixing direct tasklet evaluation and deferred
> > tasklets, we may end up idling before running the deferred tasklet.
> >
> > Testcase: igt/gem_exec_latency/*rthog*
> > References: 27af5eea54d1 ("drm/i915: Move execlists irq handler to a bottom half")
> > Suggested-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> > ---
> >   drivers/gpu/drm/i915/i915_gem.h         |   5 +
> >   drivers/gpu/drm/i915/i915_irq.c         |  11 +-
> >   drivers/gpu/drm/i915/intel_engine_cs.c  |   8 +-
> >   drivers/gpu/drm/i915/intel_lrc.c        | 147 ++++++++++++++----------
> >   drivers/gpu/drm/i915/intel_ringbuffer.h |   1 -
> >   5 files changed, 98 insertions(+), 74 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
> > index 261da577829a..7892ac773916 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.h
> > +++ b/drivers/gpu/drm/i915/i915_gem.h
> > @@ -88,4 +88,9 @@ static inline void __tasklet_enable_sync_once(struct tasklet_struct *t)
> >  		tasklet_kill(t);
> >   }
> >
> > +static inline bool __tasklet_is_enabled(const struct tasklet_struct *t)
> > +{
> > +	return likely(!atomic_read(&t->count));
> > +}
> > +
>
> For the unlikely-likely chain from
> __submit_queue->reset_in_progress->__tasklet_is_enabled I think it would
> be better to drop the likely/unlikely from low-level helpers and put the
> one unlikely into the __submit_queue.

Tasklets are rarely disabled, I think that's quite important to stress.
Tasklets do not function very well (heavy spinning) while disabled.
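
For illustration, hoisting the hint as you suggest would look something
like this (a sketch only; reset_in_progress() here is the wrapper this
patch later uses from __submit_queue()):

	/* Plain low-level helper, no branch hint baked in. */
	static inline bool __tasklet_is_enabled(const struct tasklet_struct *t)
	{
		return !atomic_read(&t->count);
	}

	static inline bool
	reset_in_progress(const struct intel_engine_execlists *execlists)
	{
		/* The tasklet is only disabled across an engine reset. */
		return unlikely(!__tasklet_is_enabled(&execlists->tasklet));
	}
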
> >   #endif /* __I915_GEM_H__ */
> >
> > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> > index 46aaef5c1851..316d0b08d40f 100644
> > --- a/drivers/gpu/drm/i915/i915_irq.c
> > +++ b/drivers/gpu/drm/i915/i915_irq.c
> > @@ -1469,14 +1469,10 @@ static void snb_gt_irq_handler(struct drm_i915_private *dev_priv,
> >   static void
> >   gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
> >   {
> > -	struct intel_engine_execlists * const execlists = &engine->execlists;
> >   	bool tasklet = false;
> >
> > -	if (iir & GT_CONTEXT_SWITCH_INTERRUPT) {
> > -		if (READ_ONCE(engine->execlists.active))
>
> What is the thinking behind this change? It used to be that we scheduled
> the tasklet only when we knew we were expecting interrupts, and now we
> don't care any more for some reason?

The filtering is done inside process_csb(). We filtered on active
previously as some interrupts were seemingly going astray; now I am much
more confident that all are accounted for.

> > -			tasklet = !test_and_set_bit(ENGINE_IRQ_EXECLIST,
> > -						    &engine->irq_posted);
>
> And this is gone as well. Can you put a paragraph in the commit message
> explaining the change? It doesn't seem immediately connected with direct
> submission.

Removing one heavyweight atomic operation in the latency sensitive
interrupt.
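
For reference, tasklet_hi_schedule() already performs its own
test_and_set_bit() on TASKLET_STATE_SCHED, so the irq_posted dance was a
second locked rmw on every CS interrupt (from include/linux/interrupt.h):

	static inline void tasklet_hi_schedule(struct tasklet_struct *t)
	{
		if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
			__tasklet_hi_schedule(t);
	}
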
> > +	if (iir & GT_CONTEXT_SWITCH_INTERRUPT)
> > +		tasklet = true;
> >
> >   	if (iir & GT_RENDER_USER_INTERRUPT) {
> >   		notify_ring(engine);
> > @@ -1484,7 +1480,7 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
> >   	}
> >
> >   	if (tasklet)
> > -		tasklet_hi_schedule(&execlists->tasklet);
> > +		tasklet_hi_schedule(&engine->execlists.tasklet);
> >   }
> >
> >   static void gen8_gt_irq_ack(struct drm_i915_private *i915,
> > @@ -2216,7 +2212,6 @@ static irqreturn_t cherryview_irq_handler(int irq, void *arg)
> >
> >   		I915_WRITE(VLV_IER, ier);
> >   		I915_WRITE(GEN8_MASTER_IRQ, GEN8_MASTER_IRQ_CONTROL);
> > -		POSTING_READ(GEN8_MASTER_IRQ);
>
> What is this?

Something that I haven't managed to kill yet.

> > diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> > index 7209c22798e6..ace93958689e 100644
> > --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> > +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> > @@ -1353,12 +1353,10 @@ static void intel_engine_print_registers(const struct intel_engine_cs *engine,
> >   		ptr = I915_READ(RING_CONTEXT_STATUS_PTR(engine));
> >   		read = GEN8_CSB_READ_PTR(ptr);
> >   		write = GEN8_CSB_WRITE_PTR(ptr);
> > -		drm_printf(m, "\tExeclist CSB read %d [%d cached], write %d [%d from hws], interrupt posted? %s, tasklet queued? %s (%s)\n",
> > +		drm_printf(m, "\tExeclist CSB read %d [%d cached], write %d [%d from hws], tasklet queued? %s (%s)\n",
> >   			   read, execlists->csb_head,
> >   			   write,
> >   			   intel_read_status_page(engine, intel_hws_csb_write_index(engine->i915)),
> > -			   yesno(test_bit(ENGINE_IRQ_EXECLIST,
> > -					  &engine->irq_posted)),
> >   			   yesno(test_bit(TASKLET_STATE_SCHED,
> >   					  &engine->execlists.tasklet.state)),
> >   			   enableddisabled(!atomic_read(&engine->execlists.tasklet.count)));
> > @@ -1570,11 +1568,9 @@ void intel_engine_dump(struct intel_engine_cs *engine,
> >   	spin_unlock(&b->rb_lock);
> >   	local_irq_restore(flags);
> >
> > -	drm_printf(m, "IRQ? 0x%lx (breadcrumbs? %s) (execlists? %s)\n",
> > +	drm_printf(m, "IRQ? 0x%lx (breadcrumbs? %s)\n",
> >   		   engine->irq_posted,
> >   		   yesno(test_bit(ENGINE_IRQ_BREADCRUMB,
> > -				  &engine->irq_posted)),
> > -		   yesno(test_bit(ENGINE_IRQ_EXECLIST,
> >   				  &engine->irq_posted)));
> >
> >   	drm_printf(m, "HWSP:\n");
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > index 5a12b8fc9d8f..c82efa3ac105 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -562,13 +562,15 @@ static void complete_preempt_context(struct intel_engine_execlists *execlists)
> >   {
> >   	GEM_BUG_ON(!execlists_is_active(execlists, EXECLISTS_ACTIVE_PREEMPT));
> >
> > +	__unwind_incomplete_requests(container_of(execlists,
> > +						  typeof(struct intel_engine_cs),
> > +						  execlists));
> >   	execlists_cancel_port_requests(execlists);
> > -	execlists_unwind_incomplete_requests(execlists);
>
> Is the ordering change significant and why?

Mostly for consistency and reasoning about request reference lifetimes.
(Unwind => we retain the request reference, as it is moved back to the
protected execution lists.)

> > +static void execlists_submission_tasklet(unsigned long data)
> > +{
> > +	struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
> > +	unsigned long flags;
> > +
> > +	GEM_TRACE("%s awake?=%d, active=%x\n",
> > +		  engine->name,
> > +		  engine->i915->gt.awake,
> > +		  engine->execlists.active);
> > +
> > +	spin_lock_irqsave(&engine->timeline.lock, flags);
> > +
> > +	if (engine->i915->gt.awake) /* we may be delayed until after we idle! */
> > +		__execlists_submission_tasklet(engine);
>
> Sounds quite bad! This means we fail to process pending CSB. And going
> idle syncs the tasklets, so what am I missing?

That tasklets get kicked randomly, I think, was the culprit.

> > +static void __submit_queue(struct intel_engine_cs *engine)
> > +{
> > +	struct intel_engine_execlists * const execlists = &engine->execlists;
> > +
> > +	if (reset_in_progress(execlists))
> > +		return; /* defer until we restart the engine following reset */
> > +
> > +	if (execlists->tasklet.func == execlists_submission_tasklet)
>
> What is this check determining?

That we can call __execlists_submission_tasklet, i.e. it is not guc,
veng or anything weirder.

> Are we always calling it directly even if the ports are busy? Wouldn't
> it be better to schedule in that case?

No, we are only calling it directly if we have a more important request
than either port (see queue_priority).
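
That is, the priority check is done by the caller before we ever reach
__submit_queue(); roughly (a sketch, the exact form is in the full diff):

	static void submit_queue(struct intel_engine_cs *engine, int prio)
	{
		/*
		 * Only kick the submission (directly or via the tasklet) if
		 * the new request is more important than anything currently
		 * in the ELSP ports; otherwise the in-flight context switch
		 * will pick it up from the queue in due course.
		 */
		if (prio > engine->execlists.queue_priority) {
			engine->execlists.queue_priority = prio;
			__submit_queue(engine);
		}
	}
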
> > +static void reset_csb_pointers(struct intel_engine_execlists *execlists)
> > +{
> > +	/*
> > +	 * After a reset, the HW starts writing into CSB entry [0]. We
> > +	 * therefore have to set our HEAD pointer back one entry so that
> > +	 * the *first* entry we check is entry 0. To complicate this further,
> > +	 * as we don't wait for the first interrupt after reset, we have to
> > +	 * fake the HW write to point back to the last entry so that our
> > +	 * inline comparison of our cached head position against the last HW
> > +	 * write works even before the first interrupt.
> > +	 */
> > +	execlists->csb_head = GEN8_CSB_ENTRIES - 1;
> > +	WRITE_ONCE(*execlists->csb_write, (GEN8_CSB_ENTRIES - 1) | 0xff << 16);
> > +}
>
> Hmm, this makes me think there should be another prep patch before direct
> submission. Need to build a clearer picture before I can suggest how.

For what? This was already in the prep patches (handling CSB on reset vs
resume paths, and all the fake fallout). I don't actually need it in a
function; it was just handy to do so as iirc I wanted to use it
elsewhere, but fortunately killed off that caller. So the prep patch is
just making it into a function.
-Chris