Hi Andrzej and Chris,

On Wed, Sep 21, 2022 at 03:52:58PM +0200, Andrzej Hajda wrote:
> From: Chris Wilson <chris.p.wilson@xxxxxxxxx>
>
> When we submit a new pair of contexts to ELSP for execution, we start a
> timer by which point we expect the HW to have switched execution to the
> pending contexts. If the promotion to the new pair of contexts has not
> occurred, we declare the executing context to have hung and force the
> preemption to take place by resetting the engine and resubmitting the
> new contexts.
>
> This can lead to an unfair situation where almost all of the preemption
> timeout is consumed by the first context which just switches into the
> second context immediately prior to the timer firing and triggering the
> preemption reset (assuming that the timer interrupts before we process
> the CS events for the context switch). The second context hasn't yet had
> a chance to yield to the incoming ELSP (and send the ACK for the
> promotion) and so ends up being blamed for the reset.
>
> If we see that a context switch has occurred since setting the
> preemption timeout, but have not yet received the ACK for the ELSP
> promotion, rearm the preemption timer and check again. This is
> especially significant if the first context was not schedulable and so
> we used the shortest timer possible, greatly increasing the chance of
> accidentally blaming the second innocent context.
>
> Fixes: 3a7a92aba8fb ("drm/i915/execlists: Force preemption")
> Fixes: d12acee84ffb ("drm/i915/execlists: Cancel banned contexts on schedule-out")
> Reported-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> Cc: Andi Shyti <andi.shyti@xxxxxxxxxxxxxxx>
> Reviewed-by: Andrzej Hajda <andrzej.hajda@xxxxxxxxx>
> Tested-by: Andrzej Hajda <andrzej.hajda@xxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx> # v5.5+
> ---
> Hi,
>
> This patch is upstreamed from an internal branch. So I have removed
> R-B by Andi. Andi, let me know if your R-B still applies.

Yes, I know this patch and my r-b holds:

Reviewed-by: Andi Shyti <andi.shyti@xxxxxxxxxxxxxxx>

Anyway, thanks Chris for the comments and the clear explanation both
in the commit log and in between the code.

Andi

> Regards
> Andrzej
> ---
>  drivers/gpu/drm/i915/gt/intel_engine_types.h | 15 +++++++++++++
>  .../drm/i915/gt/intel_execlists_submission.c | 21 ++++++++++++++++++-
>  2 files changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> index 633a7e5dba3b4b..6b5d4ea22b673b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> @@ -165,6 +165,21 @@ struct intel_engine_execlists {
>  	 */
>  	struct timer_list preempt;
>  
> +	/**
> +	 * @preempt_target: active request at the time of the preemption request
> +	 *
> +	 * We force a preemption to occur if the pending contexts have not
> +	 * been promoted to active upon receipt of the CS ack event within
> +	 * the timeout. This timeout may be chosen based on the target,
> +	 * using a very short timeout if the context is no longer schedulable.
> +	 * That short timeout may not be applicable to other contexts, so
> +	 * if a context switch should happen before the preemption
> +	 * timeout, we may shoot early at an innocent context. To prevent this,
> +	 * we record which context was active at the time of the preemption
> +	 * request and only reset that context upon the timeout.
> +	 */
> +	const struct i915_request *preempt_target;
> +
>  	/**
>  	 * @ccid: identifier for contexts submitted to this engine
>  	 */
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index 4b909cb88cdfb7..c718e6dc40b515 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -1241,6 +1241,9 @@ static unsigned long active_preempt_timeout(struct intel_engine_cs *engine,
>  	if (!rq)
>  		return 0;
>  
> +	/* Only allow ourselves to force reset the currently active context */
> +	engine->execlists.preempt_target = rq;
> +
>  	/* Force a fast reset for terminated contexts (ignoring sysfs!) */
>  	if (unlikely(intel_context_is_banned(rq->context) || bad_request(rq)))
>  		return INTEL_CONTEXT_BANNED_PREEMPT_TIMEOUT_MS;
> @@ -2427,8 +2430,24 @@ static void execlists_submission_tasklet(struct tasklet_struct *t)
>  	GEM_BUG_ON(inactive - post > ARRAY_SIZE(post));
>  
>  	if (unlikely(preempt_timeout(engine))) {
> +		const struct i915_request *rq = *engine->execlists.active;
> +
> +		/*
> +		 * If after the preempt-timeout expired, we are still on the
> +		 * same active request/context as before we initiated the
> +		 * preemption, reset the engine.
> +		 *
> +		 * However, if we have processed a CS event to switch contexts,
> +		 * but not yet processed the CS event for the pending
> +		 * preemption, reset the timer allowing the new context to
> +		 * gracefully exit.
> +		 */
>  		cancel_timer(&engine->execlists.preempt);
> -		engine->execlists.error_interrupt |= ERROR_PREEMPT;
> +		if (rq == engine->execlists.preempt_target)
> +			engine->execlists.error_interrupt |= ERROR_PREEMPT;
> +		else
> +			set_timer_ms(&engine->execlists.preempt,
> +				     active_preempt_timeout(engine, rq));
>  	}
>  
>  	if (unlikely(READ_ONCE(engine->execlists.error_interrupt))) {
> -- 
> 2.34.1