On Wed, Nov 24, 2021 at 08:55:39AM -0500, Rodrigo Vivi wrote: > On Wed, Nov 24, 2021 at 08:56:52AM +0000, Tvrtko Ursulin wrote: > > > > On 23/11/2021 19:52, Rodrigo Vivi wrote: > > > On Tue, Nov 23, 2021 at 09:39:25AM +0000, Tvrtko Ursulin wrote: > > > > > > > > On 17/11/2021 22:49, Vinay Belgaumkar wrote: > > > > > From: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > > > > > > > > > > Everytime we come to the end of a virtual engine's context, re-randomise > > > > > it's siblings[]. As we schedule the siblings' tasklets in the order they > > > > > are in the array, earlier entries are executed first (when idle) and so > > > > > will be preferred when scheduling the next virtual request. Currently, > > > > > we only update the array when switching onto a new idle engine, so we > > > > > prefer to stick on the last execute engine, keeping the work compact. > > > > > However, it can be beneficial to spread the work out across idle > > > > > engines, so choose another sibling as our preferred target at the end of > > > > > the context's execution. > > > > > > > > This partially brings back, from a different angle, the more dynamic > > > > scheduling behavior which has been lost since bugfix 90a987205c6c > > > > ("drm/i915/gt: Only swap to a random sibling once upon creation"). > > > > > > Shouldn't we use the Fixes tag here since this is targeting to fix one > > > of the performance regressions of this patch? > > > > Probably not but hard to say. Note that it wasn't a performance regression > > that was reported but power. > > > > And to go back to what we said elsewhere in the thread, I am actually with > > you in thinking that in the ideal world we need PnP testing across a variety > > of workloads and platforms. And "in the ideal world" should really be in the > > normal world. It is not professional to be reactive to isolated bug reports > > from users, without being able to see the overall picture. > > We surely need to address the bug report from users. I'm just asking to address > that with the smallest fix that we can backport and fit to the products milestones. > > Instead, we are creating another optimization feature on a rush. Without a proper > validation. > > I believe it is too risk to add an algorithm like that without a broader test. > I see a big risk of introducing corner cases that will results in more bug report > from other users in a near future. > > So, let's all be professionals and provide a smaller fix for a regression on > the load balancing scenario and provide a better validation with more data > to justify this new feature. Okay, after more IRC discussions I see that patch 2 is also part of the solution and probably safe. Let me be clear that my biggest complain and the risk is with race-to-idle in patch 3 on trying to predict the rc6 behavior and increasing the freq to try to complete job faster and then get to rc6 faster... That one would need a lot more validation. > > Thanks, > Rodrigo. > > > > > > > One day we could experiment with using engine busyness as criteria (instead > > > > of random). Back in the day busyness was kind of the best strategy, although > > > > sampled at submit, not at the trailing edge like here, but it still may be > > > > able to settle down to engine configuration better in some scenarios. Only > > > > testing could say. > > > > > > > > Still, from memory random also wasn't that bad so this should be okay for > > > > now. > > > > > > > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx> > > > > > > Since you reviewed and it looks to be a middle ground point in terms > > > of when to balancing (always like in the initial implementation vs > > > only once like the in 90a987205c6c). > > > > > > If this one is really fixing the regression by itself: > > > Acked-by: Rodrigo Vivi <rodrigo.vivi@xxxxxxxxx> > > > on this patch here. > > > > > > But I still don't want to take the risk with touching the freq with > > > race to idle, until not convinced that it is absolutely needed and > > > that we are not breaking the world out there. > > > > Yes agreed in principle, we have users with different priorities. > > > > However the RPS patches in the series, definitely the 1st one which looks at > > classes versus individual engines, sound plausible to me. Given the absence > > of automated PnP testing mentioned above, in the past it was usually Chris > > who was making the above and beyond effort to evaluate changes like these on > > as many platforms as he could, and with different workloads. Not sure who > > has the mandate and drive to fill that space but something will need to > > happen. > > > > Regards, > > > > Tvrtko > > > > > > > > > > Regards, > > > > > > > > Tvrtko > > > > > > > > > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > > > > > Cc: Vinay Belgaumkar <vinay.belgaumkar@xxxxxxxxx> > > > > > Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> > > > > > --- > > > > > .../drm/i915/gt/intel_execlists_submission.c | 80 ++++++++++++------- > > > > > 1 file changed, 52 insertions(+), 28 deletions(-) > > > > > > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c > > > > > index ca03880fa7e4..b95bbc8fb91a 100644 > > > > > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c > > > > > @@ -539,6 +539,41 @@ static void execlists_schedule_in(struct i915_request *rq, int idx) > > > > > GEM_BUG_ON(intel_context_inflight(ce) != rq->engine); > > > > > } > > > > > +static void virtual_xfer_context(struct virtual_engine *ve, > > > > > + struct intel_engine_cs *engine) > > > > > +{ > > > > > + unsigned int n; > > > > > + > > > > > + if (likely(engine == ve->siblings[0])) > > > > > + return; > > > > > + > > > > > + if (!intel_engine_has_relative_mmio(engine)) > > > > > + lrc_update_offsets(&ve->context, engine); > > > > > + > > > > > + /* > > > > > + * Move the bound engine to the top of the list for > > > > > + * future execution. We then kick this tasklet first > > > > > + * before checking others, so that we preferentially > > > > > + * reuse this set of bound registers. > > > > > + */ > > > > > + for (n = 1; n < ve->num_siblings; n++) { > > > > > + if (ve->siblings[n] == engine) { > > > > > + swap(ve->siblings[n], ve->siblings[0]); > > > > > + break; > > > > > + } > > > > > + } > > > > > +} > > > > > + > > > > > +static int ve_random_sibling(struct virtual_engine *ve) > > > > > +{ > > > > > + return prandom_u32_max(ve->num_siblings); > > > > > +} > > > > > + > > > > > +static int ve_random_other_sibling(struct virtual_engine *ve) > > > > > +{ > > > > > + return 1 + prandom_u32_max(ve->num_siblings - 1); > > > > > +} > > > > > + > > > > > static void > > > > > resubmit_virtual_request(struct i915_request *rq, struct virtual_engine *ve) > > > > > { > > > > > @@ -578,8 +613,23 @@ static void kick_siblings(struct i915_request *rq, struct intel_context *ce) > > > > > rq->execution_mask != engine->mask) > > > > > resubmit_virtual_request(rq, ve); > > > > > - if (READ_ONCE(ve->request)) > > > > > + /* > > > > > + * Reschedule with a new "preferred" sibling. > > > > > + * > > > > > + * The tasklets are executed in the order of ve->siblings[], so > > > > > + * siblings[0] receives preferrential treatment of greedily checking > > > > > + * for execution of the virtual engine. At this point, the virtual > > > > > + * engine is no longer in the current GPU cache due to idleness or > > > > > + * contention, so it can be executed on any without penalty. We > > > > > + * re-randomise at this point in order to spread light loads across > > > > > + * the system, heavy overlapping loads will continue to be greedily > > > > > + * executed by the first available engine. > > > > > + */ > > > > > + if (READ_ONCE(ve->request)) { > > > > > + virtual_xfer_context(ve, > > > > > + ve->siblings[ve_random_other_sibling(ve)]); > > > > > tasklet_hi_schedule(&ve->base.sched_engine->tasklet); > > > > > + } > > > > > } > > > > > static void __execlists_schedule_out(struct i915_request * const rq, > > > > > @@ -1030,32 +1080,6 @@ first_virtual_engine(struct intel_engine_cs *engine) > > > > > return NULL; > > > > > } > > > > > -static void virtual_xfer_context(struct virtual_engine *ve, > > > > > - struct intel_engine_cs *engine) > > > > > -{ > > > > > - unsigned int n; > > > > > - > > > > > - if (likely(engine == ve->siblings[0])) > > > > > - return; > > > > > - > > > > > - GEM_BUG_ON(READ_ONCE(ve->context.inflight)); > > > > > - if (!intel_engine_has_relative_mmio(engine)) > > > > > - lrc_update_offsets(&ve->context, engine); > > > > > - > > > > > - /* > > > > > - * Move the bound engine to the top of the list for > > > > > - * future execution. We then kick this tasklet first > > > > > - * before checking others, so that we preferentially > > > > > - * reuse this set of bound registers. > > > > > - */ > > > > > - for (n = 1; n < ve->num_siblings; n++) { > > > > > - if (ve->siblings[n] == engine) { > > > > > - swap(ve->siblings[n], ve->siblings[0]); > > > > > - break; > > > > > - } > > > > > - } > > > > > -} > > > > > - > > > > > static void defer_request(struct i915_request *rq, struct list_head * const pl) > > > > > { > > > > > LIST_HEAD(list); > > > > > @@ -3590,7 +3614,7 @@ static void virtual_engine_initial_hint(struct virtual_engine *ve) > > > > > * NB This does not force us to execute on this engine, it will just > > > > > * typically be the first we inspect for submission. > > > > > */ > > > > > - swp = prandom_u32_max(ve->num_siblings); > > > > > + swp = ve_random_sibling(ve); > > > > > if (swp) > > > > > swap(ve->siblings[swp], ve->siblings[0]); > > > > > } > > > > >