Re: [Intel-gfx] [PATCH 1/3] drm/i915/gt: Spread virtual engines over idle engines

Rodrigo Vivi <rodrigo.vivi@xxxxxxxxx> · Wed, 24 Nov 2021 10:09:08 -0500

On Wed, Nov 24, 2021 at 08:55:39AM -0500, Rodrigo Vivi wrote:
> On Wed, Nov 24, 2021 at 08:56:52AM +0000, Tvrtko Ursulin wrote:
> > 
> > On 23/11/2021 19:52, Rodrigo Vivi wrote:
> > > On Tue, Nov 23, 2021 at 09:39:25AM +0000, Tvrtko Ursulin wrote:
> > > > 
> > > > On 17/11/2021 22:49, Vinay Belgaumkar wrote:
> > > > > From: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > > > > 
> > > > > Everytime we come to the end of a virtual engine's context, re-randomise
> > > > > it's siblings[]. As we schedule the siblings' tasklets in the order they
> > > > > are in the array, earlier entries are executed first (when idle) and so
> > > > > will be preferred when scheduling the next virtual request. Currently,
> > > > > we only update the array when switching onto a new idle engine, so we
> > > > > prefer to stick on the last execute engine, keeping the work compact.
> > > > > However, it can be beneficial to spread the work out across idle
> > > > > engines, so choose another sibling as our preferred target at the end of
> > > > > the context's execution.
> > > > 
> > > > This partially brings back, from a different angle, the more dynamic
> > > > scheduling behavior which has been lost since bugfix 90a987205c6c
> > > > ("drm/i915/gt: Only swap to a random sibling once upon creation").
> > > 
> > > Shouldn't we use the Fixes tag here since this is targeting to fix one
> > > of the performance regressions of this patch?
> > 
> > Probably not but hard to say. Note that it wasn't a performance regression
> > that was reported but power.
> > 
> > And to go back to what we said elsewhere in the thread, I am actually with
> > you in thinking that in the ideal world we need PnP testing across a variety
> > of workloads and platforms. And "in the ideal world" should really be in the
> > normal world. It is not professional to be reactive to isolated bug reports
> > from users, without being able to see the overall picture.
> 
> We surely need to address the bug report from users. I'm just asking to address
> that with the smallest fix that we can backport and fit to the products milestones.
> 
> Instead, we are creating another optimization feature on a rush. Without a proper
> validation.
> 
> I believe it is too risk to add an algorithm like that without a broader test.
> I see a big risk of introducing corner cases that will results in more bug report
> from other users in a near future.
> 
> So, let's all be professionals and provide a smaller fix for a regression on
> the load balancing scenario and provide a better validation with more data
> to justify this new feature.

Okay, after more IRC discussions I see that patch 2 is also part of the solution
and probably safe.

Let me be clear that my biggest complain and the risk is with race-to-idle in
patch 3 on trying to predict the rc6 behavior and increasing the freq to try to
complete job faster and then get to rc6 faster... That one would need a lot
more validation.

> 
> Thanks,
> Rodrigo.
> 
> > 
> > > > One day we could experiment with using engine busyness as criteria (instead
> > > > of random). Back in the day busyness was kind of the best strategy, although
> > > > sampled at submit, not at the trailing edge like here, but it still may be
> > > > able to settle down to engine configuration better in some scenarios. Only
> > > > testing could say.
> > > > 
> > > > Still, from memory random also wasn't that bad so this should be okay for
> > > > now.
> > > > 
> > > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> > > 
> > > Since you reviewed and it looks to be a middle ground point in terms
> > > of when to balancing (always like in the initial implementation vs
> > > only once like the in 90a987205c6c).
> > > 
> > > If this one is really fixing the regression by itself:
> > > Acked-by: Rodrigo Vivi <rodrigo.vivi@xxxxxxxxx>
> > > on this patch here.
> > > 
> > > But I still don't want to take the risk with touching the freq with
> > > race to idle, until not convinced that it is absolutely needed and
> > > that we are not breaking the world out there.
> > 
> > Yes agreed in principle, we have users with different priorities.
> > 
> > However the RPS patches in the series, definitely the 1st one which looks at
> > classes versus individual engines, sound plausible to me. Given the absence
> > of automated PnP testing mentioned above, in the past it was usually Chris
> > who was making the above and beyond effort to evaluate changes like these on
> > as many platforms as he could, and with different workloads. Not sure who
> > has the mandate and drive to fill that space but something will need to
> > happen.
> > 
> > Regards,
> > 
> > Tvrtko
> > 
> > > > 
> > > > Regards,
> > > > 
> > > > Tvrtko
> > > > 
> > > > > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > > > > Cc: Vinay Belgaumkar <vinay.belgaumkar@xxxxxxxxx>
> > > > > Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx>
> > > > > ---
> > > > >    .../drm/i915/gt/intel_execlists_submission.c  | 80 ++++++++++++-------
> > > > >    1 file changed, 52 insertions(+), 28 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > > index ca03880fa7e4..b95bbc8fb91a 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > > @@ -539,6 +539,41 @@ static void execlists_schedule_in(struct i915_request *rq, int idx)
> > > > >    	GEM_BUG_ON(intel_context_inflight(ce) != rq->engine);
> > > > >    }
> > > > > +static void virtual_xfer_context(struct virtual_engine *ve,
> > > > > +				 struct intel_engine_cs *engine)
> > > > > +{
> > > > > +	unsigned int n;
> > > > > +
> > > > > +	if (likely(engine == ve->siblings[0]))
> > > > > +		return;
> > > > > +
> > > > > +	if (!intel_engine_has_relative_mmio(engine))
> > > > > +		lrc_update_offsets(&ve->context, engine);
> > > > > +
> > > > > +	/*
> > > > > +	 * Move the bound engine to the top of the list for
> > > > > +	 * future execution. We then kick this tasklet first
> > > > > +	 * before checking others, so that we preferentially
> > > > > +	 * reuse this set of bound registers.
> > > > > +	 */
> > > > > +	for (n = 1; n < ve->num_siblings; n++) {
> > > > > +		if (ve->siblings[n] == engine) {
> > > > > +			swap(ve->siblings[n], ve->siblings[0]);
> > > > > +			break;
> > > > > +		}
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +static int ve_random_sibling(struct virtual_engine *ve)
> > > > > +{
> > > > > +	return prandom_u32_max(ve->num_siblings);
> > > > > +}
> > > > > +
> > > > > +static int ve_random_other_sibling(struct virtual_engine *ve)
> > > > > +{
> > > > > +	return 1 + prandom_u32_max(ve->num_siblings - 1);
> > > > > +}
> > > > > +
> > > > >    static void
> > > > >    resubmit_virtual_request(struct i915_request *rq, struct virtual_engine *ve)
> > > > >    {
> > > > > @@ -578,8 +613,23 @@ static void kick_siblings(struct i915_request *rq, struct intel_context *ce)
> > > > >    	    rq->execution_mask != engine->mask)
> > > > >    		resubmit_virtual_request(rq, ve);
> > > > > -	if (READ_ONCE(ve->request))
> > > > > +	/*
> > > > > +	 * Reschedule with a new "preferred" sibling.
> > > > > +	 *
> > > > > +	 * The tasklets are executed in the order of ve->siblings[], so
> > > > > +	 * siblings[0] receives preferrential treatment of greedily checking
> > > > > +	 * for execution of the virtual engine. At this point, the virtual
> > > > > +	 * engine is no longer in the current GPU cache due to idleness or
> > > > > +	 * contention, so it can be executed on any without penalty. We
> > > > > +	 * re-randomise at this point in order to spread light loads across
> > > > > +	 * the system, heavy overlapping loads will continue to be greedily
> > > > > +	 * executed by the first available engine.
> > > > > +	 */
> > > > > +	if (READ_ONCE(ve->request)) {
> > > > > +		virtual_xfer_context(ve,
> > > > > +				     ve->siblings[ve_random_other_sibling(ve)]);
> > > > >    		tasklet_hi_schedule(&ve->base.sched_engine->tasklet);
> > > > > +	}
> > > > >    }
> > > > >    static void __execlists_schedule_out(struct i915_request * const rq,
> > > > > @@ -1030,32 +1080,6 @@ first_virtual_engine(struct intel_engine_cs *engine)
> > > > >    	return NULL;
> > > > >    }
> > > > > -static void virtual_xfer_context(struct virtual_engine *ve,
> > > > > -				 struct intel_engine_cs *engine)
> > > > > -{
> > > > > -	unsigned int n;
> > > > > -
> > > > > -	if (likely(engine == ve->siblings[0]))
> > > > > -		return;
> > > > > -
> > > > > -	GEM_BUG_ON(READ_ONCE(ve->context.inflight));
> > > > > -	if (!intel_engine_has_relative_mmio(engine))
> > > > > -		lrc_update_offsets(&ve->context, engine);
> > > > > -
> > > > > -	/*
> > > > > -	 * Move the bound engine to the top of the list for
> > > > > -	 * future execution. We then kick this tasklet first
> > > > > -	 * before checking others, so that we preferentially
> > > > > -	 * reuse this set of bound registers.
> > > > > -	 */
> > > > > -	for (n = 1; n < ve->num_siblings; n++) {
> > > > > -		if (ve->siblings[n] == engine) {
> > > > > -			swap(ve->siblings[n], ve->siblings[0]);
> > > > > -			break;
> > > > > -		}
> > > > > -	}
> > > > > -}
> > > > > -
> > > > >    static void defer_request(struct i915_request *rq, struct list_head * const pl)
> > > > >    {
> > > > >    	LIST_HEAD(list);
> > > > > @@ -3590,7 +3614,7 @@ static void virtual_engine_initial_hint(struct virtual_engine *ve)
> > > > >    	 * NB This does not force us to execute on this engine, it will just
> > > > >    	 * typically be the first we inspect for submission.
> > > > >    	 */
> > > > > -	swp = prandom_u32_max(ve->num_siblings);
> > > > > +	swp = ve_random_sibling(ve);
> > > > >    	if (swp)
> > > > >    		swap(ve->siblings[swp], ve->siblings[0]);
> > > > >    }
> > > > >