On Fri, 2022-09-16 at 09:58 +0100, Tvrtko Ursulin wrote:
> On 16/09/2022 08:53, Teres Alexis, Alan Previn wrote:
> > On Thu, 2022-09-15 at 09:48 +0100, Tvrtko Ursulin wrote:
> > > On 15/09/2022 03:12, Alan Previn wrote:
> > > > From: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > >
> > > > +static void guc_flush_all_delayed_disable_sched_contexts(struct intel_guc *guc)
> > > > +{
> > > > +	struct intel_context *ce;
> > > > +	unsigned long index;
> > > > +	unsigned long flags;
> > > > +	unsigned long ceflags;
> > > >
> > > > -	with_intel_runtime_pm(runtime_pm, wakeref)
> > > > -		__guc_context_sched_disable(guc, ce, guc_id);
> > > > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > > > +	xa_for_each(&guc->context_lookup, index, ce) {
> > > > +		if (!kref_get_unless_zero(&ce->ref))
> > > > +			continue;
> > > > +		xa_unlock(&guc->context_lookup);
> > >
> > > So this whole loop _needs_ to run with interrupts disabled? Explaining
> > > why in a comment would be good.
> > >
> > Being mid-reset, the locking mode is consistent with other functions also
> > used as part of the reset preparation that parse and potentially modify
> > contexts. I believe the goal is to handle all of this parsing without
> > getting conflicting latent G2H replies that break the preparation flow
> > (which herds active contexts into a smaller set of states as part of
> > reset) - but I will double-check with my colleagues.

Alan: Update - I realize the GuC reset-preparation code disables IRQs for
the GuC being reset prior to this function, so in fact we really ought not
to see any change to that xa_array from a context-id change arriving via an
interrupt orthogonal to this thread. I will change to xa_lock.

> > > > +		if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
> > > > +		    cancel_delayed_work(&ce->guc_state.sched_disable_delay)) {
> > > > +			spin_lock_irqsave(&ce->guc_state.lock, ceflags);
> > > > +			spin_unlock_irqrestore(&ce->guc_state.lock, ceflags);
> > >
> > > This deserves a comment about what lock toggling wants to ensure.
> > >
> > I realize this might have been my local rebasing mistake; the intention
> > was to handle cases where sched_disable_delay wasn't pending but was
> > potentially still executing do_sched_disable. I believe I could try
> > cancel_delayed_work_sync (but I'm not sure I can call that might-sleep
> > function mid-reset while not interruptible). Else, I would move that
> > lock-unlock to the case where cancel_delayed_work did not return true
> > (as per the original intent before my rebase error).
>
> Okay, a comment like "flush any currently executing do_sched_disable"
> above the lock toggling would do then. Even better if you add what
> happens if that flushing isn't done.
>
> > > Also, if the loop runs with interrupts disabled, what is the point of
> > > the irqsave variant in here??
> > Yes - it's redundant, let me fix that; apologies for that.

Alan: Same thing here - a context's GuC state should not change in response
to an IRQ from that GuC, since we disabled it while in reset preparation,
so it's actually "not needed" as opposed to "redundant".

> > > Also2, what is the reason for dropping the lock? intel_context_put?
> > Being consistent with other reset-preparation code that closes contexts,
> > the lock is dropped before the intel_context_put.
> > (I hope I am not misunderstanding your question.)
>
> Right, okay.. so for locking-inversion issues, intel_context_put must
> not be called with guc_state.lock held, yes?
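Alan: To make the proposed fixes concrete, below is a rough sketch of how I
picture the function after folding in the points above: xa_lock instead of
xa_lock_irqsave, plain spin_lock for the toggle, and the toggle moved to
the case where cancel_delayed_work() returns false. Sketch only, not the
final patch - the re-lock/put ordering at the loop tail follows the
existing reset-preparation code.

/*
 * Sketch only - assumes, per the discussion above, that reset
 * preparation has already disabled interrupts from this GuC, so
 * neither context_lookup nor a context's guc_state can change
 * underneath us via a G2H interrupt.
 */
static void guc_flush_all_delayed_disable_sched_contexts(struct intel_guc *guc)
{
	struct intel_context *ce;
	unsigned long index;

	xa_lock(&guc->context_lookup);
	xa_for_each(&guc->context_lookup, index, ce) {
		if (!kref_get_unless_zero(&ce->ref))
			continue;
		xa_unlock(&guc->context_lookup);

		if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
		    !cancel_delayed_work(&ce->guc_state.sched_disable_delay)) {
			/*
			 * Flush any currently executing do_sched_disable:
			 * the worker takes guc_state.lock, so toggling the
			 * lock here guarantees it has finished with this
			 * context before the reset scrubbing proceeds.
			 */
			spin_lock(&ce->guc_state.lock);
			spin_unlock(&ce->guc_state.lock);
		}

		/* Never call intel_context_put with guc_state.lock held. */
		intel_context_put(ce);
		xa_lock(&guc->context_lookup);
	}
	xa_unlock(&guc->context_lookup);
}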
> Not a mandatory request, but if you want you could look into the option
> of avoiding lock dropping and instead doing xa_erase and recording the
> list of contexts for sched disable or put in a local list, and doing
> them all in one block outside the locked/irq disabled section. Although
> tbh I am not sure if that would improve anything. Probably not in this
> case of a reset path, but if there are other places in GuC code which do
> this it may be beneficial for less hammering on the CPU and core
> synchronisation for atomics.
>

Apologies for my ignorance - perhaps I misunderstood how these functions
work, but I assumed that a non-zero return from kref_get_unless_zero
leading us here meant there is still a ref on the context that didn't come
from the reset path, so I am just following the correct rules to release
that ref and not destroy the context yet - leaving it in the
pending-disable state that is handled in scrub_guc_desc_for_outstanding_g2h,
which gets called later in the reset preparation.

Actually, I realize the better option might be to move the above code into
the loop already present inside scrub_guc_desc_for_outstanding_g2h, just
prior to its check of whether a context is pending-disable. That would
ensure we take all these context locks once, in the same function, while
herding all possible states when scrubbing the outstanding G2H.
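Alan: For reference, here is roughly how I pictured your local-list idea -
collect the refs in a single locked pass, then do the cancel/flush and the
puts outside the lock. Note that guc_state.flush_link is a hypothetical
list_head invented just for this sketch; intel_context has no such member
today.

/* Sketch of the local-list alternative; flush_link is hypothetical. */
static void guc_flush_all_delayed_disable_sched_contexts(struct intel_guc *guc)
{
	struct intel_context *ce, *cn;
	unsigned long index;
	LIST_HEAD(flush_list);

	/* One locked pass: only take refs and collect, do no work here. */
	xa_lock(&guc->context_lookup);
	xa_for_each(&guc->context_lookup, index, ce) {
		if (!kref_get_unless_zero(&ce->ref))
			continue;
		list_add_tail(&ce->guc_state.flush_link, &flush_list);
	}
	xa_unlock(&guc->context_lookup);

	/* Cancel/flush and drop the refs with no xarray lock held. */
	list_for_each_entry_safe(ce, cn, &flush_list, guc_state.flush_link) {
		list_del(&ce->guc_state.flush_link);
		if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
		    !cancel_delayed_work(&ce->guc_state.sched_disable_delay)) {
			/* Flush a do_sched_disable that is mid-execution. */
			spin_lock(&ce->guc_state.lock);
			spin_unlock(&ce->guc_state.lock);
		}
		intel_context_put(ce);
	}
}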