On Fri, 2022-09-16 at 09:58 +0100, Tvrtko Ursulin wrote:
> On 16/09/2022 08:53, Teres Alexis, Alan Previn wrote:
> > On Thu, 2022-09-15 at 09:48 +0100, Tvrtko Ursulin wrote:
> > > On 15/09/2022 03:12, Alan Previn wrote:
> > > > From: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > >
> > > > +static void guc_flush_all_delayed_disable_sched_contexts(struct intel_guc *guc)
> > > > +{
> > > > +	struct intel_context *ce;
> > > > +	unsigned long index;
> > > > +	unsigned long flags;
> > > > +	unsigned long ceflags;
> > > >
> > > > -	with_intel_runtime_pm(runtime_pm, wakeref)
> > > > -		__guc_context_sched_disable(guc, ce, guc_id);
> > > > +	xa_lock_irqsave(&guc->context_lookup, flags);
> > > > +	xa_for_each(&guc->context_lookup, index, ce) {
> > > > +		if (!kref_get_unless_zero(&ce->ref))
> > > > +			continue;
> > > > +		xa_unlock(&guc->context_lookup);
> > >
> > > So this whole loop _needs_ to run with interrupts disabled? Explaining
> > > why in a comment would be good.
> > >
> > Being mid-reset, the locking mode is consistent with other functions also
> > used as part of the reset preparation that parse and potentially modify
> > contexts. I believe the goal is to handle all of this parsing without
> > getting conflicting latent G2H replies that break the preparation flow
> > (which herds active contexts into a smaller set of states as part of
> > reset) - but I will double-check with my colleagues.

Alan: Update - I realize the GuC reset-preparation code disables IRQs for
the GuC being reset prior to this function, so in fact we really ought not
to see any change to that xa_array from a context-id change arriving via an
interrupt orthogonal to this thread. I will change to xa_lock.

> > > > +		if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
> > > > +		    cancel_delayed_work(&ce->guc_state.sched_disable_delay)) {
> > > > +			spin_lock_irqsave(&ce->guc_state.lock, ceflags);
> > > > +			spin_unlock_irqrestore(&ce->guc_state.lock, ceflags);
> > >
> > > This deserves a comment about what lock toggling wants to ensure.
> > >
> > I realize this might have been my local rebasing mistake; the intention
> > was to handle cases where sched_disable_delay wasn't pending but was
> > potentially still executing do_sched_disable. I believe I could try
> > cancel_delayed_work_sync (but I'm not sure I can call that might-sleep
> > function mid-reset while not interruptible). Else, I would move that
> > lock-unlock to the case where cancel_delayed_work did not return true
> > (as per the original intent before my rebase error).
>
> Okay, a comment like "flush any currently executing do_sched_disable"
> above the lock toggling would do then. Even better if you add what
> happens if that flushing isn't done.
>
> > > Also, if the loop runs with interrupts disabled, what is the point of
> > > the irqsave variant in here??
> > Yes - it's redundant, let me fix that; apologies for that.

Alan: Same thing here - a context's GuC state should not change in response
to an IRQ from that GuC, since we disabled it while in reset preparation,
so it's actually "not needed" as opposed to "redundant".

> > > Also2, what is the reason for dropping the lock? intel_context_put?
> > Being consistent with other reset-preparation code that closes contexts,
> > the lock is dropped before the intel_context_put.
> > (I hope I am not misunderstanding your question.)
>
> Right, okay.. so for locking-inversion issues, intel_context_put must
> not be called with guc_state.lock held, yes?
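Alan: To make the proposed fixes concrete, below is a rough sketch of how I
picture the function after folding in the points above: xa_lock instead of
xa_lock_irqsave, plain spin_lock for the toggle, and the toggle moved to
the case where cancel_delayed_work() returns false. Sketch only, not the
final patch - the re-lock/put ordering at the loop tail follows the
existing reset-preparation code.

/*
 * Sketch only - assumes, per the discussion above, that reset
 * preparation has already disabled interrupts from this GuC, so
 * neither context_lookup nor a context's guc_state can change
 * underneath us via a G2H interrupt.
 */
static void guc_flush_all_delayed_disable_sched_contexts(struct intel_guc *guc)
{
	struct intel_context *ce;
	unsigned long index;

	xa_lock(&guc->context_lookup);
	xa_for_each(&guc->context_lookup, index, ce) {
		if (!kref_get_unless_zero(&ce->ref))
			continue;
		xa_unlock(&guc->context_lookup);

		if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
		    !cancel_delayed_work(&ce->guc_state.sched_disable_delay)) {
			/*
			 * Flush any currently executing do_sched_disable:
			 * the worker takes guc_state.lock, so toggling the
			 * lock here guarantees it has finished with this
			 * context before the reset scrubbing proceeds.
			 */
			spin_lock(&ce->guc_state.lock);
			spin_unlock(&ce->guc_state.lock);
		}

		/* Never call intel_context_put with guc_state.lock held. */
		intel_context_put(ce);
		xa_lock(&guc->context_lookup);
	}
	xa_unlock(&guc->context_lookup);
}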
> Not a mandatory request, but if you want you could look into the option
> of avoiding lock dropping and instead doing xa_erase and recording the
> list of contexts for sched disable or put in a local list, and doing
> them all in one block outside the locked/irq disabled section. Although
> tbh I am not sure if that would improve anything. Probably not in this
> case of a reset path, but if there are other places in GuC code which do
> this it may be beneficial for less hammering on the CPU and core
> synchronisation for atomics.
>

Apologies for my ignorance - perhaps I misunderstood how these functions
work, but I assumed that a non-zero return from kref_get_unless_zero
leading us here meant there is still a ref on the context that didn't come
from the reset path, so I am just following the correct rules to release
that ref and not destroy the context yet - leaving it in the
pending-disable state that is handled in scrub_guc_desc_for_outstanding_g2h,
which gets called later in the reset preparation.

Actually, I realize the better option might be to move the above code into
the loop already present inside scrub_guc_desc_for_outstanding_g2h, just
prior to its check of whether a context is pending-disable. That would
ensure we take all these context locks once, in the same function, while
herding all possible states when scrubbing the outstanding G2H.
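Alan: For reference, here is roughly how I pictured your local-list idea -
collect the refs in a single locked pass, then do the cancel/flush and the
puts outside the lock. Note that guc_state.flush_link is a hypothetical
list_head invented just for this sketch; intel_context has no such member
today.

/* Sketch of the local-list alternative; flush_link is hypothetical. */
static void guc_flush_all_delayed_disable_sched_contexts(struct intel_guc *guc)
{
	struct intel_context *ce, *cn;
	unsigned long index;
	LIST_HEAD(flush_list);

	/* One locked pass: only take refs and collect, do no work here. */
	xa_lock(&guc->context_lookup);
	xa_for_each(&guc->context_lookup, index, ce) {
		if (!kref_get_unless_zero(&ce->ref))
			continue;
		list_add_tail(&ce->guc_state.flush_link, &flush_list);
	}
	xa_unlock(&guc->context_lookup);

	/* Cancel/flush and drop the refs with no xarray lock held. */
	list_for_each_entry_safe(ce, cn, &flush_list, guc_state.flush_link) {
		list_del(&ce->guc_state.flush_link);
		if (test_bit(CONTEXT_GUC_INIT, &ce->flags) &&
		    !cancel_delayed_work(&ce->guc_state.sched_disable_delay)) {
			/* Flush a do_sched_disable that is mid-execution. */
			spin_lock(&ce->guc_state.lock);
			spin_unlock(&ce->guc_state.lock);
		}
		intel_context_put(ce);
	}
}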