Re: [PATCH v3 2/3] drm/i915/guc: Close deregister-context race against CT-loss

> -----Original Message-----
> From: Teres Alexis, Alan Previn <alan.previn.teres.alexis@xxxxxxxxx>
> Sent: Friday, September 22, 2023 11:32 PM
> To: Vivi, Rodrigo <rodrigo.vivi@xxxxxxxxx>
> Cc: intel-gfx@xxxxxxxxxxxxxxxxxxxxx; Gupta, Anshuman
> <anshuman.gupta@xxxxxxxxx>
> Subject: Re: [PATCH v3 2/3] drm/i915/guc: Close deregister-context race
> against CT-loss
> 
> (cc Anshuman who is working directly with the taskforce debugging this)
> 
> Thanks again for taking the time to review this patch.
> Apologies for the tardiness, rest assured debug is still ongoing.
> 
> As mentioned in prior comments, the signatures and frequency are now
> different compared to without the patches of this series.
> We are still hunting for data, as we suspect a different wakeref is still
> being held by the same trigger event despite these patches.
> 
> That said, we will continue to rebase / update this series but hold off on actual
> merge until we can be sure we have all the issues resolved.
> 
> On Thu, 2023-09-14 at 11:34 -0400, Vivi, Rodrigo wrote:
> > On Sat, Sep 09, 2023 at 08:58:45PM -0700, Alan Previn wrote:
> > > If we are at the end of suspend or very early in resume it's possible
> > > an async fence signal could lead us to the
> alan:snip
> 
> 
> > > @@ -3188,19 +3202,33 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
> > >  	/* Seal race with Reset */
> > >  	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > >  	disabled = submission_disabled(guc);
> > >
> alan:snip
> > > +	/* Change context state to destroyed and get gt-pm */
> > > +	__intel_gt_pm_get(gt);
> > > +	set_context_destroyed(ce);
> > > +	clr_context_registered(ce);
> > > +
> > > +	ret = deregister_context(ce, ce->guc_id.id);
> > > +	if (ret) {
> > > +		/* Undo the state change and put gt-pm if that failed */
> > > +		set_context_registered(ce);
> > > +		clr_context_destroyed(ce);
> > > +		intel_gt_pm_put(gt);
> >
> > This is a might_sleep inside a spin_lock! Are you 100% confident no
> > WARN was seeing during the tests indicated in the commit msg?
> alan: Good catch - I don't think we saw a WARN - I'll go back and check with
> the task force - I shall rework this function to move that outside the lock.
> 
> >
> > > +	}
> > > +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > > +
> > > +	return 0;
> >
> > If you are always returning 0, there's no pointing in s/void/int...
> Alan: agreed - will change to void.
> > >
> > >
> 
> alan:snip
> > > @@ -3279,6 +3322,17 @@ static void destroyed_worker_func(struct work_struct *w)
> > >  	struct intel_gt *gt = guc_to_gt(guc);
> > >  	int tmp;
> > >
> > > +	/*
> > > +	 * In rare cases we can get here via async context-free fence-signals that
> > > +	 * come very late in suspend flow or very early in resume flows. In these
> > > +	 * cases, GuC won't be ready but just skipping it here is fine as these
> > > +	 * pending-destroy-contexts get destroyed totally at GuC reset time at the
> > > +	 * end of suspend.. OR.. this worker can be picked up later on the next
> > > +	 * context destruction trigger after resume-completes
> >
> > who is triggering the work queue again?
> 
> alan: short answer: we don't know - and we're still hunting this (getting
> closer now, using task tgid str-name lookups).
> In the few times I've seen it, the callstack looked like this:
> 
> [33763.582036] Call Trace:
> [33763.582038]  <TASK>
> [33763.582040]  dump_stack_lvl+0x69/0x97
> [33763.582054]  guc_context_destroy+0x1b5/0x1ec
> [33763.582067]  free_engines+0x52/0x70
> [33763.582072]  rcu_do_batch+0x161/0x438
> [33763.582084]  rcu_nocb_cb_kthread+0xda/0x2d0
> [33763.582093]  kthread+0x13a/0x152
> [33763.582102]  ? rcu_nocb_gp_kthread+0x6a7/0x6a7
> [33763.582107]  ? css_get+0x38/0x38
> [33763.582118]  ret_from_fork+0x1f/0x30
> [33763.582128]  </TASK>
Alan, the above trace is not due to a missing GT wakeref; it is due to an
intel_context_put() which is called asynchronously from the call_rcu()
callback (free_engines_rcu). We need to insert an rcu_barrier() to flush all
RCU callbacks in late suspend.
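
Roughly what I mean - a sketch only, the exact hook point in the suspend-late
path is still to be decided:

	/*
	 * rcu_barrier() waits for every callback already queued with
	 * call_rcu() to run, including free_engines_rcu(), so the
	 * intel_context_put() -> guc_context_destroy() chain executes
	 * while GuC CT is still up instead of racing suspend-late.
	 */
	rcu_barrier();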

Thanks,
Anshuman.
> 
> I did add additional debug messages for tracking and I recall seeing this
> sequence via independent callstacks in the big picture:
> 	i915_sw_fence_complete -> __i915_sw_fence_complete ->
> 	__i915_sw_fence_notify(fence, FENCE_FREE) -> <..delayed?..>
> 	[ drm fence sync func ] <...> engines_notify -> call_rcu(&engines->rcu,
> 	free_engines_rcu) <..delayed?..>
> 	free_engines -> intel_context_put -> ... [kref-dec] -> guc_context_destroy
> 
> Unfortunately, we still don't know why this initial "i915_sw_fence_complete"
> is coming in during suspend-late.
> NOTE1: in the cover letter or a prior comment, I hope I mentioned the
> reproduction steps: it only occurs with a workload that starts a network
> download just before suspend begins but completes it before suspend-late.
> We are getting close to finding this - it's taking time because of the
> reproduction steps.
> 
> Anshuman can chime in if he is seeing new signatures with different callstack /
> events after this patch is applied.
> 
> 
> > I mean, I'm wondering if it is necessary to re-schedule it from inside...
> > but also wonder if this is safe anyway...
> 
> alan: so my thought process, also after consulting with John and Daniele, was:
> since we have a linked list to collect the contexts that need to be
> deregistered, and since we have up to 64k (32k excluding multi-lrc) guc-ids,
> there really is no risk in just keeping them in that list until we reach one
> of the following:
> 
> 1. full suspend happens and we actually reset the guc - in which case we
>    remove all contexts in this list, completely destroy them and release all
>    references immediately without calling any h2g (that cleanup occurs near
>    the end of gem's suspend-late, after which this worker is flushed and, if
>    it had pending contexts, would have bailed on !intel_guc_is_ready).
> 
> 2. suspend is aborted, so we come back through the resume steps and
>    eventually, at some point, get another request completion that signals a
>    context-free, which retriggers this worker; there we'll find both contexts
>    (the pending one and the new one) ready to be deregistered.
> 
> 
> >
> > > +	 */
> > > +	if (!intel_guc_is_ready(guc))
> > > +		return;
> > > +
> > >  	with_intel_gt_pm(gt, tmp)
> > >  		deregister_destroyed_contexts(guc);
> > >  }
> > > --
> > > 2.39.0
> > >




