> -----Original Message-----
> From: Teres Alexis, Alan Previn <alan.previn.teres.alexis@xxxxxxxxx>
> Sent: Friday, September 22, 2023 11:32 PM
> To: Vivi, Rodrigo <rodrigo.vivi@xxxxxxxxx>
> Cc: intel-gfx@xxxxxxxxxxxxxxxxxxxxx; Gupta, Anshuman <anshuman.gupta@xxxxxxxxx>
> Subject: Re: [PATCH v3 2/3] drm/i915/guc: Close deregister-context race against CT-loss
>
> (cc Anshuman, who is working directly with the taskforce debugging this)
>
> Thanks again for taking the time to review this patch.
> Apologies for the tardiness; rest assured debug is still ongoing.
>
> As mentioned in prior comments, the signatures and frequency are now
> different compared to without the patches of this series.
> We are still gathering data, as we suspect a different wakeref is still being
> held by the same trigger event.
>
> That said, we will continue to rebase / update this series but hold off on the
> actual merge until we can be sure we have all the issues resolved.
>
> On Thu, 2023-09-14 at 11:34 -0400, Vivi, Rodrigo wrote:
> > On Sat, Sep 09, 2023 at 08:58:45PM -0700, Alan Previn wrote:
> > > If we are at the end of suspend or very early in resume it's possible
> > > an async fence signal could lead us to the
> alan:snip
> >
> > > @@ -3188,19 +3202,33 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
> > >  	/* Seal race with Reset */
> > >  	spin_lock_irqsave(&ce->guc_state.lock, flags);
> > >  	disabled = submission_disabled(guc);
> >
> alan:snip
> > > +	/* Change context state to destroyed and get gt-pm */
> > > +	__intel_gt_pm_get(gt);
> > > +	set_context_destroyed(ce);
> > > +	clr_context_registered(ce);
> > > +
> > > +	ret = deregister_context(ce, ce->guc_id.id);
> > > +	if (ret) {
> > > +		/* Undo the state change and put gt-pm if that failed */
> > > +		set_context_registered(ce);
> > > +		clr_context_destroyed(ce);
> > > +		intel_gt_pm_put(gt);
> >
> > This is a might_sleep inside a spin_lock! Are you 100% confident no
> > WARN was seen during the tests indicated in the commit msg?
> alan: Good catch - I don't think we saw a WARN - I'll go back and check with
> the task force - I shall rework this function to get that outside the lock.
>
> > > +	}
> > > +	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > > +
> > > +	return 0;
> >
> > If you are always returning 0, there's no point in s/void/int...
> alan: agreed - will change to void.
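
For reference, a rough and untested sketch of what "get that outside the lock"
could look like, reusing only helpers already visible in the hunks above; the
final shape of the rework is of course Alan's call, and the elided parts are
left as in the snipped portions of the patch:

static void guc_lrc_desc_unpin(struct intel_context *ce)
{
	... /* local declarations and the 'disabled' handling elided, as in the hunk */

	/* Seal race with Reset */
	spin_lock_irqsave(&ce->guc_state.lock, flags);
	disabled = submission_disabled(guc);
	...
	/* Change context state to destroyed and take the gt-pm ref under the lock */
	__intel_gt_pm_get(gt);
	set_context_destroyed(ce);
	clr_context_registered(ce);
	spin_unlock_irqrestore(&ce->guc_state.lock, flags);

	/* The deregister H2G can sleep, so it now runs outside guc_state.lock */
	if (deregister_context(ce, ce->guc_id.id)) {
		/* Undo the state change and drop the gt-pm ref if that failed */
		spin_lock_irqsave(&ce->guc_state.lock, flags);
		set_context_registered(ce);
		clr_context_destroyed(ce);
		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
		intel_gt_pm_put(gt);
	}
}

In this sketch the intel_gt_pm_put() also moves outside the lock, since
dropping the last gt-pm reference can itself sleep, and the function becomes
void per the return-value comment above, as the caller no longer needs a
result once the undo happens internally.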
>
> alan:snip
> > > @@ -3279,6 +3322,17 @@ static void destroyed_worker_func(struct work_struct *w)
> > >  	struct intel_gt *gt = guc_to_gt(guc);
> > >  	int tmp;
> > >
> > > +	/*
> > > +	 * In rare cases we can get here via async context-free fence-signals that
> > > +	 * come very late in the suspend flow or very early in the resume flow. In
> > > +	 * these cases, GuC won't be ready, but just skipping it here is fine as
> > > +	 * these pending-destroy contexts get destroyed completely at GuC reset
> > > +	 * time at the end of suspend... OR... this worker can be picked up later
> > > +	 * on the next context-destruction trigger after resume completes.
> >
> > who is triggering the work queue again?
>
> alan: short answer: we don't know - and still hunting this (getting closer now,
> using task tgid str-name lookups).
> In the few times I've seen it, the callstack looked like this:
>
> [33763.582036] Call Trace:
> [33763.582038]  <TASK>
> [33763.582040]  dump_stack_lvl+0x69/0x97
> [33763.582054]  guc_context_destroy+0x1b5/0x1ec
> [33763.582067]  free_engines+0x52/0x70
> [33763.582072]  rcu_do_batch+0x161/0x438
> [33763.582084]  rcu_nocb_cb_kthread+0xda/0x2d0
> [33763.582093]  kthread+0x13a/0x152
> [33763.582102]  ? rcu_nocb_gp_kthread+0x6a7/0x6a7
> [33763.582107]  ? css_get+0x38/0x38
> [33763.582118]  ret_from_fork+0x1f/0x30
> [33763.582128]  </TASK>

Alan, the above trace is not due to a missing GT wakeref; it is due to an
intel_context_put() which is called asynchronously via call_rcu(__free_engines).
We need to insert an rcu_barrier() to flush all RCU callbacks in late suspend
(rough sketch of what I mean at the bottom of this mail).
Thanks,
Anshuman.

> I did add additional debug messages for tracking and I recall seeing this
> sequence via independent callstacks in the big picture:
>
>   i915_sw_fence_complete ->
>     __i915_sw_fence_complete ->
>       __i915_sw_fence_notify(fence, FENCE_FREE) -> <..delayed?..>
>   [ drm fence sync func ] <...> engines_notify ->
>     call_rcu(&engines->rcu, free_engines_rcu) <..delayed?..>
>   free_engines -> intel_context_put -> ... [kref-dec] --> guc_context_destroy
>
> Unfortunately, we still don't know why this initial "i915_sw_fence_complete"
> is coming during suspend-late.
> NOTE1: in the cover letter or a prior comment, I hope I mentioned the
> reproduction steps: it only occurs with a workload doing a network download
> that begins just before suspend is started but completes before suspend-late.
> We are getting close to finding this - it is taking time because of the
> reproduction steps.
>
> Anshuman can chime in if he is seeing new signatures with different
> callstacks / events after this patch is applied.
>
> > I mean, I'm wondering if it is necessary to re-schedule it from inside...
> > but also wonder if this is safe anyway...
>
> alan: so my thought process, also after consulting with John and Daniele, was:
> since we have a linked list to collect the contexts that need to be
> deregistered, and since we have up to 64k (32k excluding multi-lrc) guc-ids,
> there really is little risk in just keeping them on that list until we reach
> one of the following:
>
> 1. Full suspend happens and we actually reset the GuC - in which case we
>    remove all contexts from this list, destroy them completely and release
>    all references immediately, without sending any H2G. (That cleanup occurs
>    near the end of GEM's suspend-late, after which this worker will have
>    flushed and, if it still had pending contexts, bailed on
>    !intel_guc_is_ready.)
>
> 2. Suspend is aborted, so we come back through the resume steps and
>    eventually, at some point, get another request completion that signals a
>    context free, retriggering this worker, in which case we'll find both
>    contexts ready to be deregistered.
>
> > > +	 */
> > > +	if (!intel_guc_is_ready(guc))
> > > +		return;
> > > +
> > >  	with_intel_gt_pm(gt, tmp)
> > >  		deregister_destroyed_contexts(guc);
> > >  }
> > > --
> > > 2.39.0
> > >
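
For completeness, a rough illustration of the rcu_barrier() idea mentioned
above. The helper name and call-site below are made up purely for
illustration; where exactly this belongs in the i915 suspend-late flow is
still to be decided:

/* Hypothetical suspend-late helper - name and placement are illustrative only */
static void gem_suspend_late_flush_context_free(struct drm_i915_private *i915)
{
	/*
	 * engines_notify() queues free_engines_rcu() via call_rcu(), so the
	 * final intel_context_put() (and hence guc_context_destroy()) can run
	 * from an RCU callback at any point during suspend.  rcu_barrier()
	 * waits for every RCU callback already queued on any CPU to complete,
	 * flushing those pending context-frees before GuC/CT is taken down.
	 */
	rcu_barrier();
}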