On Wed, 2023-12-13 at 16:23 -0500, Vivi, Rodrigo wrote:
> On Tue, Dec 12, 2023 at 08:57:16AM -0800, Alan Previn wrote:
> > If we are at the end of suspend or very early in resume
> > it's possible an async fence signal (via rcu_call) is triggered
> > to free_engines which could lead us to the execution of
> > the context destruction worker (after a prior worker flush).
alan:snip
> >
> > Thus, do an unroll in guc_lrc_desc_unpin and
> > deregister_destroyed_contexts if guc_lrc_desc_unpin fails due
> > to CT send failure.
> > When unrolling, keep the context in the GuC's destroy-list so
> > it can get picked up on the next destroy worker invocation
> > (if suspend aborted) or get fully purged as part of a GuC
> > sanitization (end of suspend) or a reset flow.
> >
> > Signed-off-by: Alan Previn <alan.previn.teres.alexis@xxxxxxxxx>
> > Signed-off-by: Anshuman Gupta <anshuman.gupta@xxxxxxxxx>
> > Tested-by: Mousumi Jana <mousumi.jana@xxxxxxxxx>
> > Acked-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@xxxxxxxxx>
>
> Thanks for all the explanations, patience and great work!
>
> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@xxxxxxxxx>

alan: Thanks Rodrigo for the RB last week, just a quick update:

I can't reproduce the BAT failures, which seem to be intermittent
across platforms and tests. However, a noticeable number of failures
do keep occurring on i915_selftest @live @requests, where the last
test leaked a wakeref and the failing test hangs waiting for the gt
to idle before starting its test. I have to debug this further,
although from code inspection it is unrelated to the patches in this
series. Hopefully it's a different issue.