On Wed, 2023-08-02 at 16:34 -0700, Teres Alexis, Alan Previn wrote:
> This series is the result of debugging issues root caused to
> races between the GuC's destroyed_worker_func being triggered vs
> repeating suspend-resume cycles with concurrent delayed
> fence signals for engine-freeing.
>
> The reproduction steps require that an app is created right before
> the start of the suspend cycle where it creates a new gem
> context and submits a tiny workload that would complete in the
> middle of the suspend cycle. However this app uses dma-buffer
> sharing or dma-fence with non-GPU objects or signals that
> eventually triggers a FENCE_FREE via __i915_sw_fence_notify that
> connects to engines_notify -> free_engines_rcu ->
> intel_context_put -> kref_put(&ce->ref..) that queues the
> worker after the GuC's CTB has been disabled (i.e. after
> i915-gem's suspend-late flows).
>

As an FYI - in offline conversations with John and Daniele, we have
agreed that at least the first two patches in this series are
necessary improvements, but the last patch may remain open while
further offline debug continues to pin down the source of the above
fence-signal flow.

For now we are hoping to proceed with reviewing the first two patches
and only look into the third patch if there is system-level fence
signalling that can truly trigger this anomaly, or if it is just a
straddling request somewhere within i915 that has appeared or hung at
the wrong time and needs to be fixed.

alan:snip