On Tue, 2023-11-14 at 17:52 +0000, Tvrtko Ursulin wrote: > On 14/11/2023 17:37, Teres Alexis, Alan Previn wrote: > > On Tue, 2023-11-14 at 17:27 +0000, Tvrtko Ursulin wrote: > > > On 13/11/2023 17:57, Teres Alexis, Alan Previn wrote: > > > > On Wed, 2023-10-25 at 13:58 +0100, Tvrtko Ursulin wrote: > > > > > On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote: > > > > > > On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote: > > > > > > > On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote: > alan:snip > > alan: So i did trace back the gt->wakeref before i posted these patches and > > see that within these runtime get/put calls, i believe the first 'get' leads > > to __intel_wakeref_get_first which calls intel_runtime_pm_get via rpm_get > > helper and eventually executes a pm_runtime_get_sync(rpm->kdev); (hanging off > > i915_device). (naturally there is a corresponding mirros for the '_put_last'). > > So this means the first-get and last-put lets the kernel know. Thats why when > > i tested this patch, it did actually cause the suspend to abort from kernel side > > and the kernel would print a message indicating i915 was the one that didnt > > release all refs. > > Ah that would be much better then. > > Do you know if everything gets resumed/restored correctly in that case > or we would need some additional work to maybe early exit from callers > of wait_for_suspend()? alan: So assuming we are still discussing about a "potentially new future leaked-wakeref bug" (i.e. putting aside the fact that Patch #1 + #2 resolves this specific series' bug), based on the previous testing we did, after this timeout-bail trigger, the suspend flow bails and gt/guc operation does actually continue as normal. However, its been a long time since we tested this so i am not sure of how accidental-new-future bugs might play to this assumption especially if some other subsystem that leaked the rpm wakref but that subsystem did NOT get reset like how GuC is reset at the end of suspend. > > What I would also ask is to see if something like injecting a probing > failure is feasible, so we can have this new timeout exit path > constantly/regularly tested in CI. alan: Thats a good idea. In line with this, i would like to point out that rev6 of this series has been posted but i removed this Patch #3. However i did post this Patch #3 as a standalone patch here: https://patchwork.freedesktop.org/series/126414/ as i anticipate this patch will truly help with future issue debuggability. That said, i shall post a review on that patch with your suggestion to add an injected probe error for the suspend-resume flow and follow up on that one separately. > > Regards, > > Tvrtko > > > alan: Anyways, i have pulled this patch out of rev6 of this series and created a > > separate standalone patch for this patch #3 that we review independently. > >