On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
> On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
> > Thanks for taking the time to review this Tvrtko, replies inline below.

alan:snip

> > > Main concern is that we need to be sure there are no possible
> > > ill-effects, like letting the GPU/GuC scribble on some memory we
> > > unmapped (or will unmap), having let the suspend continue after timing
> > > out, and not perhaps doing the forced wedge like wait_for_suspend() does
> > > on the existing timeout path.
> > alan: this will not happen because the held wakeref is never force-released
> > after the timeout - so what happens is the kernel would bail the suspend.
>
> How does it know to fail the suspend when there is no error code
> returned with this timeout? Maybe a stupid question.. my knowledge of
> suspend-resume paths was not great even before I forgot it all.

alan: Tvrtko, you and I both, sir (apologies for the tardy response - yet
another busy week). I did trace back the gt->wakeref runtime get/put calls
before I posted these patches. I believe the first 'get' leads to
__intel_wakeref_get_first, which calls intel_runtime_pm_get via the rpm_get
helper and eventually executes pm_runtime_get_sync(rpm->kdev); (hanging off
i915). (Of course, there is a corresponding path for '_put_last'.)
Non-first/non-last gets and puts only adjust the counter for the gt, but a
missed last 'put' means the kernel knows i915 hasn't released everything, so
the suspend bails. (See the sketch at the end of this mail.)

alan:snip

> > Recap: so in both cases (original vs this patch), if we had a buggy gt-wakeref leak,
> > we don't get invalid guc-accesses, but without this patch, we wait forever,
> > and with this patch, we get some messages and eventually bail the suspend.
>
> It is not possible to wait for lost G2H in something like
> intel_uc_suspend() and simply declare "bad things happened" if it times
> out there, and forcibly clean it all up? (Which would include releasing
> all the abandoned pm refs, so this patch wouldn't be needed.)
>
alan: I'm not sure intel_uc_suspend should be held up by a gt-level wakeref
check unless huc/guc/gsc-uc are the only ones ever taking a gt wakeref. What
we do know from a uc perspective:
- We now ensure the outstanding guc-related workers are flushed, which we
  didn't do before (addressed by patch #1).
- Any further late H2G-SchedDisable no longer leaks wakerefs when calling H2G
  and not realizing it failed (addressed by patch #2).
- We already "forcibly clean it all up" at the end of intel_uc_suspend when we
  do the guc reset and clean up all guc-ids (pre-existing upstream code).
- We previously didn't have a coherent guarantee that "this is the end", i.e.
  no more new requests after intel_uc_suspend. By code logic we thought we did
  (that's why intel_uc_suspend ends with a guc reset), but we now know
  otherwise. So we fix that by adding the additional rcu_barrier (also part of
  patch #2).

That said, patch #3 is NOT fixing a bug in guc - it's about making sure that
if we ever have a future racy gt-wakeref late-leak somewhere - no matter which
subsystem took it (guc is not the only subsystem taking gt wakerefs) - we at
least don't hang forever in this code. Of course, based on that, even without
patch #3 I am confident the issue is resolved anyway, so we could just drop
patch #3 if you prefer?

...alan
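
To make the first-get/last-put behaviour referenced above concrete, here is a
minimal userspace sketch of the pattern. It is not the i915 implementation:
fake_wakeref, fake_runtime_pm_get/put and try_suspend are invented names for
illustration only, and just the shape mirrors how __intel_wakeref_get_first /
pm_runtime_get_sync interact - only the first get takes the device-level
reference, only the last put drops it, and a leaked get leaves that reference
held so suspend bails instead of proceeding.

/*
 * Illustrative userspace model of the first-get/last-put wakeref pattern.
 * NOT the actual i915 code; all names here are made up for the sketch.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_wakeref {
	atomic_int count;	/* gt-level wakeref count */
	int rpm_held;		/* device-level runtime-pm refs (i915/kdev) */
};

static void fake_runtime_pm_get(struct fake_wakeref *wf)
{
	wf->rpm_held++;		/* stands in for pm_runtime_get_sync(rpm->kdev) */
}

static void fake_runtime_pm_put(struct fake_wakeref *wf)
{
	wf->rpm_held--;
}

static void wakeref_get(struct fake_wakeref *wf)
{
	/* only the first get takes the device-level reference */
	if (atomic_fetch_add(&wf->count, 1) == 0)
		fake_runtime_pm_get(wf);
}

static void wakeref_put(struct fake_wakeref *wf)
{
	/* only the last put releases the device-level reference */
	if (atomic_fetch_sub(&wf->count, 1) == 1)
		fake_runtime_pm_put(wf);
}

static bool try_suspend(struct fake_wakeref *wf)
{
	/*
	 * A leaked get (no matching put) keeps the device-level reference
	 * held, so suspend is refused rather than letting the HW touch
	 * memory that is about to be unmapped.
	 */
	if (wf->rpm_held) {
		printf("suspend refused: %d wakeref(s) still held\n",
		       atomic_load(&wf->count));
		return false;
	}
	printf("suspend proceeds\n");
	return true;
}

int main(void)
{
	struct fake_wakeref wf = { .count = 0, .rpm_held = 0 };

	wakeref_get(&wf);	/* first get -> takes the runtime-pm ref */
	wakeref_get(&wf);	/* nested get -> only bumps the count */
	wakeref_put(&wf);	/* not the last put -> rpm ref kept */
	try_suspend(&wf);	/* refused: one wakeref is still held/leaked */

	wakeref_put(&wf);	/* last put -> drops the runtime-pm ref */
	try_suspend(&wf);	/* now allowed */
	return 0;
}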