Re: [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Wed, 25 Oct 2023 13:58:48 +0100

On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote:
On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
Thanks for taking the time to review this Tvrtko, replies inline below.
alan:snip

Main concern is that we need to be sure there are no possible
ill-effects, like letting the GPU/GuC scribble on some memory we
unmapped (or will unmap), having let the suspend continue after timing
out, and not perhaps doing the forced wedge like wait_for_suspend() does
on the existing timeout path.
alan: this will not happen because the held wakeref is never force-released
after the timeout - so what happens is the kernel would bail the suspend.

How does it know to fail the suspend when there is no error code
returned with this timeout? Maybe a stupid question.. my knowledge of
suspend-resume paths was not great even before I forgot it all.
alan:Tvrtko, you and I both sir. (apologies for the tardy response yet again busy week).
So i did trace back the gt->wakeref before i posted these patches and discovered that
runtime get/put call, i believe that the first 'get' leads to __intel_wakeref_get_first
which calls intel_runtime_pm_get via rpm_get helper and eventually executes a
pm_runtime_get_sync(rpm->kdev); (hanging off i915). (ofc, there is a corresponding
for '_put_last') - so non-first, non-last increases the counter for the gt...
but this last miss will mean kernel knows i915 hasnt 'put' everything.

alan:snip

Recap: so in both cases (original vs this patch), if we had a buggy gt-wakeref leak,
we dont get invalid guc-accesses, but without this patch, we wait forever,
and with this patch, we get some messages and eventually bail the suspend.

It is not possible to wait for lost G2H in something like
intel_uc_suspend() and simply declare "bad things happened" if it times
out there, and forcibly clean it all up? (Which would include releasing
all the abandoned pm refs, so this patch wouldn't be needed.)

alan: I'm not sure if intel_uc_suspend should be held up by gt-level wakeref
check unless huc/guc/gsc-uc are the only ones ever taking a gt wakeref.

As we already know, what we do know from a uc-perspective:
-  ensure the outstanding guc related workers is flushed which we didnt before
(addressed by patch #1).
- any further late H2G-SchedDisable is not leaking wakerefs when calling H2G
and not realizing it failed (addressed by patch #2).
- (we already), "forcibly clean it all up" at the end of the intel_uc_suspend
when we do the guc reset and cleanup all guc-ids. (pre-existing upstream code)
- we previously didnt have a coherrent guarantee that "this is the end" i.e. no
more new request after intel_uc_suspend. I mean by code logic, we thought we did
(thats why intel_uc_suspend ends wth a guc reset), but we now know otherwise.
So we that fix by adding the additional rcu_barrier (also part of patch #2).

It is not clear to me from the above if that includes cleaning up the 
outstanding CT replies or no. But anyway..

That said, patch-3 is NOT fixing a bug in guc -its about "if we ever have
a future racy gt-wakeref late-leak somewhere - no matter which subsystem
took it (guc is not the only subsystem taking gt wakerefs), we at
least don't hang forever in this code. Ofc, based on that, even without
patch-3 i am confident the issue is resolved anyway.
So we could just drop patch-3 is you prefer?

.. given this it does sound to me that if you are confident patch 3 
isn't fixing anything today that it should be dropped.

Regards,

Tvrtko