Re: [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Tue, 14 Nov 2023 17:27:18 +0000

On 13/11/2023 17:57, Teres Alexis, Alan Previn wrote:
On Wed, 2023-10-25 at 13:58 +0100, Tvrtko Ursulin wrote:
On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote:
On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
alan:snip
It is not possible to wait for lost G2H in something like
intel_uc_suspend() and simply declare "bad things happened" if it times
out there, and forcibly clean it all up? (Which would include releasing
all the abandoned pm refs, so this patch wouldn't be needed.)

alan: I'm not sure if intel_uc_suspend should be held up by gt-level wakeref
check unless huc/guc/gsc-uc are the only ones ever taking a gt wakeref.

As we already know, what we do know from a uc-perspective:
-  ensure the outstanding guc related workers is flushed which we didnt before
(addressed by patch #1).
- any further late H2G-SchedDisable is not leaking wakerefs when calling H2G
and not realizing it failed (addressed by patch #2).
- (we already), "forcibly clean it all up" at the end of the intel_uc_suspend
when we do the guc reset and cleanup all guc-ids. (pre-existing upstream code)
- we previously didnt have a coherrent guarantee that "this is the end" i.e. no
more new request after intel_uc_suspend. I mean by code logic, we thought we did
(thats why intel_uc_suspend ends wth a guc reset), but we now know otherwise.
So we that fix by adding the additional rcu_barrier (also part of patch #2).

It is not clear to me from the above if that includes cleaning up the
outstanding CT replies or no. But anyway..
alan: Apologies, i should have made it clearer by citing the actual function
name calling out the steps of interest: So if you look at the function
"intel_gt_suspend_late", it calls "intel_uc_suspend" which in turn calls
"intel_guc_suspend" which does a soft reset onto guc firmware - so after that
we can assume all outstanding G2Hs are done. Going back up the stack,
intel_gt_suspend_late finally calls gt_sanitize which calls "intel_uc_reset_prepare"
which calls "intel_guc_submission_reset_prepare" which calls
"scrub_guc_desc_for_outstanding_g2h" which does what we are looking for for all
types of outstanding G2H. As per prior review comments, besides closing the race
condition, we were missing an rcu_barrier (which caused new requests to come in way
late). So we have resolved both in Patch #2.

Cool, I read this as all known bugs are fixed by first two patches.

That said, patch-3 is NOT fixing a bug in guc -its about "if we ever have
a future racy gt-wakeref late-leak somewhere - no matter which subsystem
took it (guc is not the only subsystem taking gt wakerefs), we at
least don't hang forever in this code. Ofc, based on that, even without
patch-3 i am confident the issue is resolved anyway.
So we could just drop patch-3 is you prefer?

.. given this it does sound to me that if you are confident patch 3
isn't fixing anything today that it should be dropped.
alan: I won't say its NOT fixing anything - i am saying it's not fixing
this specific bug where we have the outstanding G2H from a context destruction
operation that got dropped. I am okay with dropping this patch in the next rev
but shall post a separate stand alone version of Patch3 - because I believe
all other i915 subsystems that take runtime-pm's DO NOT wait forever when entering
suspend - but GT is doing this. This means if there ever was a bug introduced,
it would require serial port or ramoops collection to debug. So i think such a
patch, despite not fixing this specific bug will be very helpful for debuggability
of future issues. After all, its better to fail our suspend when we have a bug
rather than to hang the kernel forever.

Issue I have is that I am not seeing how it fails the suspend when 
nothing is passed out from *void* wait_suspend(gt). To me it looks the 
suspend just carries on. And if so, it sounds dangerous to allow it to 
do that with any future/unknown bugs in the suspend sequence. Existing 
timeout wedges the GPU (and all that entails). New code just says "meh 
I'll just carry on regardless".

Regards,

Tvrtko