Re: [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Thu, 28 Sep 2023 13:46:10 +0100





On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
Thanks for taking the time to review this Tvrtko, replies inline below.

On Wed, 2023-09-27 at 10:02 +0100, Tvrtko Ursulin wrote:
On 26/09/2023 20:05, Alan Previn wrote:
When suspending, add a timeout when calling
intel_gt_pm_wait_for_idle else if we have a lost
G2H event that holds a wakeref (which would be
indicative of a bug elsewhere in the driver),
driver will at least complete the suspend-resume
cycle, (albeit not hitting all the targets for
low power hw counters), instead of hanging in the kernel.
alan:snip

   {
+	int timeout_ms = CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT ? : 10000;

CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT is in ns so assigning it to _ms is
a bit to arbitrary.

Why not the existing I915_GT_SUSPEND_IDLE_TIMEOUT for instance?
alan: good catch, my bad. However I915_GT_SUSPEND_IDLE_TIMEOUT is only half a sec
which i feel is too quick. I guess i could create a new #define or a multiple
of I915_GT_SUSPEND_IDLE_TIMEOUT?

   	/*
   	 * On rare occasions, we've observed the fence completion trigger
   	 * free_engines asynchronously via rcu_call. Ensure those are done.
@@ -308,7 +309,10 @@ static void wait_for_suspend(struct intel_gt *gt)
   		intel_gt_retire_requests(gt);
   	}
   
-	intel_gt_pm_wait_for_idle(gt);
+	/* we are suspending, so we shouldn't be waiting forever */
+	if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms) == -ETIMEDOUT)
+		gt_warn(gt, "bailing from %s after %d milisec timeout\n",
+			__func__, timeout_ms);

Does the timeout in intel_gt_pm_wait_timeout_for_idle always comes in
pair with the timeout first in intel_gt_wait_for_idle?
alan: Not necessarily, ... IIRC in nearly all cases, the first wait call
complete in time (i.e. no more ongoing work) and the second wait
does wait only if the last bit of work 'just-finished', then that second
wait may take a touch bit longer because of possible async gt-pm-put calls.
(which appear in several places in the driver as part of regular runtime
operation including request completion). NOTE: this not high end workloads
so the

Also, is the timeout here hit from the intel_gt_suspend_prepare,
intel_gt_suspend_late, or can be both?

Main concern is that we need to be sure there are no possible
ill-effects, like letting the GPU/GuC scribble on some memory we
unmapped (or will unmap), having let the suspend continue after timing
out, and not perhaps doing the forced wedge like wait_for_suspend() does
on the existing timeout path.
alan: this will not happen because the held wakeref is never force-released
after the timeout - so what happens is the kernel would bail the suspend.

How does it know to fail the suspend when there is no error code 
returned with this timeout? Maybe a stupid question.. my knowledge of 
suspend-resume paths was not great even before I forgot it all.

Would it be possible to handle the lost G2H events directly in the
respective component instead of here? Like apply the timeout during the
step which explicitly idles the CT for suspend (presumably that
exists?), and so cleanup from there once declared a lost event.
alan: So yes, we don't solve the problem with this patch - Patch#2
is addressing the root cause - and verification is still ongoing - because its
hard to thoroughly test / reproduce. (i.e. Patch 2 could get re-rev'd).

Intent of this patch, is to be informed that gt-pm wait failed in prep for
suspending and kernel will eventually bail the suspend, that's all.
Because without this timed-out version of gt-pm-wait, if we did have a leaky
gt-wakeref, kernel will hang here forever and we will need to get serial
console or ramoops to eventually discover the kernel's cpu hung error with
call-stack pointing to this location. So the goal here is to help speed up
future debug process (if let's say some future code change with another
async gt-pm-put came along and caused new problems after Patch #2 resolved
this time).

Recap: so in both cases (original vs this patch), if we had a buggy gt-wakeref leak,
we dont get invalid guc-accesses, but without this patch, we wait forever,
and with this patch, we get some messages and eventually bail the suspend.

It is not possible to wait for lost G2H in something like 
intel_uc_suspend() and simply declare "bad things happened" if it times 
out there, and forcibly clean it all up? (Which would include releasing 
all the abandoned pm refs, so this patch wouldn't be needed.)

Regards,

Tvrtko