Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 14/11/2023 19:48, Teres Alexis, Alan Previn wrote:
On Tue, 2023-11-14 at 17:52 +0000, Tvrtko Ursulin wrote:
On 14/11/2023 17:37, Teres Alexis, Alan Previn wrote:
On Tue, 2023-11-14 at 17:27 +0000, Tvrtko Ursulin wrote:
On 13/11/2023 17:57, Teres Alexis, Alan Previn wrote:
On Wed, 2023-10-25 at 13:58 +0100, Tvrtko Ursulin wrote:
On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote:
On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:

alan:snip

alan: So i did trace back the gt->wakeref before i posted these patches and
see that within these runtime get/put calls, i believe the first 'get' leads
to __intel_wakeref_get_first which calls intel_runtime_pm_get via rpm_get
helper and eventually executes a pm_runtime_get_sync(rpm->kdev); (hanging off
i915_device). (naturally there is a corresponding mirros for the '_put_last').
So this means the first-get and last-put lets the kernel know. Thats why when
i tested this patch, it did actually cause the suspend to abort from kernel side
and the kernel would print a message indicating i915 was the one that didnt
release all refs.

Ah that would be much better then.

Do you know if everything gets resumed/restored correctly in that case
or we would need some additional work to maybe early exit from callers
of wait_for_suspend()?
alan: So assuming we are still discussing about a "potentially new
future leaked-wakeref bug" (i.e. putting aside the fact that
Patch #1 + #2 resolves this specific series' bug), based on the
previous testing we did, after this timeout-bail trigger,
the suspend flow bails and gt/guc operation does actually continue
as normal. However, its been a long time since we tested this so
i am not sure of how accidental-new-future bugs might play to this
assumption especially if some other subsystem that leaked the rpm
wakref but that subsystem did NOT get reset like how GuC is reset
at the end of suspend.


What I would also ask is to see if something like injecting a probing
failure is feasible, so we can have this new timeout exit path
constantly/regularly tested in CI.
alan: Thats a good idea. In line with this, i would like to point out that
rev6 of this series has been posted but i removed this Patch #3. However i did
post this Patch #3 as a standalone patch here: https://patchwork.freedesktop.org/series/126414/
as i anticipate this patch will truly help with future issue debuggability.

That said, i shall post a review on that patch with your suggestion to add
an injected probe error for the suspend-resume flow and follow up on that one
separately.

Cool! I don't know exactly how to do it but if we come up with way and so gain IGT coverage then I am okay with the patch in principle.

Like perhaps some new debugfs api needs to be added to provoke the timeout error path on suspend, or something.

Regards,

Tvrtko



Regards,

Tvrtko

alan: Anyways, i have pulled this patch out of rev6 of this series and created a
separate standalone patch for this patch #3 that we review independently.





[Index of Archives]     [Linux DRI Users]     [Linux Intel Graphics]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [XFree86]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux